Three ways to write the same query
Let’s say you want to find all patents published in the last 10 years that have the word “dog” in their titles or abstracts, and whose assignees are located in either the US or Canada. Here are three ways you could write such a query:
- Use a string:
query_v_1 <-
'{"_and":[
{"_gte":{"patent_date":"2007-03-01"}},
{"_or":[
{"_text_all":{"patent_title":"dog"}},
{"_text_all":{"patent_abstract":"dog"}}
]},
{"_or":[
{"_eq":{"assingee_country":"US"}},
{"_eq":{"assingee_country":"CA"}}
]}
]}'
- Use a list:
query_v_2 <-
list("_and" =
list(
list("_gte" = list(patent_date = "2007-03-01")),
list("_or" =
list(
list("_text_all" = list(patent_title = "dog")),
list("_text_all" = list(patent_abstract = "dog"))
)
),
list("_or" =
list(
list("_eq" = list(assingee_country = "US")),
list("_eq" = list(assingee_country = "CA"))
)
)
)
)
- Use the
patentsview
domain specific language (DSL):
library(patentsview)
query_v_3 <-
with_qfuns(
and(
gte(patent_date = "2007-03-01"),
or(
text_all(patent_title = "dog"),
text_all(patent_abstract = "dog")
),
eq(assingee_country = c("US", "CA"))
)
)
Why use the DSL?
We can see that all three versions of the query shown above are equivalent:
jsonlite::minify(query_v_1)
#> {"_and":[{"_gte":{"patent_date":"2007-03-01"}},{"_or":[{"_text_all":{"patent_title":"dog"}},{"_text_all":{"patent_abstract":"dog"}}]},{"_or":[{"_eq":{"assingee_country":"US"}},{"_eq":{"assingee_country":"CA"}}]}]}
jsonlite::toJSON(query_v_2, auto_unbox = TRUE)
#> {"_and":[{"_gte":{"patent_date":"2007-03-01"}},{"_or":[{"_text_all":{"patent_title":"dog"}},{"_text_all":{"patent_abstract":"dog"}}]},{"_or":[{"_eq":{"assingee_country":"US"}},{"_eq":{"assingee_country":"CA"}}]}]}
jsonlite::toJSON(query_v_3, auto_unbox = TRUE)
#> {"_and":[{"_gte":{"patent_date":"2007-03-01"}},{"_or":[{"_text_all":{"patent_title":"dog"}},{"_text_all":{"patent_abstract":"dog"}}]},{"_or":[{"_eq":{"assingee_country":"US"}},{"_eq":{"assingee_country":"CA"}}]}]}
…So why would you ever want to use method 3 over methods 1 or 2? There are two main reasons:
1. Query validation
search_pv()
will check your query for errors if you use
methods 2 or 3. This is not the case for method 1, where you would have
to rely on the API’s error messages for guidance if your query is
invalid. search_pv()
checks queries for the following:
- The fields included in your query are queryable for the
endpoint (i.e., the field can be used in the user query). For example,
it would make sure that
assingee_country
can be used in the query argument if you sent the above query to the patents endpoint. - The fields in your query are compatible with the comparison
operators you used. For example, it would confirm that the
text_all
operator was used with a field whose type was “full text” (patent_title
above). - You supplied the correct value type for the field (e.g.,
patent_date
is a character, not an integer).
2. Concise, easy to use syntax for complex queries
Methods 1 and 3 are both shorter than method 2, making them quicker.
It’s also a lot easier to get the JSON syntax correct when using method
3 compared to method 1, because you don’t have to write any JSON at all
using the DSL…This is important because the API is fairly picky about
the query syntax, so it’s not trivial to get it correct. For example,
the API will throw an error if you use a box in your JSON when is not
absolutely necessary, even if your query is still valid JSON (e.g.,
query = {"_gte":{"patent_date":["2007-03-01"]}}
will throw
an error).
Compared to method 1, method 3 will correctly “or” together values if
you put them in a vector. For example, in the query shown above, a
vector of two values was given for assingee_country
(c("US", "CA")
). This safely converted the single “equals”
statement in the third element of the query
(eq(assingee_country = c("US", "CA"))
) to two separate
equals statements that got or’d together.1
Basics of the language
All of the functions that make up the DSL are found in the
qry_funs
list (e.g., qry_funs$eq()
). You can
evaluate code in the context of this list using the function
with_qfuns()
(see ?with_qfuns()
for an example
that demonstrates how with_qfuns()
can save you typing).
There are three types of functions in qry_funs
:
-
Comparison operator functions (
eq
,neq
,gt
,gte
,lt
,lte
,begins
,contains
,text_all
,text_any
,text_phrase
). These functions are used to compare a field to a value. For example, using the “less than or equal to” function (lte
), we can filter out patents published after some date (e.g.,query = qry_funs$lte(patent_date = "2001-01-05")
). See the “comparison operators” section of the API’s query language page for a description of the 11 comparison operators. One important thing to keep in mind is that certain comparison operators only work with certain data types. For example, you can’t use thebegins
function onpatent_abstract
becausepatent_abstract
is of data type “full text” andbegins
only works with fields of data type “string.” -
Array functions (
and
andor
). You can use these functions to logically combine your calls to the comparison operators. For example, we can require that the patent date is less than or equal to 2001-01-05 and the inventor’s last name is “Ihaka” (query = with_qfuns(and(lte(patent_date = "2001-01-05"), eq(inventor_last_name = "Ihaka")))
). -
not function (
not
). This function negates a comparison. For example, we could search for patents that don’t have the word “hi” in their titles like this:qry_funs$not(qry_funs$text_phrase(patent_title = "hi"))
.
Query examples
The following queries are intended for the patents endpoint
Patents linked to an assignee with 10 or fewer distinct (and disambiguated) inventors:
qry_funs$lte(assignee_total_num_inventors = 10)
#> {"_lte":{"assignee_total_num_inventors":10}}
Patents assigned to the “CPC subsection”2 of G12 (physics instruments):
qry_funs$eq(cpc_subsection_id = "G12")
#> {"_eq":{"cpc_subsection_id":"G12"}}
Patents that:
- Have an inventor listed on them whose first name contains “joh” AND
- Have an abstract containing either the phrase “dog bark” or “cat meow” AND
- Have an abstract that doesn’t have the phrase “dog chain” in it:
with_qfuns(
and(
contains(rawinventor_first_name = "joh"),
text_phrase(patent_abstract = c("dog bark", "cat meow")),
not(
text_phrase(patent_abstract = c("dog chain"))
)
)
)
#> {"_and":[{"_contains":{"rawinventor_first_name":"joh"}},{"_or":[{"_text_phrase":{"patent_abstract":"dog bark"}},{"_text_phrase":{"patent_abstract":"cat meow"}}]},{"_not":{"_text_phrase":{"patent_abstract":"dog chain"}}}]}
Patents that:
- Have an inventor listed on them whose last name is “Smith” AND
- Have “cotton gin” in their title
OR
- Have an inventor listed on them whose last name is “Hopper” AND
- Have “COBOL” in their title
with_qfuns(
or(
and(
eq(inventor_last_name = "smith"),
text_phrase(patent_title = "cotton gin")
),
and(
eq(inventor_last_name = "hopper"),
text_phrase(patent_title = "COBOL")
)
)
)
#> {"_or":[{"_and":[{"_eq":{"inventor_last_name":"smith"}},{"_text_phrase":{"patent_title":"cotton gin"}}]},{"_and":[{"_eq":{"inventor_last_name":"hopper"}},{"_text_phrase":{"patent_title":"COBOL"}}]}]}