Stored Query Service

This page discusses the Stored Query Service, which enables users to run stored queries as subqueries.

Page Contents
  1. Overview
  2. Path Subqueries
  3. Correlated Subqueries
  4. Limit and offset modifiers
  5. Defining Dataset for Subqueries

Overview

Stardog can invoke stored queries, including Path Queries, in the context of another SPARQL query using the SERVICE keyword. The Stored Query Service (SQS) was released as beta in Stardog 7.3.2 and has been generally available (GA) since version 7.4.0. Previous versions of Stardog already used the SPARQL service mechanism to support Full-Text Search and Entity Extraction; SQS extends it naturally to stored queries.

Suppose the following query is stored with the name “cities”:

$ stardog-admin stored add -n "cities" "SELECT ?country ?city { ?city :locatedIn ?country }"

Then it is possible to use it as a named subquery in another query:

prefix sqs: <tag:stardog:api:sqs:>

SELECT ?person ?city ?country {
    SERVICE <query://cities> { [] sqs:vars ?country, ?city }
    ?person :from ?city
}

This query uses the “cities” query to look up information about the country given the city where a person lives. It is similar to using a Wikidata endpoint or an explicit subquery except that the subquery is referenced by name. The same query with an explicit subquery would look like this:

SELECT ?person ?city ?country {
    {
        SELECT ?country ?city {
            ?city :locatedIn ?country
        }
    }
    ?person :from ?city
}

Invoking stored queries by name has the major benefit of avoiding duplication of their query strings. Stored queries become reusable building blocks maintained in one place rather than copy-pasted across the many queries that use them.

The body pattern of SERVICE <query://name> { ... } specifies which variables of the stored query are used in the outer scope of the calling query. The sqs:vars predicate is a shortcut that is useful when stored query variables retain their names. However, it’s also possible to map stored query variable names to other identifiers to avoid naming conflicts:

prefix sqs: <tag:stardog:api:sqs:>

SELECT ?person ?city ?livesIn ?country {
    SERVICE <query://cities> {
        []  sqs:var:city ?livesIn ;
            sqs:var:country ?country
    }
    ?person :from ?livesIn ;
            :born ?city
}

Furthermore, it’s possible to statically bind some stored query variables to constants so the query would behave like a parameterized view:

prefix sqs: <tag:stardog:api:sqs:>

SELECT ?city ?country {
    SERVICE <query://cities> {
        []  sqs:var:city ?city ;
            sqs:var:country :The_United_States
    }
}
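
The constant binding can also be combined with ordinary triple patterns in the calling query. The following is a minimal sketch, assuming the “cities” query stored earlier and a hypothetical :Canada resource that is not part of the original examples. It lists people who are from a city located in that country:

```sparql
prefix sqs: <tag:stardog:api:sqs:>

# Sketch only: :Canada is a hypothetical IRI used for illustration.
SELECT ?person ?city {
    SERVICE <query://cities> {
        []  sqs:var:city ?city ;
            sqs:var:country :Canada
    }
    ?person :from ?city
}
```

Only the constant needs to change to reuse the same stored query for another country.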

Path Subqueries

Another interesting feature is the ability to call path queries from SELECT/CONSTRUCT/ASK queries. A path query cannot be used directly as a subquery because it does not return SPARQL binding sets, aka solutions (we discussed that issue in an earlier blog post on Extended Solutions). However, this service circumvents that restriction:

prefix sqs: <tag:stardog:api:sqs:>

SELECT ?start (count(*) as ?paths) {
    SERVICE <query://paths> {
        [] sqs:vars ?start
    }
} GROUP BY ?start

The stored path query returns paths (according to some VIA pattern) and uses ?start as the start node variable. The main query aggregates the returned paths by the start node and returns the number of paths for each. In contrast to the earlier SELECT example, this would not be possible directly because path queries cannot be used as subqueries.
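
For illustration, a stored “paths” query of this kind might look like the following minimal sketch. The :connectedTo predicate and the ?end variable are assumptions. The text above only states that the stored query uses ?start as its start node variable:

```sparql
# Hypothetical body of the stored "paths" query; the predicate name is illustrative.
PATHS START ?start END ?end VIA {
    ?start :connectedTo ?end
}
```

Such a query could be stored under the name “paths” with the same stardog-admin stored add command shown in the Overview.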

One should be aware of the potential explosive nature of path queries when using them through the stored query service. They can return a very high number of paths to be joined or aggregated and thus create substantial memory pressure on the server.

Stardog supports a set of SPARQL functions in the tag:stardog:api:functions: namespace which take paths as arguments. They are summarized in the following table:

Local name | Arguments                        | Returned value                                                           | Stardog version
-----------|----------------------------------|--------------------------------------------------------------------------|----------------
:nodes     | path                             | an array of all path nodes                                               | 7.3.2+
:length    | path                             | the length of the path                                                   | 7.3.2+
:any       | path, boolean expression         | true, if the expression holds for at least one edge in the path          | 7.4.4+
:all       | path, boolean expression         | true, if the expression holds for all edges in the path                  | 7.4.4+
:reduce    | path, initial value, accumulator | the result of aggregation along the path starting from the initial value | 7.9.0+

Below are some examples of how the functions can be used in SPARQL. Using stardog:length is straightforward:

prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>

# return the average length of paths grouped by the start node
SELECT ?start (avg(stardog:length(?path)) as ?avg_length) {
    SERVICE <query://paths> {
        [] sqs:vars ?start, ?path
    }
} GROUP BY ?start

:all and :any are second-order functions which evaluate the boolean predicate over path edges:

SELECT (str(stardog:nodes(?path)) as ?nodes) {
    SERVICE <query://paths> {
        [] sqs:vars ?path
    }
    FILTER(stardog:all(?path, ?attribute = 10))
}

Here ?attribute is a variable occurring in the VIA pattern of the stored path query. stardog:all returns true if the ?attribute = 10 condition is true for all edges in the path. The second argument can be an arbitrary SPARQL expression. stardog:any is the counterpart function, which returns true if the condition is true for at least one edge. It is particularly useful for querying paths that must pass through one or more particular nodes in the graph.
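
As a sketch of that last use case, the query below keeps only the paths in which at least one edge ends in a given node. It assumes that the stored path query names the end node of each edge ?end in its VIA pattern and that :Hub is the node of interest. Both names are hypothetical:

```sparql
prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>

# Sketch only: ?end and :Hub are assumed names, not taken from the stored query.
SELECT (str(stardog:nodes(?path)) as ?nodes) {
    SERVICE <query://paths> {
        [] sqs:vars ?path
    }
    FILTER(stardog:any(?path, ?end = :Hub))
}
```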

:reduce is a generalization of both :all and :any and corresponds to the well-known “reduce” (aka “fold”) function in functional programming languages. Evaluation of :reduce(path, initial, accumulator) is equivalent to the following pseudo-code:

value result = initial
for (edge in path) {
  Solution input = add(edge, ?_ = result)
  result = accumulator(input)
}
return result

That is, the accumulator expression is evaluated on each edge of the path with one extra variable binding that holds the partial result. The name of the partial result variable does not matter but by convention it is the first variable that occurs in the accumulator expression. Consider the example where ?_ is used for the partial result variable name:

SELECT (str(stardog:nodes(?path)) as ?nodes) ?path_average {
    SERVICE <query://paths> {
        [] sqs:vars ?path
    }
    BIND(stardog:reduce(?path, 0, ?_ + ?attribute) as ?path_total)
    BIND(?path_total / stardog:length(?path) as ?path_average)
}

Here again, ?attribute is a variable bound in the VIA pattern, while :reduce and :length are used to compute the mean ?attribute value for each path. Other aggregations over paths, e.g. max or min, can be computed similarly. Note that :reduce is strictly more expressive than the :all, :any, and :length functions. For example, :any(?path, ?attribute = 10) is equivalent to :reduce(?path, false, ?_ || (?attribute = 10)).
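
For instance, the maximum ?attribute value along each path could be computed with a conditional accumulator. This is a sketch under two assumptions: ?attribute is numeric and non-negative (so 0 is a safe initial value), and ?_ is picked up as the partial result because it is the first variable in the accumulator expression:

```sparql
prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>

SELECT (str(stardog:nodes(?path)) as ?nodes) ?path_max {
    SERVICE <query://paths> {
        [] sqs:vars ?path
    }
    # keep the larger of the partial result (?_) and the current edge's value
    BIND(stardog:reduce(?path, 0, if(?_ > ?attribute, ?_, ?attribute)) as ?path_max)
}
```

By the same reasoning, :all(?path, ?attribute = 10) should correspond to :reduce(?path, true, ?_ && (?attribute = 10)).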

Correlated Subqueries

By default, evaluation of subqueries referenced through SQS is subject to the standard bottom-up SPARQL semantics. Specifically, they are evaluated once and their results are joined with other query patterns in the same scope (or Group Graph Pattern, or {} in SPARQL). In other words, just as for standard subqueries in SPARQL, the evaluation is uncorrelated as the subquery cannot use values of variables from the outer query. Consider the following example:

SELECT * {
  ?person :hasAge ?age
  FILTER (?age > ?majorityAge) 
}

This query is parameterized on the value of ?majorityAge and is meant to select all adults, i.e. people whose age exceeds their respective age of majority. Since ?majorityAge is not bound anywhere in this query, running it as-is will return empty results on any data. Using it as a standard, uncorrelated subquery will therefore never achieve the intended results, as in the following example (the filter in the subquery will never evaluate to true):

# this won't return desired results!
SELECT ?country (count(?person) as ?c) {
  {
    SELECT * {
      ?person :hasAge ?age
      FILTER (?age > ?majorityAge)
    }
  }
  ?country :hasMajorityAge ?majorityAge  
} GROUP BY ?country

Overcoming this limitation with standard SPARQL subqueries requires workarounds, such as moving the filter outside of the subquery or doing the loop over countries in a separate query. This is often inconvenient and may cause loss of performance (for example, enabling the Literal Index could make the filter in the subquery more efficient if ?majorityAge is bound). These issues are well-known and traditionally addressed by correlated subqueries, which are executed once for each tuple of values of variables bound in the outer query.
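
The first workaround can be sketched in plain SPARQL as follows. The subquery no longer references ?majorityAge, and the filter is applied in the outer scope after the join, so every person is paired with every country before filtering:

```sparql
SELECT ?country (count(?person) as ?c) {
  {
    # the subquery only collects ages; it is still evaluated once
    SELECT ?person ?age {
      ?person :hasAge ?age
    }
  }
  ?country :hasMajorityAge ?majorityAge
  # ?majorityAge is bound in this scope, so the filter can now succeed
  FILTER (?age > ?majorityAge)
} GROUP BY ?country
```

This returns the intended counts, but the subquery can no longer use the value of ?majorityAge to restrict what it retrieves.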

SQS provides a way to indicate that the (stored) subquery is correlated on particular variables. This is done using the sqs:inputs predicate in the SERVICE pattern:

SELECT ?country (count(?person) as ?c) {
  SERVICE <query://persons-by-age> {
    [] sqs:inputs ?majorityAge ;
       sqs:vars ?person
  }
  ?country :hasMajorityAge ?majorityAge  
} GROUP BY ?country

Now the query engine will execute the stored subquery for each value of ?majorityAge generated by the outer triple pattern, i.e. ?country :hasMajorityAge ?majorityAge. This is visible in the query plan where the outer pattern is now an argument of a ServiceJoin operator which uses it to assign values to input variables (?majorityAge) before each execution of the service.

Group(by=[?country] aggregates=[(COUNT(*) AS ?c)]) [#1]
`─ ServiceJoin [#5.0K]
   +─ StoredQuery(persons-by-age: (?majorityAge) -> (?majorityAge, ?person)) {
   │  +─ Projection(?person, ?age, ?majorityAge)
   │  +─ `─ Filter(?age > ?majorityAge)
   │  +─    `─ Scan[SPO](?person, :age, ?age) [#1]
   │  }
   `─ Scan[PSO](?country, <http://api.stardog.com/majorityAge>, ?majorityAge) [#1]

Another classical example where correlated execution is essential is Top-K subqueries. Consider the following simple example: given payroll data, find the three highest-paid employees in each department. A naive attempt to do this in pure SPARQL will fall short of the goal:

# this won't return desired results!
SELECT ?dept ?emp {
  ?dept a :Department
  { SELECT ?emp {
      ?emp :worksIn ?dept ;
           :salary ?salary 
    } ORDER BY desc(?salary) LIMIT 3
  }
} ORDER BY ?dept

The issue is again that the subquery is executed once and independently of the rest of the query. It will return the three highest paid employees across all departments. The intention is of course to execute the subquery for each department after binding ?dept to a value matched by the outer query.

This can be achieved using SQS as follows:

SELECT ?dept ?emp {
  ?dept a :Department
  SERVICE <query://employees> {
    [] sqs:inputs ?dept ; sqs:vars ?emp
  }
} ORDER BY ?dept
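
The stored “employees” query is not shown here. A plausible body, reusing the subquery from the naive attempt, is sketched below. Projecting ?dept alongside ?emp is an assumption made so that the input variable is visible to the caller:

```sparql
# Hypothetical body of the stored "employees" query.
SELECT ?dept ?emp {
  ?emp :worksIn ?dept ;
       :salary ?salary
} ORDER BY desc(?salary) LIMIT 3
```

Run on its own, such a query returns the three highest-paid employees overall. Invoked with sqs:inputs ?dept, it is evaluated once per department with ?dept already bound.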

As of version 7.7.2, designating variables as inputs has the effect of implicitly treating all non-input variables as outputs. These output variables can be inputs to other service queries, but they cannot be bound by the service’s argument. The following query will fail:

# Unable to bind input variables for Stored Query Service call
SELECT ?dept ?emp {
  ?emp :worksIn ?dept
  SERVICE <query://employees> {
    [] sqs:inputs ?dept ; sqs:vars ?emp
  }
} ORDER BY ?dept

Just as for other variables, it’s possible to map input variables in the subquery to other variables of the main query, like ?dept to ?department in the following example. The engine will bind ?dept to the current value of ?department before each execution.

SELECT ?department ?emp {
  ?department a :Department
  SERVICE <query://employees> {
    [] sqs:inputs ?department ; sqs:var:dept ?department ; sqs:vars ?emp
  }
} ORDER BY ?department

Correlated execution is also supported for path subqueries. There it is particularly important given the potentially high number of returned paths if the subquery runs without inputs. In the following example the paths subquery is executed for each value of the start variable:

prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>

SELECT ?start (str(stardog:nodes(?path)) as ?pstr) {
   VALUES ?start { :X :Y :Z }
   SERVICE <query://paths> {
     [] sqs:inputs ?start ; sqs:vars ?path
   }
} ORDER BY ?start

It is similarly possible to indicate that variables appearing inside the VIA, START, or END patterns of a paths subquery are inputs.

Limit and offset modifiers

Another common theme for correlated execution is restricting the number of results returned for each input pattern. This can be done by applying LIMIT to the stored query itself; however, running it with a different limit would then require storing another query. Starting with version 7.7.2, it is possible to specify a limit (and offset) when invoking an existing query:

prefix sqs: <tag:stardog:api:sqs:>
prefix stardog: <tag:stardog:api:>

SELECT ?start (str(stardog:nodes(?path)) as ?pstr) {
   VALUES ?start { :X :Y :Z }
   SERVICE <query://paths> {
     [] sqs:inputs ?start ;
        sqs:vars ?path ;
        sqs:limit 3
   }
} ORDER BY ?start

If a limit or offset is defined in the stored query itself, the values passed when invoking the stored query service take precedence.
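
Applied to the Top-K example from the previous section, a call-site limit would let one stored query serve different values of K. The sketch below assumes that the “employees” stored query orders employees by descending salary; the value 5 is illustrative:

```sparql
prefix sqs: <tag:stardog:api:sqs:>

SELECT ?dept ?emp {
  ?dept a :Department
  SERVICE <query://employees> {
    # the limit given here takes precedence over any LIMIT in the stored query
    [] sqs:inputs ?dept ;
       sqs:vars ?emp ;
       sqs:limit 5
  }
} ORDER BY ?dept
```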

Defining Dataset for Subqueries

SPARQL does not allow specifying the query dataset for subqueries, i.e. one cannot use the FROM or FROM NAMED keywords in a subquery. Subqueries inherit the dataset from the main query. SQS enables two ways of defining the dataset for a subquery (in order of precedence):

  • directly in the SQS pattern using sqs:default-graph and sqs:named-graph predicates.
  • inside the stored query using the standard FROM and FROM NAMED keywords.

If none of the above is used, the stored subquery will inherit the dataset from the main query.

Both sqs:default-graph and sqs:named-graph predicates can be used to specify multiple graphs to define the default and the named part of the query dataset (just like FROM and FROM NAMED keywords can be used multiple times):

SELECT * {
  SERVICE <query://name> {
    [] sqs:default-graph :g1, :g2 ; 
       sqs:named-graph :g3, :g4
  }
}
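
These predicates can be combined with the variable mappings shown earlier. As a minimal sketch, assuming the “cities” query from the Overview and a hypothetical named graph :geo that holds the location data:

```sparql
prefix sqs: <tag:stardog:api:sqs:>

# Sketch only: :geo is a hypothetical graph IRI.
SELECT ?person ?city ?country {
    SERVICE <query://cities> {
        [] sqs:vars ?country, ?city ;
           sqs:default-graph :geo
    }
    ?person :from ?city
}
```

Here only the stored subquery should be evaluated against :geo, while the ?person :from ?city pattern continues to use the dataset of the main query.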

It is possible to use Named Graph Aliases both in stored subqueries and in the range of sqs:default-graph and sqs:named-graph predicates.