GROUP BY

GROUP BY

Now that you have learned to use an aggregate function like COUNT, it’s time to combine that with a SPARQL function called GROUP BY.
GROUP BY aggregates data in your dataset, making it easy to run functions like COUNT on groups of data for analysis .

Take for example the query retrieving the number of tennis players on Wikidata. In the following query we add a variable – ?gender – and ask WDQS to show the results grouped by this variable.

Note that the variable used in the GROUP BY modifier must be listed in the SELECT clause.

You may notice that if you sum the count of items of all groups it is lower than the total number counted in the original query (without the gender grouping). This is the case because there are in fact a few tennis players for which no gender information is given. (See OPTIONAL clause)

Exercise: GROUP BY

| Exercise: Write a query that selects films that were nominated for an award, and the number of nominations per film.

Hint: Find films that have a P1411 (nominated for) property.

Tip: Make sure that your query doesn’t only retrieve films, but also items that are subclasses of films. Use DISTINCT to remove duplicates from the results

show solution

Note that the DISTINCT modifier is applied to the award, not to the item. If an item has more than one statement with P31 (instance of), and it has more than one path to the film class (e.g., an item is both a film, and a 3D film, which is a subclass of film), DISTINCT will ensure the award nominations are counted only once.

If you want to see which film has had the most award nominations, scroll to the results after running the query, and click on the “count” column at the top of the table, which will order the results according to this column. In the next section you will learn how to order the results in the query itself.

ORDER BY

ORDER BY

ORDER BY is a modifier that sorts the results according to a certain variable or expression. The order specified can be either ascending (ASC) or descending (DESC).

Take for example the next query which shows the number of award nominations per film, in descending order:

HAVING

HAVING

HAVING is a modifier that filters groups that do not meet a specified condition. It is always used in combination with GROUP BY.

For example, we can edit the query which shows the number of award nominations per film, so it shows only films that have had more than 5 nominations.

Note that the HAVING modifier must come directly after GROUP BY.

LIMIT

LIMIT

The LIMIT modifier can be used after the WHERE clause to return a specific number of results.

For example, the following query shows the 10 highest-grossing films that are part of a series.

Note that the GROUP BY modifier only works with an aggregate function (such as COUNT). The modifiers ORDER BY and LIMIT do not require an aggregate function.

Qualifiers

Qualifiers

Let’s look again at the query about films with more than 5 nominations.

# Films that were nominated for more than 5 awards, and the count of nominations.

SELECT ?item ?itemLabel (COUNT(DISTINCT ?award) AS ?count)
WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11424.     # Item is instance of a film or subclass thereof
  ?item wdt:P1411 ?award.                # Item was nominated for an award, collected by ?award
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
  }
GROUP BY ?item ?itemLabel                # The results are shown per film
HAVING (?count>5)                        # Only films with more than 5 nominations
ORDER BY DESC(?count)                    # In descending order

This query counts awards of any kind. How can we edit the query to show the films with most Academy award nominations?

We could specify that the award should be an instance of an Academy award:

However, some films could be nominated twice for the same award. For example, the film “All About Eve” (Q200299) had two nominations for Best Actress award:

If our query only relies on the “nominated for” property and its value then the two nominations would be identical statements and therefore counted only once.

If you have already learned how to query statements with qualifiers then you know that in the Wikidata data model, for every direct property linking an item and a value, there is also a simple property (p) that connects the item to a statement node. For each award nomination there is a different statement node. In order to retrieve the correct number of Academy award nominations per film the query needs to count for each film how many “nominated for” statement nodes there are for which the the value (i.e., the award) is an Academy award:

Now the query counts the award statements of awards that are an instance of an Academy award. Note that the nominees themselves are qualifiers to each award statement, but are not relevant to the count of the query.

Exercise

| Exercise: Which actresses were nominated for more than 3 Academy awards, and how many nominations did each have?

show solution

One way to approach this question is to retrieve items of women who have “actor” as occupation (or a subclass thereof, such as “film actor”), and count the number of Academy Awards they have been nominated for. Note, however, that this includes all Academy Awards.

For example, actress Emma Thompson was also nominated for the Academy Award for Best Writing, Adapted Screenplay.

Another approach is to look at the winners of the “Academy Award for Best Actress” and “Academy Award for Best Supporting Actress”:

Excluding results

FILTER – excluding results

So far we saw modifiers that operated at the SELECT clause or after the WHERE clause. Another useful modifier is FILTER, which can appear within the WHERE clause.
We will demonstrate filtering in two ways: the FILTER function which selects only the items that match the filter condition (i.e., excludes items that do not) and FILTER NOT EXISTS which is a clause that excludes results that match certain patterns.

FILTER function

The FILTER function excludes items based on the expression that appears within the function’s parenthesis: only those results for which the expression in the parenthesis is true are used.

Let’s say we wanted to edit the query about the 10 highest-grossing films that are part of a series, so that the query will return only those films that were released before the year 2000.

On line 8, we added a match pattern to collect the value of P577 property (publication date) with the ?date variable. On the next line we used FILTER to only include items whose publication date is before 01-01-2000.

Note that we put the criterion date (2000-01-01) in quotes and add “^^xsd:dateTime” after to signify that the format of this value is a date.

YEAR function

Another way of writing this query is using the function YEAR. This built-in function returns the year part of a date (or a dateTime value):

FILTER NOT EXISTS

Let’s say we wanted to edit the query about actresses nominated for more than three Academy awards, so that the query will return only those actresses that never received an Academy award.

We add the FILTER NOT EXISTS clause, in which we specify the patterns that should NOT be matched, namely that the actress has a P166 property (award received) that has a value that is an instance (P31) of Academy Awards (Q19020):

In other words, the query excludes items that have statements that match the pattern given within the FILTER NOT EXISTS clause.

Exercise: FILTER

| Exercise: Write a query that shows sovereign states whose head of government was born in 1980 or later

Hint: select sovereign states (Q3624078) that have a head of government (P6) whose date of birth (P569) has the year 1980 or later, or who have their birthday 01-01-1980 or later.

show solution

Another way of writing this query is using the exact date format:

Skip to content