Modifying results – Wikidata Query Service Tutorial

Modifying your results

Posted on February 6, 2020 (updated November 25, 2020) by admin

In all the queries we have seen until now, all items that matched certain patterns were selected and shown in the results. But quite often we are only interested in certain results, or some aspect of them. In this section you will learn a few keywords and functions that will help you get more out of your data.

Our queries so far had two clauses: the SELECT clause where we stated the variables that will be shown in the results, and the WHERE clause in which the patterns for matching were given.

#Star Wars Films
SELECT ?item  
WHERE 
{
  ?item wdt:P179 wd:Q22092344.
}

In the next section you will learn to modify your query results by adding keywords and functions to the SELECT clause, to the WHERE clause, or after the WHERE clause. In fact, you may have already encountered such an element in previous sections – the DISTINCT modifier which can be added to the SELECT clause to remove duplicates from the results.

COUNT

Posted on February 9, 2020 (updated November 25, 2020) by admin

The aggregate function COUNT answers the question of how many items match a certain pattern. In such cases we are not interested in the matching items themselves, only in their number.

Suppose we wanted to know how many Wikidata items there are about women chemists. In principle, we could run a query as follows:
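The query itself is not reproduced in this text. A minimal sketch of what it might look like follows, modelled on the actresses query further down; the item ID Q593644 for "chemist" is an assumption and should be checked on Wikidata before use.

```sparql
#Women chemists on Wikidata (sketch)
SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P31 wd:Q5.            # Item is instance of human
  ?item wdt:P21 wd:Q6581072.      # Item is a woman
  ?item wdt:P106 wd:Q593644.      # Occupation is chemist (assumed item ID)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
```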

Above the results table, WDQS shows how many items were found and how many milliseconds the query took to run.

This strategy won’t always work, however. Suppose we wanted to know how many Wikidata items there are about actresses. In principle, we could edit the query as follows:

#Actresses on Wikidata
SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P31 wd:Q5.            # Item is instance of human
  ?item wdt:P21 wd:Q6581072.      # Item is a woman
  ?item wdt:P106 wd:Q33999.       # Occupation is actor
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

WDQS would then return all matching items and report how many were found. However, because the number of matching items is very large, the query times out. WDQS enforces a query deadline of 60 seconds, and any query that takes longer than this to execute is aborted with a timeout error.

To answer how many actresses there are on Wikidata, we declare a variable in the SELECT clause; in this example we call it ?actresscount. The variable is defined as the COUNT of the items that match the patterns in the WHERE clause:
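A sketch of such a counting query, reusing the three patterns from the actresses query above. The label service is left out, since no labels are needed when only a number is returned.

```sparql
#Number of actresses on Wikidata (sketch)
SELECT (COUNT(?item) AS ?actresscount)
WHERE {
  ?item wdt:P31 wd:Q5.            # Item is instance of human
  ?item wdt:P21 wd:Q6581072.      # Item is a woman
  ?item wdt:P106 wd:Q33999.       # Occupation is actor
}
```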

This query has only one result: the variable ?actresscount, whose value is the count of all items matching the search pattern (human, female, and actor).

Exercise: COUNT

Posted on February 10, 2020 (updated November 29, 2020) by admin

Exercise: How many tennis players (Q10833314) are there on Wikidata?

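One possible solution, as a sketch. It assumes that tennis players are identified by their occupation (P106), in the same way as the actresses above; the variable name ?count is arbitrary.

```sparql
#Number of tennis players on Wikidata (sketch)
SELECT (COUNT(?item) AS ?count)
WHERE {
  ?item wdt:P106 wd:Q10833314.    # Occupation is tennis player
}
```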

Another exercise

Posted on February 10, 2020 (updated November 29, 2020) by admin

Exercise: How many films are there on Wikidata?

Hint: Make sure that your query doesn’t only retrieve films, but also items that are subclasses of films.

Tip: use DISTINCT to remove duplicates from the results

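One possible solution, as a sketch. The property path wdt:P31/wdt:P279* follows "instance of" and then any number of "subclass of" links, and DISTINCT keeps an item that reaches the film class by several paths from being counted more than once.

```sparql
#Number of films on Wikidata (sketch)
SELECT (COUNT(DISTINCT ?item) AS ?count)
WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11424.   # Item is instance of film or a subclass thereof
}
```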

GROUP BY

Posted on February 10, 2020 (updated November 26, 2020) by admin

Now that you have learned to use an aggregate function like COUNT, it’s time to combine it with the SPARQL modifier GROUP BY.
GROUP BY divides the results into groups that share the same value for one or more variables, so that aggregate functions like COUNT are computed separately for each group.

Take for example the query retrieving the number of tennis players on Wikidata. In the following query we add a variable – ?gender – and ask WDQS to show the results grouped by this variable.
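A sketch of what such a grouped query might look like. It assumes the tennis player count is based on occupation (P106) and gender on P21; the ?genderLabel variable is added here only to make the groups readable.

```sparql
#Number of tennis players on Wikidata, per gender (sketch)
SELECT ?gender ?genderLabel (COUNT(?item) AS ?count)
WHERE {
  ?item wdt:P106 wd:Q10833314.    # Occupation is tennis player
  ?item wdt:P21 ?gender.          # Gender of the item, bound to ?gender
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
GROUP BY ?gender ?genderLabel     # One result row per gender
```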

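Note that every variable in the SELECT clause that is not wrapped in an aggregate function must also be listed in the GROUP BY modifier. The grouping variable should normally be selected as well, otherwise the results do not show which count belongs to which group.

You may notice that the sum of the counts of all groups is lower than the total number counted in the original query (without the gender grouping). This is because the pattern only matches items that have a gender statement, and there are in fact a few tennis players for which no gender information is given. (See OPTIONAL clause)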

Exercise: GROUP BY

Posted on February 12, 2020 (updated November 29, 2020) by admin

Exercise: Write a query that selects films that were nominated for an award, and the number of nominations per film.

Hint: Find films that have a P1411 (nominated for) property.

Tip: Make sure that your query doesn’t only retrieve films, but also items that are subclasses of films. Use DISTINCT to remove duplicates from the results.
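One possible solution, as a sketch. It is the same query that the Qualifiers section revisits later, without the HAVING and ORDER BY modifiers.

```sparql
#Films that were nominated for an award, and the count of nominations (sketch)
SELECT ?item ?itemLabel (COUNT(DISTINCT ?award) AS ?count)
WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11424.     # Item is instance of a film or subclass thereof
  ?item wdt:P1411 ?award.                # Item was nominated for an award, collected by ?award
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
GROUP BY ?item ?itemLabel                # The results are shown per film
```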

Note that the DISTINCT modifier is applied to the award, not to the item. If an item has more than one statement with P31 (instance of), and it has more than one path to the film class (e.g., an item is both a film, and a 3D film, which is a subclass of film), DISTINCT will ensure the award nominations are counted only once.

If you want to see which film has had the most award nominations, scroll to the results after running the query, and click on the “count” column at the top of the table, which will order the results according to this column. In the next section you will learn how to order the results in the query itself.

ORDER BY

Posted on February 12, 2020 (updated November 29, 2020) by admin

ORDER BY is a modifier that sorts the results according to a certain variable or expression. The order specified can be either ascending (ASC) or descending (DESC). Ascending is the default when neither is given.

Take for example the next query which shows the number of award nominations per film, in descending order:
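A sketch of that query: the nominations-per-film query with an ORDER BY modifier placed after GROUP BY.

```sparql
#Films that were nominated for an award, sorted by number of nominations (sketch)
SELECT ?item ?itemLabel (COUNT(DISTINCT ?award) AS ?count)
WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11424.     # Item is instance of a film or subclass thereof
  ?item wdt:P1411 ?award.                # Item was nominated for an award, collected by ?award
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
GROUP BY ?item ?itemLabel                # The results are shown per film
ORDER BY DESC(?count)                    # In descending order
```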

HAVING

Posted on February 12, 2020 (updated November 29, 2020) by admin

HAVING is a modifier that filters out groups that do not meet a specified condition. It is normally used in combination with GROUP BY.

For example, we can edit the query which shows the number of award nominations per film, so it shows only films that have had more than 5 nominations.

Note that the HAVING modifier must come directly after GROUP BY.
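A sketch of the filtered query. The HAVING line uses the same form as the query shown in the Qualifiers section, referring to the ?count alias defined in the SELECT clause.

```sparql
#Films that were nominated for more than 5 awards (sketch)
SELECT ?item ?itemLabel (COUNT(DISTINCT ?award) AS ?count)
WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11424.     # Item is instance of a film or subclass thereof
  ?item wdt:P1411 ?award.                # Item was nominated for an award, collected by ?award
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
GROUP BY ?item ?itemLabel                # The results are shown per film
HAVING (?count > 5)                      # Only films with more than 5 nominations
```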

LIMIT

Posted on February 18, 2020 (updated November 29, 2020) by admin

The LIMIT modifier can be used after the WHERE clause to cap the number of results returned. Combined with ORDER BY, it returns the top results of a ranking.

For example, the following query shows the 10 highest-grossing films that are part of a series.
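A sketch of what such a query might look like. It assumes that box office takings are stored in P2142 and that series membership is expressed with P179 (the property used in the Star Wars example). A film with several box office statements, or with amounts in different currencies, would need extra handling that is not shown here.

```sparql
#Highest-grossing films that are part of a series (sketch)
SELECT ?item ?itemLabel ?boxoffice
WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11424.     # Item is instance of a film or subclass thereof
  ?item wdt:P179 ?series.                # Item is part of some series
  ?item wdt:P2142 ?boxoffice.            # Box office amount (assumed property)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY DESC(?boxoffice)                # Highest amounts first
LIMIT 10                                 # Keep only the first 10 rows
```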

Note that the GROUP BY modifier is only useful in combination with an aggregate function (such as COUNT). The modifiers ORDER BY and LIMIT do not require an aggregate function.

Qualifiers

Posted on February 18, 2020 (updated November 26, 2020) by admin

Let’s look again at the query about films with more than 5 nominations.

# Films that were nominated for more than 5 awards, and the count of nominations.
SELECT ?item ?itemLabel (COUNT(DISTINCT ?award) AS ?count)
WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11424.     # Item is instance of a film or subclass thereof
  ?item wdt:P1411 ?award.                # Item was nominated for an award, collected by ?award
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
GROUP BY ?item ?itemLabel                # The results are shown per film
HAVING (?count > 5)                      # Only films with more than 5 nominations
ORDER BY DESC(?count)                    # In descending order

This query counts awards of any kind. How can we edit the query to show the films with the most Academy Award nominations?

We could specify that the award should be an instance of an Academy Award:
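A sketch of that restriction, added as one extra pattern on ?award. The item ID Q19020 for the Academy Awards is an assumption and should be checked on Wikidata before use.

```sparql
#Films and their count of Academy Award nominations, direct property only (sketch)
SELECT ?item ?itemLabel (COUNT(DISTINCT ?award) AS ?count)
WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11424.     # Item is instance of a film or subclass thereof
  ?item wdt:P1411 ?award.                # Item was nominated for an award, collected by ?award
  ?award wdt:P31 wd:Q19020.              # The award is an instance of Academy Awards (assumed item ID)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
GROUP BY ?item ?itemLabel                # The results are shown per film
ORDER BY DESC(?count)                    # In descending order
```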

However, some films could be nominated twice for the same award. For example, the film “All About Eve” (Q200299) had two nominations for the Best Actress award.

If our query relies only on the “nominated for” property and its value, the two nominations produce the same ?award value and are therefore counted only once.

If you have already learned how to query statements with qualifiers, then you know that in the Wikidata data model, for every direct property (wdt:) linking an item to a value, there is also a property with the p: prefix that connects the item to a statement node. Each award nomination has its own statement node. To retrieve the correct number of Academy Award nominations per film, the query needs to count, for each film, how many “nominated for” statement nodes have a value (i.e., the award) that is an Academy Award:
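A sketch of the statement-based count. It uses the p: prefix to reach the statement node and the ps: prefix to read the statement's value. As before, Q19020 for the Academy Awards is an assumed item ID, and the variable name ?statement is arbitrary.

```sparql
#Films and their count of Academy Award nominations, counted per statement (sketch)
SELECT ?item ?itemLabel (COUNT(DISTINCT ?statement) AS ?count)
WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11424.     # Item is instance of a film or subclass thereof
  ?item p:P1411 ?statement.              # Each "nominated for" statement node of the item
  ?statement ps:P1411 ?award.            # The value of that statement, i.e. the award
  ?award wdt:P31 wd:Q19020.              # The award is an instance of Academy Awards (assumed item ID)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
GROUP BY ?item ?itemLabel                # The results are shown per film
ORDER BY DESC(?count)                    # In descending order
```

DISTINCT is applied to ?statement here so that a film reaching the film class by more than one path still has each nomination statement counted once.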

Now the query counts the award statements of awards that are an instance of an Academy award. Note that the nominees themselves are qualifiers to each award statement, but are not relevant to the count of the query.