Learn SPARQL in 5 minutes and use it to query WikiData
- #Analytics

I'm working on Entity Linking and Knowledge Bases. In that context, exporting a relevant part of Wikidata can be really useful to build surface form dictionaries and coocurence probabilities etc... In order to know which part of Wikidata is relevant to dump, I thought we could query Wikidata (although it seems we can only download the entire dump and filter afterwards).
SPARQL is a language to formulate questions (queries) for knowledge databases. Therefore you can query Wikidata with SPARQL. At first sight, the syntax is not particularly easy and I've gone through this tutorial.
SparQL in 5 minutes๐
#is the comment character- The
SELECTclause lists variables that you want returned (variables start with a question mark) - The
WHEREclause contains restrictions on them, in the form of SPO triples (subject, predicate, object), e.g.?fruit hasColor yellow. - On Wikidata, items and properties are not identified by human-readable names like โhasColorโ (property) or โyellowโ (item).
- Instead, Wikidata items and properties are assigned an identifier, that you need to know beforehand.
- you can search for the itentifiers using
search termfor items andP:search termfor properties - but you should rely on autocompletion in query.wikidata.org by pressing Ctrl + Space
- Finally, you need to include prefix namespaces to query the WQDS (Wikidata Query Service). There are many prefixes, one for each namespace in SPARQL :
wd:for itemswdt:for properties, pointing to the object
- to doublecheck what prefix links to what resource, use https://prefix.cc/
- to get more than the Wikidata ID as selectable attributes, you need to include them in the WHERE clause using
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
Putting this all together we get:
SELECT ?fruit ?fruitLabel
WHERE
{
# fruit hasColor yellow
?fruit wdt:P462 wd:Q943
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}- You can further filter this down by adding more triple conditions using
;character e.g. you could filter for actual fruits by doing
# fruit instance of or subclass of a fruit
?fruit wdt:P31/wdt:P279* wd:Q3314483;- Advanced filters :
p:for properties, pointing to the subjectps:for property statementpq:for property qualifier
- You can abbreviate a lot with the
[]syntax
SELECT ?painting ?paintingLabel ?material ?materialLabel
WHERE
{
# element is a painting
?painting wdt:P31/wdt:P279* wd:Q3305213;
# extract the statement node 'material' (P186)
p:P186 [
# get material property statement
ps:P186 ?material;
# 'applies to part'(P518) 'painting surface'(Q861259)
pq:P518 wd:Q861259
].
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}- More grammar by example:
ORDER BY,LIMIT
SELECT ?country ?countryLabel ?population
WHERE
{
# instances of sovereign state
?country wdt:P31/wdt:P279* wd:Q3624078;
# hasPopulation populationValue
wdt:P1082 ?population.
# filter for english translations
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
# ASC(?something) or DESC(?something)
ORDER BY DESC(?population)
LIMIT 10- If you add more variables like population above, the query will filter out countries that don't have a population value. To fix this, use an
OPTIONALclause
SELECT ?book ?title ?illustratorLabel ?publisherLabel ?published
WHERE
{
?book wdt:P50 wd:Q35610.
OPTIONAL { ?book wdt:P1476 ?title. }
OPTIONAL { ?book wdt:P110 ?illustrator. }
OPTIONAL { ?book wdt:P123 ?publisher. }
OPTIONAL { ?book wdt:P577 ?published. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}FILTERandBIND, see tutorial section for more details
SELECT ?person ?personLabel ?age
WHERE
{
# instance of human
?person wdt:P31 wd:Q5;
wdt:P569 ?born;
wdt:P570 ?died;
# died from capital punishment
wdt:P1196 wd:Q8454.
BIND(?died - ?born AS ?ageInDays).
BIND(?ageInDays/365.2425 AS ?ageInYears).
BIND(FLOOR(?ageInYears) AS ?age).
FILTER(?age > 90)
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}- One can select based on a list of items using
VALUES
SELECT ?item ?itemLabel ?mother ?motherLabel
WHERE {
# A. Einstein or J.S. Bach
VALUES ?item { wd:Q937 wd:Q1339 }
# mother of
OPTIONAL { ?item wdt:P25 ?mother. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}- The Label Service extension automatically generates labels as follows:
?xxxLabelas a shortcut forrdfs:label?xxxAltLabelas a shortcut forskos:altLabel?xxxDescriptionas a shortcut forschema:description
Fun SPARQL queries related to football๐
Get the ๐ณ๐ฑ dutch nicknames of a team:
# get the dutch nicknames from Bayern Mรผnchen
SELECT ?item ?itemLabel ?itemDescription ?itemAltLabel
WHERE {
VALUES ?item { wd:Q15789 }
SERVICE wikibase:label { bd:serviceParam wikibase:language "nl". }
}Get the stadium names of the teams that are part of the Big 5:
SELECT ?item ?itemLabel ?venue ?venueLabel ?venueAltLabel
WHERE
{
?item wdt:P31/wdt:P279* wd:Q847017;
wdt:P118 ?league;
wdt:P115 ?venue.
# filter for Big 5
VALUES ?league { wd:Q82595 wd:Q9448 wd:Q13394 wd:Q15804 wd:Q324867 wd:Q206813}.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ?leagueSolutions to the tutorial exercices๐
Here are my solution to the exercises in that tutorial.
Chemical elements๐
Write a query that returns all chemical elements with their element symbol and atomic number, in order of their atomic number.
SELECT ?element ?elementLabel ?symbol ?atomic_number
WHERE
{
?element wdt:P31 wd:Q11344;
wdt:P246 ?symbol ;
wdt:P1086 ?atomic_number .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ASC(?atomic_number)Rivers that flow into the Mississippi๐
Write a query that returns all rivers that flow directly or indirectly into the Mississippi River.
SELECT ?river ?riverLabel
WHERE
{
?river wdt:P31 wd:Q4022;
wdt:P403/wdt:P403* wd:Q1497 .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ASC(?riverLabel)References to Le Figaro website๐
SELECT ?ref ?refURL WHERE {
?ref pr:P854 ?refURL .
FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) .
} LIMIT 10Now that you have developed a SparQL query, here is the simplest way to programatically query WikiData with python:
pandas
requests"""SPARQL utils."""
from pathlib import Path
from typing import List
from urllib.parse import urlparse
import pandas as pd
import requests
def query_wikidata(sparql_file: str, sparql_columns: List[str]) -> pd.DataFrame:
"""Query Wikidata SPARQL API endpoint."""
wikidata_api = "https://query.wikidata.org/sparql"
query = Path(sparql_file).read_text()
r = requests.get(wikidata_api, params={"format": "json", "query": query})
data = r.json()
df = (
pd.json_normalize(data, record_path=["results", "bindings"])
.rename(columns={c + ".value": c for c in sparql_columns})[sparql_columns]
.assign(q_id=lambda d: d.item.apply(lambda u: Path(urlparse(u).path).stem))
)
return df