Learn SPARQL in 5 minutes and use it to query WikiData
- #Data
I'm working on Entity Linking and Knowledge Bases. In that context, exporting a relevant part of Wikidata can be really useful to build surface form dictionaries and coocurence probabilities etc... In order to know which part of Wikidata is relevant to dump, I thought we could query Wikidata (although it seems we can only download the entire dump and filter afterwards).
SPARQL
is a language to formulate questions (queries) for knowledge databases. Therefore you can query Wikidata with SPARQL
. At first sight, the syntax is not particularly easy and I've gone through this tutorial.
SparQL in 5 minutes๐
#
is the comment character- The
SELECT
clause lists variables that you want returned (variables start with a question mark) - The
WHERE
clause contains restrictions on them, in the form of SPO triples (subject, predicate, object), e.g.?fruit hasColor yellow.
- On Wikidata, items and properties are not identified by human-readable names like โhasColorโ (property) or โyellowโ (item).
- Instead, Wikidata items and properties are assigned an identifier, that you need to know beforehand.
- you can search for the itentifiers using
search term
for items andP:search term
for properties - but you should rely on autocompletion in query.wikidata.org by pressing Ctrl + Space
- Finally, you need to include prefix namespaces to query the WQDS (Wikidata Query Service). There are many prefixes, one for each namespace in SPARQL :
wd:
for itemswdt:
for properties, pointing to the object
- to doublecheck what prefix links to what resource, use https://prefix.cc/
- to get more than the Wikidata ID as selectable attributes, you need to include them in the WHERE clause using
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
Putting this all together we get:
SELECT ?fruit ?fruitLabel
WHERE
{
# fruit hasColor yellow
?fruit wdt:P462 wd:Q943
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
- You can further filter this down by adding more triple conditions using
;
character e.g. you could filter for actual fruits by doing
# fruit instance of or subclass of a fruit
?fruit wdt:P31/wdt:P279* wd:Q3314483;
- Advanced filters :
p:
for properties, pointing to the subjectps:
for property statementpq:
for property qualifier
- You can abbreviate a lot with the
[]
syntax
SELECT ?painting ?paintingLabel ?material ?materialLabel
WHERE
{
# element is a painting
?painting wdt:P31/wdt:P279* wd:Q3305213;
# extract the statement node 'material' (P186)
p:P186 [
# get material property statement
ps:P186 ?material;
# 'applies to part'(P518) 'painting surface'(Q861259)
pq:P518 wd:Q861259
].
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
- More grammar by example:
ORDER BY
,LIMIT
SELECT ?country ?countryLabel ?population
WHERE
{
# instances of sovereign state
?country wdt:P31/wdt:P279* wd:Q3624078;
# hasPopulation populationValue
wdt:P1082 ?population.
# filter for english translations
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
# ASC(?something) or DESC(?something)
ORDER BY DESC(?population)
LIMIT 10
- If you add more variables like population above, the query will filter out countries that don't have a population value. To fix this, use an
OPTIONAL
clause
SELECT ?book ?title ?illustratorLabel ?publisherLabel ?published
WHERE
{
?book wdt:P50 wd:Q35610.
OPTIONAL { ?book wdt:P1476 ?title. }
OPTIONAL { ?book wdt:P110 ?illustrator. }
OPTIONAL { ?book wdt:P123 ?publisher. }
OPTIONAL { ?book wdt:P577 ?published. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
FILTER
andBIND
, see tutorial section for more details
SELECT ?person ?personLabel ?age
WHERE
{
# instance of human
?person wdt:P31 wd:Q5;
wdt:P569 ?born;
wdt:P570 ?died;
# died from capital punishment
wdt:P1196 wd:Q8454.
BIND(?died - ?born AS ?ageInDays).
BIND(?ageInDays/365.2425 AS ?ageInYears).
BIND(FLOOR(?ageInYears) AS ?age).
FILTER(?age > 90)
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
- One can select based on a list of items using
VALUES
SELECT ?item ?itemLabel ?mother ?motherLabel
WHERE {
# A. Einstein or J.S. Bach
VALUES ?item { wd:Q937 wd:Q1339 }
# mother of
OPTIONAL { ?item wdt:P25 ?mother. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
- The Label Service extension automatically generates labels as follows:
?xxxLabel
as a shortcut forrdfs:label
?xxxAltLabel
as a shortcut forskos:altLabel
?xxxDescription
as a shortcut forschema:description
Fun SPARQL queries related to football๐
Get the ๐ณ๐ฑ dutch nicknames of a team:
# get the dutch nicknames from Bayern Mรผnchen
SELECT ?item ?itemLabel ?itemDescription ?itemAltLabel
WHERE {
VALUES ?item { wd:Q15789 }
SERVICE wikibase:label { bd:serviceParam wikibase:language "nl". }
}
Get the stadium names of the teams that are part of the Big 5:
SELECT ?item ?itemLabel ?venue ?venueLabel ?venueAltLabel
WHERE
{
?item wdt:P31/wdt:P279* wd:Q847017;
wdt:P118 ?league;
wdt:P115 ?venue.
# filter for Big 5
VALUES ?league { wd:Q82595 wd:Q9448 wd:Q13394 wd:Q15804 wd:Q324867 wd:Q206813}.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ?league
Solutions to the tutorial exercices๐
Here are my solution to the exercises in that tutorial.
Chemical elements๐
Write a query that returns all chemical elements with their element symbol and atomic number, in order of their atomic number.
SELECT ?element ?elementLabel ?symbol ?atomic_number
WHERE
{
?element wdt:P31 wd:Q11344;
wdt:P246 ?symbol ;
wdt:P1086 ?atomic_number .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ASC(?atomic_number)
Rivers that flow into the Mississippi๐
Write a query that returns all rivers that flow directly or indirectly into the Mississippi River.
SELECT ?river ?riverLabel
WHERE
{
?river wdt:P31 wd:Q4022;
wdt:P403/wdt:P403* wd:Q1497 .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ASC(?riverLabel)
References to Le Figaro website๐
SELECT ?ref ?refURL WHERE {
?ref pr:P854 ?refURL .
FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) .
} LIMIT 10
Now that you have developed a SparQL query, here is the simplest way to programatically query WikiData with python:
pandas
requests
"""SPARQL utils."""
from pathlib import Path
from typing import List
from urllib.parse import urlparse
import pandas as pd
import requests
def query_wikidata(sparql_file: str, sparql_columns: List[str]) -> pd.DataFrame:
"""Query Wikidata SPARQL API endpoint."""
wikidata_api = "https://query.wikidata.org/sparql"
query = Path(sparql_file).read_text()
r = requests.get(wikidata_api, params={"format": "json", "query": query})
data = r.json()
df = (
pd.json_normalize(data, record_path=["results", "bindings"])
.rename(columns={c + ".value": c for c in sparql_columns})[sparql_columns]
.assign(q_id=lambda d: d.item.apply(lambda u: Path(urlparse(u).path).stem))
)
return df