Learn SPARQL in 5 minutes and use it to query WikiData

Learn SPARQL in 5 minutes and use it to query WikiData

Data

I'm working on Entity Linking and Knowledge Bases. In that context, exporting a relevant part of Wikidata can be really useful to build surface form dictionaries and coocurence probabilities etc... In order to know which part of Wikidata is relevant to dump, I thought we could query Wikidata (although it seems we can only download the entire dump and filter afterwards).

SPARQL is a language to formulate questions (queries) for knowledge databases. Therefore you can query Wikidata with SPARQL. At first sight, the syntax is not particularly easy and I've gone through this tutorial.

SparQL in 5 minutes

  • #is the comment character
  • The SELECT clause lists variables that you want returned (variables start with a question mark)
  • The WHERE clause contains restrictions on them, in the form of SPO triples (subject, predicate, object), e.g. ?fruit hasColor yellow.
  • On Wikidata, items and properties are not identified by human-readable names like “hasColor” (property) or “yellow” (item).
  • Instead, Wikidata items and properties are assigned an identifier, that you need to know beforehand.
    • for items, it's a Q-number, e.g. "yellow" is Q943
    • for properties, it's a P-number, e.g. "hasColor" is P462
  • you can search for the itentifiers using search term for items and P:search term for properties
  • but you should rely on autocompletion in query.wikidata.org by pressing Ctrl + Space
  • Finally, you need to include prefix namespaces to query the WQDS (Wikidata Query Service). There are many prefixes, one for each namespace in SPARQL :
    • wd: for items
    • wdt: for properties, pointing to the object
  • to doublecheck what prefix links to what resource, use https://prefix.cc/
  • to get more than the Wikidata ID as selectable attributes, you need to include them in the WHERE clause using SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }

Putting this all together we get:

SELECT ?fruit ?fruitLabel
WHERE
{
  # fruit hasColor yellow
  ?fruit wdt:P462 wd:Q943
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it

  • You can further filter this down by adding more triple conditions using ; character e.g. you could filter for actual fruits by doing
# fruit instance of or subclass of a fruit
?fruit wdt:P31/wdt:P279* wd:Q3314483;
  • Advanced filters :
    • p: for properties, pointing to the subject
    • ps: for property statement
    • pq: for property qualifier
  • You can abbreviate a lot with the [] syntax
SELECT ?painting ?paintingLabel ?material ?materialLabel
WHERE
{
  # element is a painting
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
  # extract the statement node 'material' (P186)
            p:P186 [
              # get material property statement
              ps:P186 ?material;
              # 'applies to part'(P518) 'painting surface'(Q861259)
              pq:P518 wd:Q861259
            ].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
  • More grammar by example: ORDER BY, LIMIT
SELECT ?country ?countryLabel ?population
WHERE
{
  # instances of sovereign state
  ?country wdt:P31/wdt:P279* wd:Q3624078;
  # hasPopulation populationValue
           wdt:P1082 ?population.
  # filter for english translations
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
# ASC(?something) or DESC(?something)
ORDER BY DESC(?population)
LIMIT 10
  • If you add more variables like population above, the query will filter out countries that don't have a population value. To fix this, use an OPTIONAL clause
SELECT ?book ?title ?illustratorLabel ?publisherLabel ?published
WHERE
{
  ?book wdt:P50 wd:Q35610.
  OPTIONAL { ?book wdt:P1476 ?title. }
  OPTIONAL { ?book wdt:P110 ?illustrator. }
  OPTIONAL { ?book wdt:P123 ?publisher. }
  OPTIONAL { ?book wdt:P577 ?published. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
SELECT ?person ?personLabel ?age
WHERE
{
  # instance of human
  ?person wdt:P31 wd:Q5;
          wdt:P569 ?born;
          wdt:P570 ?died;
  # died from capital punishment
          wdt:P1196 wd:Q8454.
  BIND(?died - ?born AS ?ageInDays).
  BIND(?ageInDays/365.2425 AS ?ageInYears).
  BIND(FLOOR(?ageInYears) AS ?age).
  FILTER(?age > 90)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it

  • One can select based on a list of items using VALUES
SELECT ?item ?itemLabel ?mother ?motherLabel
WHERE {
  # A. Einstein or J.S. Bach
  VALUES ?item { wd:Q937 wd:Q1339 }
  # mother of
  OPTIONAL { ?item wdt:P25 ?mother. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
  • The Label Service extension automatically generates labels as follows:
    • ?xxxLabel as a shortcut for rdfs:label
    • ?xxxAltLabel as a shortcut for skos:altLabel
    • ?xxxDescription as a shortcut for schema:description

Get the 🇳🇱 dutch nicknames of a team:

# get the dutch nicknames from Bayern München
SELECT ?item ?itemLabel ?itemDescription ?itemAltLabel
WHERE {
  VALUES ?item { wd:Q15789 }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl". }
}

Try it

Get the stadium names of the teams that are part of the Big 5:

SELECT ?item ?itemLabel ?venue ?venueLabel ?venueAltLabel
WHERE
{
  ?item wdt:P31/wdt:P279* wd:Q847017;
        wdt:P118 ?league;
        wdt:P115 ?venue.
  # filter for Big 5
  VALUES ?league { wd:Q82595 wd:Q9448 wd:Q13394 wd:Q15804 wd:Q324867 wd:Q206813}.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ?league

Try it

Solutions to the tutorial exercices

Here are my solution to the exercises in that tutorial.

Chemical elements

Write a query that returns all chemical elements with their element symbol and atomic number, in order of their atomic number.

SELECT ?element ?elementLabel ?symbol ?atomic_number
WHERE
{
  ?element wdt:P31 wd:Q11344;
           wdt:P246 ?symbol ;
           wdt:P1086 ?atomic_number .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ASC(?atomic_number)

Try it

Rivers that flow into the Mississippi

Write a query that returns all rivers that flow directly or indirectly into the Mississippi River.

SELECT ?river ?riverLabel
WHERE
{
  ?river wdt:P31 wd:Q4022;
         wdt:P403/wdt:P403* wd:Q1497 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ASC(?riverLabel)

Try it

References to Le Figaro website

SELECT ?ref ?refURL WHERE {
  ?ref pr:P854 ?refURL .
  FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) .
} LIMIT 10

Try it

Querying WikiData with SparQL and Python

Now that you have developed a SparQL query, here is the simplest way to programatically query WikiData with python:

requirements.txt
pandas
requests
sparql.py
"""SPARQL utils."""
from pathlib import Path
from typing import List
from urllib.parse import urlparse

import pandas as pd
import requests


def query_wikidata(sparql_file: str, sparql_columns: List[str]) -> pd.DataFrame:
    """Query Wikidata SPARQL API endpoint."""
    wikidata_api = "https://query.wikidata.org/sparql"
    query = Path(sparql_file).read_text()
    r = requests.get(wikidata_api, params={"format": "json", "query": query})
    data = r.json()
    df = (
        pd.json_normalize(data, record_path=["results", "bindings"])
        .rename(columns={c + ".value": c for c in sparql_columns})[sparql_columns]
        .assign(q_id=lambda d: d.item.apply(lambda u: Path(urlparse(u).path).stem))
    )
    return df