Panama Papers Investigation using Entity Resolution and Entity Linking
If you’ve worked with a corpus of text, chances are you needed to structure its information specifically for your domain. How can you link the entities mentioned in the articles to a knowledge base you control, which you can enrich and which might evolve depending on your focus?
Imagine you are an investigative journalist sifting through the Panama Papers and you are following a lead: the consortium called “Londex Resources S.A.”. You’re not sure what people, organizations, countries or other articles are connected to that lead. Perhaps one of them can be your next breakthrough?
In this article, we will demonstrate a technical approach that combines Entity Resolution performed with Senzing and Entity Linking performed in spaCy. We show how this can be used to construct a domain-specific Knowledge Graph, for example around a lead you’re following, and to analyze your corpus with it. We will then show how to close the loop and use the analyzed corpus to update the Knowledge Graph with new leads.
Along with this blog post, we have open-sourced a package for zero-shot entity linking, spacy-lancedb-linker, and released a reference tutorial, erkg-tutorials.
Articles from the ICIJ Offshore Leaks dataset
For this blog post, we will be looking at a set of articles from investigative journalism projects like the Panama Papers, the Pandora Papers, and Offshore Leaks. Those are cross-border investigations that made the headlines and were led by ICIJ (International Consortium of Investigative Journalists).
ICIJ maintains the ICIJ Offshore Leaks dataset, in the form of either a Neo4j database or a set of zipped CSV files. The dataset contains 4 main entity types.
- Persons or “Officers” are directors, shareholders, and beneficiaries of offshore companies, for example presidents, royals, members of parliament, their family members, and their closest associates.
- “Intermediaries” are secrecy brokers, like banks or law firms, that Officers turn to in order to optimize their finances.
- Organizations or “Entities” are shell companies established by secrecy brokers.
- “Addresses” are the countries, world regions, or secrecy jurisdictions of Officers, Entities, or Intermediaries.
For example, Offshore Leaks has shown that Arzu Aliyeva, daughter of Ilham Aliyev, president of Azerbaijan, lives in Dubai and is a shareholder and director of Arbor Investments Ltd, registered in the Virgin Islands. This creates a natural graph that connects Arzu Aliyeva to other Officers like Hassan Gozal.
This dataset is commonly used to identify Ultimate Beneficial Owners (UBO) or to investigate Anti-Money Laundering (AML) scenarios. Prior work shows how to use this data in Neo4j and in Linkurious, and presents typical investigations written with that data. In this blog post, we instead show how a Senzing-preprocessed version of this dataset can power an Entity Linking use case.
Overview of the high-level architecture
Senzing provides a development library for Principle-Based Entity Resolution based on Entity-Centric Learning. Senzing Founder/CEO Jeff Jonas said: “[we want to help] developers fast-track their entity resolution needs – as understanding who is who and who is related to who is essential – and exceptionally essential in the creation of entity resolved knowledge graphs (ERKG)”. They have previously shown how to extract personally identifiable information (PII) from the ICIJ graph to be used as input into Senzing. After configuring and running Senzing, a JSON export of entity resolution (ER) results can be used to construct or update a Knowledge Graph (KG), called an entity-resolved knowledge graph (ERKG). Pre-computed ER results for ICIJ are shared as a dataset by Senzing in a GCP public bucket (download link).
While other tutorials show the ICIJ Offshore Leaks data loading into graph databases and entity resolution with Senzing, this tutorial starts with the Senzing export. With a custom Data Engineering pipeline, we can ingest Entity Resolution results into an Approximate Nearest Neighbors (ANN) index stored in LanceDB. We can then use that index in a spaCy pipeline to run Entity Linking against a small dataset of scraped ICIJ web articles. The end-user can then use the output of the entity linking.
In practice, in louisguitton/erkg-tutorials we built this data pipeline in Python using an orchestration tool (Dagster, see References), which helps visualise it. The Senzing ER results feed a Senzing pipeline that builds the EL inputs, which in turn feed a spaCy pipeline. Next, we will see in detail how to use the ERKG to power Entity Linking.
From a Suspicion to Entity Linking
While Senzing is proven to scale into billions of records, the rest of these components don't all scale the same way without performance engineering. Given that ICIJ has 1.5M records and ~5M aliases, we draw on a subset to make this tutorial quick and easy for the reader.
When doing Entity Linking against Wikidata or DBpedia, one typically works with a subset so as not to load the entire Knowledge Graph into the entity linking pipeline. That subset is obtained either by querying the KG with a query language like SPARQL, or by building a custom KG from smaller files (CSVs or JSONs).
In practice, investigative journalists also work off so-called Case Management Systems. In that workflow, they use software to organize and analyze information, they get assigned a "lead" (a specific person or company), and they only look at the immediate subgraph for that lead.
For those reasons, we start from a text file called data/icij-example/suspicious.txt, where the investigative journalist can seed the system. Let’s say the lead you have to explore is the consortium called “Londex Resources S.A.” which has ties with the Azerbaijani presidential family: you start by providing a few entity names from the Senzing ERKG you care about. Here, we start with Arzu Aliyeva the daughter, Ilham Aliyev the president, etc…
Arzu Aliyeva
Ilham Aliyev
Mossack Fonseca
Fazil Mammadov
AtaHolding
FM Management Holding Group S.A. Stand
UF Universe Foundation
Mehriban Aliyeva
Heydar Aliyev
Leyla Aliyeva
AtaHolding Azerbaijan
Financial Management Holding Limited
Hughson Management Inc.
From that, we’re able to filter the ERKG down (using a friend-of-friend logic) to fewer than 100 entities of interest. That’s the immediate subgraph around our lead. Starting with this might be enough. If it turns out it isn’t, you can expand the subgraph either by adding seed entities to suspicious.txt or by adding more friends of friends.
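The friend-of-friend filtering can be pictured as a bounded breadth-first search from the seed entities. The sketch below is only an illustration under assumptions: the function name `neighborhood`, the `(source, target)` edge format, and the toy node ids are hypothetical and are not taken from the tutorial's pipeline.

```python
from collections import deque


def neighborhood(edges, seeds, max_hops=2):
    """Return every node id within `max_hops` relationships of any seed."""
    # Build an undirected adjacency map from (source, target) pairs.
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)

    # Breadth-first search that stops expanding once max_hops is reached.
    distance = {seed: 0 for seed in seeds}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


# Toy graph with made-up ids: A-B-C-D is a chain, E-F is unrelated.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
subgraph = neighborhood(edges, seeds=["A"], max_hops=2)
```

With `max_hops=2`, the seed, its friends, and their friends are kept; raising `max_hops` corresponds to "adding more friends of friends".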
Once we’ve filtered the ERKG, we extract aliases into the aliases.jsonl file, in the format required by the entity linking library we wrote.
{"alias":"Ilham Aliyev","entities":["1342265","1551574"],"probabilities":[0.5,0.5]}
{"alias":"Arzu Aliyeva","entities":["281073","918573","1470056","1722271","1697384","1380470"],"probabilities":[0.1666666667,0.1666666667,0.1666666667,0.1666666667,0.1666666667,0.1666666667]}
{"alias":"Arzu Ilham Qizi Aliyeva","entities":["883102"],"probabilities":[1.0]}
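The sample records above give every entity sharing an alias the same probability. A minimal sketch of how such records could be produced follows; it assumes a uniform prior (which is what the samples suggest, not something the pipeline is confirmed to do in general), and `build_alias_records` with its `(name, entity_id)` input is a hypothetical helper.

```python
import json
from collections import defaultdict


def build_alias_records(name_entity_pairs):
    """Group entity ids by alias and spread the probability mass uniformly."""
    alias_to_entities = defaultdict(list)
    for name, entity_id in name_entity_pairs:
        if entity_id not in alias_to_entities[name]:
            alias_to_entities[name].append(entity_id)

    records = []
    for alias, entity_ids in alias_to_entities.items():
        # Assumption: no evidence favours one entity, so each gets 1/n.
        prior = round(1.0 / len(entity_ids), 10)
        records.append({
            "alias": alias,
            "entities": entity_ids,
            "probabilities": [prior] * len(entity_ids),
        })
    return records


pairs = [
    ("Ilham Aliyev", "1342265"),
    ("Ilham Aliyev", "1551574"),
    ("Arzu Ilham Qizi Aliyeva", "883102"),
]
records = build_alias_records(pairs)
# One JSON object per line, as expected for a .jsonl file.
jsonl_lines = [json.dumps(record) for record in records]
```

If better evidence is available (for example how often each entity is the intended one in labelled text), the uniform prior could be replaced by those frequencies.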
We also need to generate entity descriptions from the ERKG to populate the second file required by the entity linking library, entities.jsonl. We generate those descriptions by joining together the structured features available in the ERKG.
{"entity_id": "1342265", "type": "PER", "name": "Ilham Aliyev", "description": "Ilham Aliyev, located at P.O. BOX 17920 JEBEL ALI FREE ZONE DUBAI UAE, in United Arab Emirates"}
{"entity_id": "1697384", "type": "PER", "name": "Arzu Aliyeva", "description": "Arzu Aliyeva, located at APARTMENT NO. 1801 DUBAI MARINA LEREV RESIDENTIAL DUBAI U.A.E., in United Arab Emirates"}
{"entity_id": "1551574", "type": "ORG", "name": "Rosamund International Ltd", "description": "Rosamund International Ltd, located at PORTCULLIS TRUSTNET CHAMBERS P.O. BOX 3444 ROAD TOWN, TORTOLA BRITISH VIRGIN ISLANDS, in British Virgin Islands"}
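Judging from the samples above, a description joins the name, the address, and the country into one sentence-like string. The following sketch reproduces that shape; `describe` and `build_entity_record` are hypothetical helpers, and the handling of missing features is an assumption.

```python
def describe(name, address=None, country=None):
    """Join the structured features that are present into one description."""
    parts = [name]
    if address:
        parts.append(f"located at {address}")
    if country:
        parts.append(f"in {country}")
    return ", ".join(parts)


def build_entity_record(entity_id, entity_type, name, address=None, country=None):
    """Assemble one entities.jsonl record using the fields shown above."""
    return {
        "entity_id": entity_id,
        "type": entity_type,
        "name": name,
        "description": describe(name, address, country),
    }


record = build_entity_record(
    "1342265",
    "PER",
    "Ilham Aliyev",
    address="P.O. BOX 17920 JEBEL ALI FREE ZONE DUBAI UAE",
    country="United Arab Emirates",
)
```

Skipping absent features keeps descriptions free of dangling fragments such as "located at None".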
Introducing spacy-lancedb-linker, a new library for ANN Entity Linking with spaCy
With our two artefacts ready, we can start using entity linking. Entity Linking is one of the common NLP tasks.
A more formal definition of Entity Linking can be found in the Zshot paper by IBM:
Entity Linking, also known as named entity disambiguation, is the process of identifying and disambiguating mentions of entities in a text, linking them to their corresponding entries in a knowledge base or a dictionary. For example, given "Barack Obama", entity linking would determine that this refers to the specific person with that name (one of the presidents of the United States) and not any other person or concept with the same name. [...] Entity linking can be useful for a variety of natural language processing tasks, such as information extraction, question answering, and text summarization. It helps to provide context and background information about the entities mentioned in the text, which can facilitate a deeper understanding of the content.
Several techniques can be used for entity linking. From deep learning and supervised learning to unsupervised learning approaches. They usually have two stages: candidate creation and candidate ranking. In candidate creation, the approaches aim to narrow down the vast number of entities into a manageable subset (e.g., tens or hundreds), and in candidate ranking, the approaches aim to rank the candidate entities of each mention according to the probability that they match the given mention.
When it comes to open-source implementations, there is of course spaCy’s Entity Linker, but it uses supervised learning and thus requires labels, which is not practical when iterating quickly. There is also IBM’s zshot Linker, which implements 5 deep-learning linkers and is zero-shot; still, the underlying models rely on deep learning, so they might be slower, and they were trained on labels. We found Microsoft’s spaCy-compatible ANN linker, which uses unsupervised learning: it builds an Approximate Nearest Neighbors (ANN) index computed on the character n-gram TF-IDF representation of all aliases in your KnowledgeBase. This approach was the best fit for our use case. Unfortunately, the project is no longer supported: the last commit is from 2 years ago, and the ANN index it uses (nmslib) was causing setup errors.
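To make the candidate creation idea concrete, here is a small pure-Python sketch of character n-gram TF-IDF matching. It is not the implementation of either library: it scores every alias exactly (brute force) instead of using an approximate index, and the trigram size, the smoothed IDF formula, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase, pad with spaces, and slide a window of n characters."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def normalize(vector):
    """Scale a sparse vector to unit length so dot product equals cosine."""
    norm = math.sqrt(sum(weight * weight for weight in vector.values()))
    return {gram: weight / norm for gram, weight in vector.items()} if norm else vector


def build_index(aliases, n=3):
    """Compute a smoothed IDF table and one unit TF-IDF vector per alias."""
    counts_per_alias = [Counter(char_ngrams(alias, n)) for alias in aliases]
    document_frequency = Counter(g for counts in counts_per_alias for g in counts)
    total = len(aliases)
    idf = {g: math.log((1 + total) / (1 + df)) + 1.0
           for g, df in document_frequency.items()}
    vectors = [normalize({g: tf * idf[g] for g, tf in counts.items()})
               for counts in counts_per_alias]
    return idf, vectors


def candidates(mention, aliases, idf, vectors, k=2, n=3):
    """Return the k aliases most similar to the mention, as (score, alias)."""
    counts = Counter(char_ngrams(mention, n))
    # N-grams never seen in the index carry no weight.
    query = normalize({g: tf * idf[g] for g, tf in counts.items() if g in idf})
    scored = [
        (sum(weight * vector.get(g, 0.0) for g, weight in query.items()), alias)
        for alias, vector in zip(aliases, vectors)
    ]
    return sorted(scored, reverse=True)[:k]


aliases = ["Ilham Aliyev", "Arzu Aliyeva", "Mossack Fonseca", "AtaHolding"]
idf, vectors = build_index(aliases)
# A misspelled mention still shares most trigrams with the right alias.
top = candidates("Ilham Aliev", aliases, idf, vectors)
```

Because the representation works on characters rather than words, spelling variants and transliterations of a name remain close to each other, which is what makes the approach usable without labels. An ANN index replaces the exhaustive scoring loop when the number of aliases grows.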
Inspired by microsoft/spacy-ann-linker, we therefore wrote our own ANN entity linking library, louisguitton/spacy-lancedb-linker, swapping nmslib for LanceDB, an actively maintained ANN index. The result is a simple API that we can use to run unsupervised entity linking in spaCy:
from typing import Iterator

import srsly
from spacy.language import Language
from spacy.tokens import Doc, DocBin
from spacy_lancedb_linker.kb import AnnKnowledgeBase
# Unused import kept on purpose: it makes the "ann_linker" factory available to nlp.add_pipe.
from spacy_lancedb_linker.linker import AnnLinker  # noqa
from spacy_lancedb_linker.types import Alias, Entity


def entity_linking(nlp: Language, spacy_dataset: DocBin) -> Iterator[Doc]:
    # Load the two artefacts generated from the filtered ERKG.
    entities = [Entity(**entity) for entity in srsly.read_jsonl("data/icij-example/entities.jsonl")]
    aliases = [Alias(**alias) for alias in srsly.read_jsonl("data/icij-example/aliases.jsonl")]

    # Build the LanceDB-backed knowledge base from entities and aliases.
    ann_kb = AnnKnowledgeBase(uri="data/sample-lancedb")
    ann_kb.add_entities(entities)
    ann_kb.add_aliases(aliases)

    # Add the linker as the last pipeline component and attach the knowledge base.
    ann_linker = nlp.add_pipe("ann_linker", last=True)
    ann_linker.set_kb(ann_kb)

    # Run the full pipeline lazily over the stored documents.
    docs = spacy_dataset.get_docs(nlp.vocab)
    return nlp.pipe(docs)
Combining all the pieces
To recap, we start from Senzing's ERKG for ICIJ, we filter it using the lead to follow in suspicious.txt, we generate the two artifacts that we need for spacy-lancedb-linker, and we now can put together an Entity Linking pipeline. Let’s have a look at the output of the Entity Linking on an ICIJ web article about the Azeri presidential family:
The Entity Linking here can be used for information extraction, or to provide context and background information about the entities mentioned in the text. We can also use the following simple heuristic: if an entity does not link to anything in the KB but is central to the article, it may be worth investigating next.
To implement this, we show in the tutorial how to use DerwenAI/pytextrank to rank entities and filter for entities that are not linked. This can form the basis of a human-in-the-loop system where the investigative journalist updates the KB or decides which leads to follow next. In the case of this article, we see that Londex Resources S.A. is mentioned 2 times and ranked in position 19 among the most important entities in the article. We can then explore the ICIJ Offshore Leaks dataset to see whether that entity is known and linked to others, and if not, decide to investigate it further.
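The heuristic boils down to a filter and a sort. The sketch below works on plain records rather than on spaCy or pytextrank objects, so the `Mention` class, the `max_rank` cutoff, and the sample rows (apart from the rank and count quoted above for Londex Resources S.A.) are hypothetical. In spaCy, an unlinked entity exposes an empty `kb_id_`, which is what the empty string stands for here, assuming the linker follows that convention.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    rank: int    # position in the importance ranking, 1 = most central
    count: int   # number of occurrences in the article
    kb_id: str   # empty string when no KB entry was linked


def new_lead_candidates(mentions, max_rank=20):
    """Keep unlinked mentions that are central enough, most central first."""
    unlinked = [m for m in mentions if not m.kb_id and m.rank <= max_rank]
    return sorted(unlinked, key=lambda m: m.rank)


mentions = [
    Mention("Arzu Aliyeva", rank=1, count=5, kb_id="1697384"),  # made-up rank and count
    Mention("Londex Resources S.A.", rank=19, count=2, kb_id=""),
    Mention("Example Holdings Ltd", rank=35, count=1, kb_id=""),  # made-up entity
]
leads = new_lead_candidates(mentions)
```

The cutoff trades recall for reviewer time: a lower `max_rank` surfaces fewer, more central candidates for the journalist to review.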
We hope this blog post was useful in demonstrating a technical approach that combines Entity Resolution performed with Senzing with Entity Linking performed in spaCy. We showed how this can be used to construct a domain-specific Knowledge Graph, in particular around the Azerbaijan presidential family, and we showed how to analyze a corpus of articles with this pipeline and come up with new leads.
If you’re curious about this approach, check out the reference tutorial at erkg-tutorials and the unsupervised entity linking library we’ve open-sourced spacy-lancedb-linker.
References
- https://en.wikipedia.org/wiki/Panama_Papers
- https://en.wikipedia.org/wiki/Pandora_Papers
- https://en.wikipedia.org/wiki/Offshore_Leaks
- https://www.icij.org/about/
- https://offshoreleaks.icij.org/pages/database
- https://offshoreleaks.icij.org/nodes/78392
- https://neo4j.com/blog/analyzing-panama-papers-neo4j/
- https://source.opennews.org/articles/people-and-tech-behind-panama-papers/
- https://www.theguardian.com/news/2016/apr/03/what-you-need-to-know-about-the-panama-papers
- https://senzing.com/about/
- https://github.com/Senzing/mapper-icij
- https://senzing.com/entity-resolved-knowledge-graphs/
- https://storage.googleapis.com/erkg/icij/ICIJ-entity-report-2024-06-21_12-04-57-std.json.zip
- https://github.com/louisguitton/spacy-lancedb-linker
- https://dagster.io/
- https://github.com/louisguitton/erkg-tutorials
- https://www.kaseware.com/case-management
- Entity Linking and Discovery via Arborescence-based Supervised Clustering https://arxiv.org/pdf/2109.01242
- https://arxiv.org/pdf/2307.13497
- Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking https://arxiv.org/pdf/2302.07189
- Low-Rank Subspaces for Unsupervised Entity Linking https://arxiv.org/pdf/2104.08737
- https://spacy.io/usage/linguistic-features#entity-linking
- https://ibm.github.io/zshot/#linker
- https://microsoft.github.io/spacy-ann-linker/
- https://github.com/nmslib/nmslib
- https://github.com/lancedb/lancedb
- https://github.com/DerwenAI/pytextrank