
Panama Papers Investigation using Entity Resolution and Entity Linking

  • #NLP
Read time: 9 minutes

If you’ve worked with a corpus of text, chances are you needed to structure its information specifically for your domain. How can you link the entities mentioned in the articles to a knowledge base you control, which you can enrich and which might evolve depending on your focus?

Imagine you are an investigative journalist sifting through the Panama Papers and you are following a lead: the consortium called “Londex Resources S.A.”. You’re not sure what people, organisations, countries or other articles are connected to that lead. Perhaps one of them can be your next breakthrough?

In this article, we will demonstrate a technical approach that combines Entity Resolution (performed with Senzing) and Entity Linking (performed in spaCy). We show how this can be used to construct a domain-specific Knowledge Graph, e.g. around a lead you’re following, and to analyse your corpus with it. We will then show how to close the loop and use the analysed corpus to update the Knowledge Graph with new leads.

Along with this blog post, we have open-sourced a package for zero-shot entity linking, spacy-lancedb-linker, and released a reference tutorial, erkg-tutorials.

Articles from the ICIJ Offshore Leaks dataset🔗

For this blog post, we will be looking at a set of articles from investigative journalism projects like the Panama Papers, the Pandora Papers, and Offshore Leaks. These are cross-border investigations that made headlines and were led by the ICIJ (International Consortium of Investigative Journalists).

ICIJ maintains the ICIJ Offshore Leaks dataset, available either as a Neo4j database or as a set of zipped CSV files. The dataset contains four main entity types.

Data schema of the ICIJ Offshore Leaks dataset

Persons, or “Officers”, are directors, shareholders, and beneficiaries of offshore companies: for example presidents, royals, members of parliament, their family members, and their closest associates. “Intermediaries” are secrecy brokers like banks or law firms that Officers turn to in order to optimize their finances. Organizations, or “Entities”, are shell companies established by those secrecy brokers. “Addresses” are the countries, world regions, and secret jurisdictions of Officers, Entities, or Intermediaries.
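
For readers who want to explore the raw data directly, here is a minimal sketch of loading the zipped CSV release with pandas; the file names are assumptions based on the public bulk download and may differ between releases:

# Sketch: peek at the ICIJ Offshore Leaks CSV release with pandas.
# File names are assumptions and may differ between releases.
import pandas as pd

officers = pd.read_csv("nodes-officers.csv", low_memory=False)
entities = pd.read_csv("nodes-entities.csv", low_memory=False)
intermediaries = pd.read_csv("nodes-intermediaries.csv", low_memory=False)
addresses = pd.read_csv("nodes-addresses.csv", low_memory=False)
relationships = pd.read_csv("relationships.csv", low_memory=False)

print(len(officers), len(entities), len(intermediaries), len(addresses))
# Relationships connect node id pairs, e.g. an Officer that is
# "shareholder of" an Entity.
print(relationships.head())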

ICIJ Offshore Leaks node for Arzu Aliyeva

For example, Offshore Leaks has shown that Arzu Aliyeva, daughter of Ilham Aliyev, president of Azerbaijan, lives in Dubai and is a shareholder and director of Arbor Investments Ltd, registered in the Virgin Islands. This creates a natural graph that connects Arzu Aliyeva to other Officers like Hassan Gozal.

This dataset is commonly used to illustrate UBO (Ultimate Beneficial Owner) or AML (Anti-Money Laundering) use cases. Prior work shows how to use this data in Neo4j and in Linkurious, and illustrates the typical investigations journalists conduct with that data. In this blog post, we will instead show how a Senzing-preprocessed version of this dataset can power an Entity Linking use case.

Overview of the high-level architecture🔗

Senzing provides a purpose-built AI for real-time entity resolution. Senzing Founder/CEO Jeff Jonas said: “[we want to help] developers fast-track their success with the force-multiplying effect of entity-resolved knowledge graphs (ERKG)”. They have previously shown how to extract personally identifiable information (PII) from the ICIJ graph to be used as input into Senzing. After configuring and running Senzing, an export of JSON ER results can be used to construct or update a Knowledge Graph (KG), called an entity-resolved knowledge graph (ERKG). Pre-computed ER results for ICIJ are shared as a dataset by Senzing in a GCP public bucket.

High-level architecture for this blog post

While other tutorials show how to load the ICIJ Offshore Leaks data into graph databases and how to run entity resolution with Senzing, this tutorial starts from the Senzing export. With a custom data engineering pipeline, we ingest the Entity Resolution results into an Approximate Nearest Neighbors (ANN) index stored in LanceDB. We then use that index in a spaCy pipeline to run Entity Linking against a small dataset of scraped ICIJ web articles. The end user can then work with the output of the entity linking.

Dagster lineage graph of the pipeline built for this article

In practice, in louisguitton/erkg-tutorials we built this data pipeline in Python using Dagster, an orchestration tool that also helps visualise it. The Senzing ER results feed a first pipeline stage that builds the Entity Linking inputs, which in turn feed a spaCy pipeline. Next, we will see in detail how to use the ERKG to power Entity Linking.
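
Before diving in, here is a rough sketch of what that orchestration could look like as Dagster software-defined assets; the asset names and bodies are illustrative, not the exact tutorial code:

# Sketch: the data pipeline expressed as Dagster software-defined assets.
# Asset names and bodies are illustrative, not the exact tutorial code.
import dagster as dg


@dg.asset
def senzing_er_results():
    """Download and parse the Senzing ER export for ICIJ."""
    ...


@dg.asset(deps=[senzing_er_results])
def entity_linking_inputs():
    """Filter the ERKG around the lead and write aliases.jsonl and entities.jsonl."""
    ...


@dg.asset(deps=[entity_linking_inputs])
def linked_articles():
    """Run the spaCy pipeline with the ANN linker over the scraped ICIJ articles."""
    ...


defs = dg.Definitions(assets=[senzing_er_results, entity_linking_inputs, linked_articles])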

From a Suspicion to Entity Linking🔗

The problem with the ICIJ Offshore Leaks dataset is that it is big data: with 1.5M entities and 4.9M aliases, processing it requires engineering optimisations that are beyond the scope of this article. This also means, as a disclaimer, that the techniques we show today are not productized at that scale. Instead, we demonstrate the concept on a smaller subset of the dataset, which we hope the reader can adapt to their own context.

Similarly, when doing Entity Linking against Wikidata or DBpedia, one usually doesn’t load the entire Knowledge Graph into the entity linking pipeline; instead, one queries for a subset of the KG using query languages like SPARQL, or builds a custom KG from smaller files (CSVs or JSONs).

In practice, investigative journalists also work off so-called Case Management Systems: software used to organize and analyze information. In that workflow, they get assigned a "lead" (a specific person or company) and only look at the immediate subgraph around that lead.

For those reasons, we start from a text file called data/icij-example/suspicious.txt, where the investigative journalist can seed the system. Let’s say the lead you have to explore is the consortium called “Londex Resources S.A.”, which has ties to the Azerbaijani presidential family: you start by providing a few entity names from the Senzing ERKG that you care about. Here, we start with Arzu Aliyeva (the daughter), Ilham Aliyev (the president), and so on.

📁 suspicious.txt
Arzu Aliyeva
Ilham Aliyev
Mossack Fonseca
Fazil Mammadov
AtaHolding
FM Management Holding Group S.A. Stand
UF Universe Foundation
Mehriban Aliyeva
Heydar Aliyev
Leyla Aliyeva
AtaHolding Azerbaijan
Financial Management Holding Limited
Hughson Management Inc.

From that, we’re able to filter the ERKG down (using friend-of-friend logic) to fewer than 100 entities of interest: that is the immediate subgraph around our lead. Starting with this might be enough. If it turns out it isn’t, you can expand the subgraph either by adding seed entities to suspicious.txt or by adding more friends of friends.
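
A minimal sketch of that friend-of-friend expansion, using tiny hypothetical stand-ins for the structures one would build from the Senzing export, could look like this:

# Sketch: one-hop "friend of friend" expansion around the seed entities.
# `name_to_ids` and `edges` are hypothetical, tiny stand-ins for structures
# that would be built from the Senzing ER export.
from pathlib import Path

name_to_ids = {"Arzu Aliyeva": ["1697384"], "Ilham Aliyev": ["1342265"]}
edges = [("1342265", "1551574"), ("1697384", "883102")]  # (entity_id, related_entity_id)

seeds = Path("data/icij-example/suspicious.txt").read_text().splitlines()
seed_ids = {entity_id for name in seeds for entity_id in name_to_ids.get(name, [])}

# Keep every entity that is a seed or directly related to a seed.
entities_of_interest = set(seed_ids)
for source_id, target_id in edges:
    if source_id in seed_ids:
        entities_of_interest.add(target_id)
    elif target_id in seed_ids:
        entities_of_interest.add(source_id)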

Once we’ve filtered down the ERKG, we extract aliases into the aliases.jsonl file, in the format required by the entity linking library we wrote (a sketch of this step follows after the example below).

📁 aliases.jsonl
{"alias":"Ilham Aliyev","entities":["1342265","1551574"],"probabilities":[0.5,0.5]}
{"alias":"Arzu Aliyeva","entities":["281073","918573","1470056","1722271","1697384","1380470"],"probabilities":[0.1666666667,0.1666666667,0.1666666667,0.1666666667,0.1666666667,0.1666666667]}
{"alias":"Arzu Ilham Qizi Aliyeva","entities":["883102"],"probabilities":[1.0]}

We also need to generate entity descriptions from the ERKG to populate the second file required by the entity linking library, entities.jsonl. We generate those descriptions by joining together the structured features available in the ERKG (see the sketch after the example below).

📁 entities.jsonl
{"entity_id": "1342265", "type": "PER", "name": "Ilham Aliyev", "description": "Ilham Aliyev, located at P.O. BOX 17920 JEBEL ALI FREE ZONE DUBAI UAE, in United Arab Emirates"}
{"entity_id": "1697384", "type": "PER", "name": "Arzu Aliyeva", "description": "Arzu Aliyeva, located at APARTMENT NO. 1801 DUBAI MARINA LEREV RESIDENTIAL DUBAI U.A.E., in United Arab Emirates"}
{"entity_id": "1551574", "type": "ORG", "name": "Rosamund International Ltd", "description": "Rosamund International Ltd, located at PORTCULLIS TRUSTNET CHAMBERS P.O. BOX 3444 ROAD TOWN, TORTOLA BRITISH VIRGIN ISLANDS, in British Virgin Islands"}

Introducing spacy-lancedb-linker, a new library for ANN Entity Linking with spaCy🔗

With our two artifacts ready, we can start doing entity linking. Entity Linking is a common NLP task.

Entity Linking and Discovery

A more formal definition of Entity Linking can be found in the Zshot paper by IBM:

Entity Linking, also known as named entity disambiguation, is the process of identifying and disambiguating mentions of entities in a text, linking them to their corresponding entries in a knowledge base or a dictionary. For example, given "Barack Obama", entity linking would determine that this refers to the specific person with that name (one of the presidents of the United States) and not any other person or concept with the same name. [...] Entity linking can be useful for a variety of natural language processing tasks, such as information extraction, question answering, and text summarization. It helps to provide context and background information about the entities mentioned in the text, which can facilitate a deeper understanding of the content.

Several techniques can be used for entity linking, ranging from deep learning and supervised learning to unsupervised approaches. They usually have two stages: candidate creation and candidate ranking. In candidate creation, the goal is to narrow down the vast number of entities to a manageable subset (e.g., tens or hundreds); in candidate ranking, the goal is to rank the candidate entities of each mention according to the probability that they match the given mention.

Example of the two steps required for entity linking: candidate creation and candidate ranking

When it comes to open-source implementations at our disposal, there is of course spaCy’s Entity Linker, although it uses supervised learning and thus requires labels, which is not practical when iterating quickly. There is also IBM’s zshot Linker, which implements 5 deep-learning linkers and is zero-shot; still, the underlying models use deep learning, so they can be slower, and they were trained on labels. Finally, we found Microsoft’s spaCy-compatible ANN linker, which uses unsupervised learning: it builds an Approximate Nearest Neighbors (ANN) index computed on the character n-gram TF-IDF representation of all aliases in your KnowledgeBase. This approach was the best fit for our use case. Unfortunately, the project is no longer maintained: the last commit is from two years ago, and the ANN index it used (nmslib) was causing setup errors.
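
To make the idea concrete, here is a small self-contained sketch of the two stages using scikit-learn, with character n-gram TF-IDF for candidate creation and cosine similarity for ranking; it uses brute-force similarity instead of a real ANN index and is an illustration of the technique, not the library’s actual code:

# Sketch: candidate creation via character n-gram TF-IDF similarity between
# a mention and the aliases in the knowledge base. Brute-force cosine
# similarity stands in for a real ANN index, for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

aliases = ["Ilham Aliyev", "Arzu Aliyeva", "Arzu Ilham Qizi Aliyeva", "Rosamund International Ltd"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
alias_vectors = vectorizer.fit_transform(aliases)

mention = "Mrs Arzu Aliyeva"
mention_vector = vectorizer.transform([mention])

# Rank aliases by similarity to the mention; the top-k become candidates.
scores = cosine_similarity(mention_vector, alias_vectors)[0]
candidates = sorted(zip(aliases, scores), key=lambda pair: pair[1], reverse=True)[:3]
print(candidates)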

Inspired by microsoft/spacy-ann-linker, we therefore wrote our own ANN entity linking library, louisguitton/spacy-lancedb-linker, swapping nmslib for LanceDB, an actively maintained ANN index. The result is a simple API that we can use to run unsupervised entity linking in spaCy:

📁 entity_linking.py
from typing import Iterator

import srsly
from spacy.language import Language
from spacy.tokens import Doc, DocBin
from spacy_lancedb_linker.kb import AnnKnowledgeBase
from spacy_lancedb_linker.linker import AnnLinker  # noqa: F401  (import needed so spaCy can find the "ann_linker" component)
from spacy_lancedb_linker.types import Alias, Entity


def entity_linking(nlp: Language, spacy_dataset: DocBin) -> Iterator[Doc]:
    # Load the two artifacts generated from the filtered ERKG.
    entities = [Entity(**entity) for entity in srsly.read_jsonl("data/icij-example/entities.jsonl")]
    aliases = [Alias(**alias) for alias in srsly.read_jsonl("data/icij-example/aliases.jsonl")]

    # Build the LanceDB-backed ANN knowledge base.
    ann_kb = AnnKnowledgeBase(uri="data/sample-lancedb")
    ann_kb.add_entities(entities)
    ann_kb.add_aliases(aliases)

    # Add the linker as the last pipeline component and point it at the KB.
    ann_linker = nlp.add_pipe("ann_linker", last=True)
    ann_linker.set_kb(ann_kb)

    # Run the full pipeline over the docs stored in the DocBin.
    docs = spacy_dataset.get_docs(nlp.vocab)
    return nlp.pipe(docs)
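
As a usage sketch (the DocBin path is hypothetical, and exposing the linked knowledge-base identifier on ent.kb_id_ is an assumption based on spaCy conventions):

# Usage sketch: run the pipeline above on a pre-processed corpus and inspect
# the links. The DocBin path is hypothetical; reading the linked identifier
# from ent.kb_id_ is an assumption based on spaCy conventions.
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")  # provides the NER mentions to link
spacy_dataset = DocBin().from_disk("data/icij-example/articles.spacy")

for doc in entity_linking(nlp, spacy_dataset):
    for ent in doc.ents:
        print(ent.text, ent.label_, ent.kb_id_)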

Combining all the pieces🔗

To recap: we start from Senzing's ERKG for ICIJ, we filter it using the lead to follow in suspicious.txt, we generate the two artifacts that we need for spacy-lancedb-linker, and we can now put together an Entity Linking pipeline. Let’s have a look at the output of the Entity Linking on an ICIJ web article about the Azeri presidential family:

ERKG-powered Entity Linking of an ICIJ article on the Azeri presidential family

The Entity Linking here can be used for information extraction, or to provide context and background information about the entities mentioned in the text. We can also use the following simple heuristic: if an entity does not link to anything in the KB but is central to the article, it might be worth investigating next.

To implement this, we show in the tutorial how to use DerwenAI/pytextrank to rank entities and filter for those that are not linked (a sketch follows below). This can form the basis of a human-in-the-loop system where the investigative journalist updates the KB or decides which leads to follow next. In the case of this article, we see that Londex Resources S.A. seems to be mentioned 2 times and is ranked 19th among the most important entities in the article. We can then explore the ICIJ Offshore Leaks dataset to see whether that entity is known and linked to others, and if not, decide to investigate it further.

Table of entities up for review by the investigative journalist
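
As a minimal sketch of that ranking-and-filtering step with pytextrank (the article path is hypothetical, and the kb_id_ check is an assumption about how unlinked mentions are marked):

# Sketch: rank phrases with TextRank and surface those that did not link to
# any entity in the KB. The kb_id_ check is an assumption about how the
# linker marks unlinked mentions; the tutorial's implementation differs.
import spacy
import pytextrank  # noqa: F401  (registers the "textrank" pipeline factory)

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp(open("data/icij-example/article.txt").read())  # hypothetical article path

linked_texts = {ent.text for ent in doc.ents if ent.kb_id_}
for rank, phrase in enumerate(doc._.phrases, start=1):
    if phrase.text not in linked_texts:
        print(rank, phrase.text, phrase.count, round(phrase.rank, 4))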

We hope this blog post was useful in demonstrating a technical approach that combines Entity Resolution (performed with Senzing) and Entity Linking (performed in spaCy). We showed how this can be used to construct a domain-specific Knowledge Graph, in particular around the Azerbaijani presidential family, and we showed how to analyze a corpus of articles with this pipeline and come up with new leads.

If you’re curious about this approach, check out the reference tutorial at erkg-tutorials and the unsupervised entity linking library we’ve open-sourced spacy-lancedb-linker.

References🔗

  1. https://en.wikipedia.org/wiki/Panama_Papers
  2. https://en.wikipedia.org/wiki/Pandora_Papers
  3. https://en.wikipedia.org/wiki/Offshore_Leaks
  4. https://www.icij.org/about/
  5. https://offshoreleaks.icij.org/pages/database
  6. https://offshoreleaks.icij.org/nodes/78392
  7. https://neo4j.com/blog/analyzing-panama-papers-neo4j/
  8. https://source.opennews.org/articles/people-and-tech-behind-panama-papers/
  9. https://www.theguardian.com/news/2016/apr/03/what-you-need-to-know-about-the-panama-papers
  10. https://senzing.com/about/
  11. https://github.com/Senzing/mapper-icij
  12. https://senzing.com/entity-resolved-knowledge-graphs/
  13. https://storage.googleapis.com/erkg/icij/ICIJ-entity-report-2024-06-21_12-04-57-std.json.zip
  14. https://github.com/louisguitton/spacy-lancedb-linker
  15. https://dagster.io/
  16. https://github.com/louisguitton/erkg-tutorials
  17. https://www.kaseware.com/case-management
  18. Entity Linking and Discovery via Arborescence-based Supervised Clustering https://arxiv.org/pdf/2109.01242
  19. https://arxiv.org/pdf/2307.13497
  20. Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking https://arxiv.org/pdf/2302.07189
  21. Low-Rank Subspaces for Unsupervised Entity Linking https://arxiv.org/pdf/2104.08737
  22. https://spacy.io/usage/linguistic-features#entity-linking
  23. https://ibm.github.io/zshot/#linker
  24. https://microsoft.github.io/spacy-ann-linker/
  25. https://github.com/nmslib/nmslib
  26. https://github.com/lancedb/lancedb
  27. https://github.com/DerwenAI/pytextrank