Best practices for entity disambiguation using public records
There are many companies with the same name, and many more people. How do you know you’re looking at the right one?
Entity disambiguation and entity resolution — that is, the process of determining whether multiple occurrences of the same name refer to a single real-world entity and, if so, resolving them together — can be complex. If you are conducting research across multiple languages or in a data-poor environment, it becomes even harder.
The strongest disambiguations rely on a unique identifier. Because a unique identifier is issued to only one entity at a time, a match on one lets you disambiguate or resolve entities with confidence.
Examples of unique identifiers:
- Company names and Unified Social Credit Codes in China
- National ID Numbers in Iran for companies and individuals
- Mexico’s National ID Number (Clave Única de Registro de Población, CURP) for individuals
Beware of identifiers that may seem to be unique but in fact are not. These should not be used as the sole basis for disambiguation. For example:
- Registration numbers in Iran for companies
- Reason for Registration Code in Russia for companies
- Addresses and telephone numbers
- Otherwise unique identifiers that may have been recycled after a company’s closure or person’s death.
But what about when unique identifiers are scarce? Here are a few best practices for using the available data for more complex disambiguations.
1. Combinations of identifiers
When no single identifier is unique, a combination of identifiers (e.g. full name + date of birth + citizenship) can narrow a match considerably.
For example, dates of birth are not unique – but if two individuals by the same name have different dates of birth, you can state with increased confidence that those are different individuals, particularly if they also have distinct citizenships, addresses, or other identifiers.
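The comparison described above can be sketched in a few lines of Python. This is only an illustration: the `PersonRecord` fields, the `assess` labels, and the rule that any conflicting field points to distinct individuals are assumptions made for the example, not a prescribed method, and the sample people are invented.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical set of identifiers to compare; none is unique on its own.
FIELDS = ("full_name", "date_of_birth", "citizenship")


@dataclass(frozen=True)
class PersonRecord:
    full_name: str
    date_of_birth: Optional[str] = None  # ISO date string, None if unknown
    citizenship: Optional[str] = None    # country code, None if unknown


def _norm(value: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(value.casefold().split())


def compare_records(a: PersonRecord, b: PersonRecord) -> Tuple[List[str], List[str]]:
    """Return (agreements, conflicts) over the fields both records populate."""
    agreements, conflicts = [], []
    for field in FIELDS:
        va, vb = getattr(a, field), getattr(b, field)
        if va is None or vb is None:
            continue  # a missing value is neither evidence for nor against
        if _norm(va) == _norm(vb):
            agreements.append(field)
        else:
            conflicts.append(field)
    return agreements, conflicts


def assess(a: PersonRecord, b: PersonRecord) -> str:
    """Assumed rule of thumb: any conflict suggests two different people."""
    agreements, conflicts = compare_records(a, b)
    if conflicts:
        return "likely distinct"
    if len(agreements) == len(FIELDS):
        return "likely same"
    return "undetermined"


# Invented sample data for illustration only.
first = PersonRecord("Ali Hassan", "1970-01-01", "LB")
second = PersonRecord("Ali Hassan", "1982-05-12", "LB")
print(assess(first, second))
```

Treating a missing field as "no evidence" rather than as a conflict keeps sparse records from being split apart prematurely; a name-only match stays "undetermined" until more identifiers are found.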
2. Statistical Reference
If you know exactly how many entities by that name exist in that location, you may be able to short-cut the resolution process.
For example: two Lebanese citizens by the same name own two different companies. Sayari Graph provides access to Lebanon’s complete voter roll, which lists the full name and date of birth of every voting citizen. You can reference this source to determine exactly how many individuals by that name are citizens. If there is only one, you can confidently assess that the same person owns both companies. If there are two hundred, or two thousand, the name alone tells you very little and you will need other identifiers. (For an interesting example of this technique in practice, read this.)
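As a rough sketch of the counting step, assume the reference source has been loaded into a plain Python list with one entry per individual. The roster contents, the function names, and the wording of the returned assessments are all hypothetical; the approach only holds if the reference list really is complete for the population in question.

```python
from collections import Counter
from typing import Iterable


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.casefold().split())


def name_frequency(roster: Iterable[str], name: str) -> int:
    """Count how many individuals in the roster carry the given name.

    Assumes each roster entry represents exactly one individual.
    """
    counts = Counter(normalize(entry) for entry in roster)
    return counts[normalize(name)]


def assess_same_person(roster: Iterable[str], name: str) -> str:
    """Turn the count into a cautious assessment (labels are illustrative)."""
    n = name_frequency(roster, name)
    if n == 0:
        return "not found: check spelling, transliteration, or coverage"
    if n == 1:
        return "likely same person"
    return f"ambiguous: {n} individuals share this name"


# Invented roster for illustration only.
roster = ["Ali Hassan", "Rima Khoury", "rima  khoury", "Omar Said"]
print(assess_same_person(roster, "Ali Hassan"))
print(assess_same_person(roster, "Rima Khoury"))
```

A count of zero is reported separately because it usually signals a spelling or transliteration mismatch, or a gap in the source, rather than proof that no such person exists.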
3. Co-occurrence in Relationships
You can also take a graph-based approach to entity disambiguation and resolution. Consider two different companies, Company A and Company B. Each company has only one shareholder; both shareholders have the exact same name but no other identifiers. From this information alone, we can’t be certain whether they are the same person.
But what if Company A had five shareholders, and Company B also had five shareholders by the exact same names? This would make it much more likely that Company A and Company B are owned by the same five individuals. What if the two companies additionally shared an address and were incorporated on the same date? Combinations of identifiers come back into play.
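One simple way to put a number on this intuition is to compare the two shareholder lists as sets. The sketch below reports both the count of shared names and the Jaccard overlap (shared names divided by all distinct names). The function name and sample shareholders are invented for illustration, and set overlap is just one possible measure, not an established scoring method.

```python
from typing import Iterable, Tuple


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.casefold().split())


def shareholder_overlap(a: Iterable[str], b: Iterable[str]) -> Tuple[int, float]:
    """Return (number of shared names, Jaccard overlap) for two shareholder lists."""
    set_a = {normalize(n) for n in a}
    set_b = {normalize(n) for n in b}
    union = set_a | set_b
    if not union:
        return 0, 0.0  # no shareholders listed on either side
    shared = set_a & set_b
    return len(shared), len(shared) / len(union)


# Invented shareholder lists for illustration only.
company_a = ["Ali Hassan", "Rima Khoury", "Omar Said", "Lina Aziz", "Sami Nader"]
company_b = ["Sami Nader", "Lina Aziz", "Omar Said", "Rima Khoury", "Ali Hassan"]

print(shareholder_overlap(company_a, company_b))        # five shared names
print(shareholder_overlap(["Ali Hassan"], ["Ali Hassan"]))  # one shared name
```

Both calls produce a Jaccard overlap of 1.0, which is why the count of shared names is returned alongside it: five co-occurring names are far stronger evidence than a single matching name, even though the ratio is identical.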
If there is any doubt about a disambiguation, it is always best to consider those entities as separate but possibly linked.
