Reflections 002: Democratizing discoverability in large Knowledge Graphs

Anas AIT AOMAR
9 min read · Mar 12, 2023

Note: This is my second article sharing personal reflections on things I do at work or in my free time. None of the ideas here are new, but I treat open writing as a way to push my thinking boundaries and discover new perspectives. This is only possible with your comments and feedback!

First, What is a knowledge graph?

Much ink has been spilled on knowledge graphs and their value. Without going into details, knowledge graphs are a powerful tool for organizing and connecting information. They provide a comprehensive view of data, making it easier to retrieve and analyze information. They are usually born by connecting data from various siloed sources, which makes it possible to build a more complete picture of the world.

For the sake of this article, I will use the ALPHA10X knowledge graph as an example to show the value of this technology and also to illustrate the problem of discoverability.

The ALPHA10X knowledge graph

ALPHA10X is an impact startup building a large knowledge graph that connects organizations, people, capital, and ideas. The main goal is to use this graph as a knowledge engine to power many use cases, such as matching capital holders with companies that meet their investment thesis. This is possible thanks to the 360° view the knowledge graph gives of organizations (funding history, employees, linked academic works, patents …).

[Figure: The ALPHA10X knowledge graph]

The discoverability problem in large knowledge graphs

This rich context, coming from the entities' interconnections combined with heavy processing of the textual corpora linked to our main entities (information extraction), makes it easy to surface relevant targets that answer the graph consumers' needs. Think of it as a tagging system: the more specific the tags, the easier the search.
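To make the tagging intuition concrete, here is a toy sketch in Python. The entities, tags, and search function are made up for illustration and are not our actual pipeline: an entity with specific concept tags surfaces easily for a specific query, while an entity with only generic tags rarely matches.

```python
# Toy illustration with made-up entities and tags (not the ALPHA10X pipeline).
entities = {
    "org_a": {"machine learning", "drug discovery", "protein folding"},
    "org_b": {"software"},  # sparse context -> only one generic tag
}

def search(query_tags, entities):
    """Rank entities by how many query concepts their tags match."""
    scores = {name: len(tags & query_tags) for name, tags in entities.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search({"drug discovery", "protein folding"}, entities))
# org_a surfaces easily; org_b never matches a specific query.
```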

Graph consumers usually do not know the names of people or organizations, nor their works (patents, scientific papers), but they are familiar with concepts and technologies. Our goal as graph builders is to enrich our core entities as much as possible to ease findability.

As a curious reader, you probably already see the problem here. The discoverability of an entity correlates with its context: the more links it has (more specifically, links to quality entities with textual information), the more we can enrich it, and the easier it is to find.

I call this the discoverability circle, whose size correlates with an entity's context.

Yet, not all Entities are created equal!

Not all entities have the same volume of context. For example, Microsoft (a 47-year-old company) has thousands of patents, a well-known portfolio of products, and much more. On the other hand, there are early-stage startups that just launched and only have a landing page describing their product.

You could stop me here and argue that this discoverability problem is just the consequence of an ill-defined business need. If our goal is small companies, why waste time and resources merging data sources that are biased toward well-established companies and give little context about small ones? Why not procure data sources better suited to our needs, and if they don't exist, build systems that get us the data we need (scraping companies' websites, for example)?

I'm happy you asked! In fact, after I joined ALPHA10X as a data scientist, it took me months to find a proper answer that justified what we are doing.

First of all, even if we decide to focus on a specific type of organization (early-stage companies, for example), we would be treating our entities as independent and losing much of the value that motivates using knowledge graphs to represent data. Let me give an example: companies are created by people who one day had an idea about a problem they wanted to solve. These people didn't just wake up and decide to become founders; they have a history. They worked at other companies in the same industry or spent time as researchers. By focusing only on the present, we lose a large proportion of the context.

Another point is scale. No single data provider can offer a data source that tracks small companies at large scale. Imagine the technical complexity: first you need to become aware of a company's creation, which usually surfaces as a LinkedIn page or a website detailing its vision and product; then you need to scrape and analyze every textual corpus linked to these companies. Even if we solved the technical part, imagine doing this for millions of companies and keeping your data source updated monthly, or even daily, given today's dynamics!

Yeah, but still, aren't there data sources with rich context about millions of companies or people? LinkedIn, for example, when it comes to organizations and people? Again, even if we suppose these data sources are open (they are not!), they still give sparse context about companies. If we focus on small or newly created companies, all you can get is a description of what they do and some of their employees' information (experience, skills …). Another point is that these data sources are crowdsourced; in other words, people create the knowledge. It takes time and a strong network effect to build a knowledge base that is rich in context!

[Figure: Not all Entities are created equal]

Solving the discoverability problem one ‘greedy’ step at a time

Okay, let's summarize what we have learned so far:

  • Knowledge graphs are good tools to represent data and its context.
  • When searching for entities, not all of them are equal or have the same level of context. This makes it hard to surface small entities and biases the search toward well-known ones, simply because we have more data about them.
  • With limited resources, we can't just build the base of our knowledge graph from data sources that target one specific type of entity. In the end, the value of knowledge graphs lies in their connectivity and in how they link the past with the present!
  • Solving discoverability bias is doable. Yet at scale, it is costly!

Now that we are on the same page, I will share some reflections on how we can tackle this discoverability bias in an iterative and efficient way.

At a high level, the plan is simple. We know we can't enrich all entities in one shot given the scale, but we can enrich a batch at a time within the limits of our resources. That is why we should be smart about choosing which batch to prioritize.

To do so, we need a quantity to guide us when choosing which entities to prioritize!

The discoverability score

We have talked about how discoverability correlates with an entity's context (its own textual corpora and its links to other entities with textual corpora), from which we can extract the concepts used to surface entities. Given this correlation, we can quantify discoverability and turn it into a general metric that helps us rank entities. Moreover, measuring an entity's discoverability score cannot be done using the entity alone; its connections to others matter too. For example, imagine we want to measure Microsoft's discoverability. If we take it as a standalone entity, we only have its description and portfolio of products. But what about its patents and its employees' scientific work? These provide indirect context that helps surface Microsoft from different perspectives.
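As a minimal sketch of such a metric (the text_len attribute, the 2-hop cutoff, and the 0.5 decay are my own illustrative assumptions, not ALPHA10X's formula), one could score an entity by its own textual context plus a discounted share of its neighbors' context:

```python
import networkx as nx

G = nx.Graph()
G.add_node("microsoft", text_len=1200)   # rich own context
G.add_node("patent_x", text_len=800)     # linked work with text
G.add_node("startup_y", text_len=60)     # only a short landing page
G.add_edge("microsoft", "patent_x")

def discoverability(G, node, hops=2, decay=0.5):
    """Own context plus a decayed share of the context reachable within `hops`."""
    score = G.nodes[node].get("text_len", 0)
    for other, dist in nx.single_source_shortest_path_length(G, node, cutoff=hops).items():
        if dist > 0:
            score += (decay ** dist) * G.nodes[other].get("text_len", 0)
    return score

for n in G.nodes:
    print(n, discoverability(G, n))
# microsoft and patent_x reinforce each other; startup_y stays low.
```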

That is why discoverability as a metric should live in a metadata layer that mirrors the data layer. This makes discoverability dynamic: it starts at an initial value and changes by exchanging with neighboring entities. In other words, whenever you add more context to your entities, whether by integrating new data sources or through targeted enrichment (scraping companies' web pages, for example), the discoverability score is updated not only for the enriched entities but also for the ones linked to them (at 1 hop, 2 hops …).
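Here is one way such a dynamic update could look. This is an assumed scheme for illustration, not our exact implementation: when an entity gains context, its score increases, and a discounted share of that increase propagates to its 1-hop and 2-hop neighbors.

```python
import networkx as nx

def enrich(G, node, delta, hops=2, decay=0.5):
    """Raise `node`'s score by `delta` and propagate a decayed share to its neighborhood."""
    for other, dist in nx.single_source_shortest_path_length(G, node, cutoff=hops).items():
        G.nodes[other]["discoverability"] = (
            G.nodes[other].get("discoverability", 0.0) + (decay ** dist) * delta
        )

G = nx.path_graph(["startup_y", "founder_z", "university_w"])
enrich(G, "startup_y", delta=500)   # e.g. after scraping startup_y's website
print(nx.get_node_attributes(G, "discoverability"))
# {'startup_y': 500.0, 'founder_z': 250.0, 'university_w': 125.0}
```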

[Figure: Discoverability score]

Okay, enough abstraction. In the end, discoverability is a score/rank to guide your enrichment strategy when resources are limited.

Let's put it to work and see how it helps as a score, and also what value it brings as part of the metadata layer that mirrors our graph structure.

A greedy enrichment strategy using the discoverability score

For the sake of a real-life example, let's go back to the ALPHA10X knowledge graph. We assume we have put this discoverability score in place and that each entity has its own score.

If you recall, we had the hypothesis that early-stage organizations (young companies) have low discoverability scores because they are new and there is latency in tracking their related works. Now it is easy to check: a simple statistical test comparing a sample of early-stage companies with the general population will show it.
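A hedged sketch of that check (the score arrays below are synthetic placeholders standing in for scores pulled from the metadata layer): a one-sided Mann-Whitney U test asking whether early-stage companies' scores are systematically lower than the population's.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic placeholder scores; in practice these come from the metadata layer.
early_stage_scores = rng.lognormal(mean=2.0, sigma=1.0, size=500)
population_scores = rng.lognormal(mean=4.0, sigma=1.0, size=5000)

# One-sided test: are early-stage scores stochastically smaller?
stat, p_value = mannwhitneyu(early_stage_scores, population_scores, alternative="less")
print(f"p-value = {p_value:.3g}")  # a tiny p-value supports the hypothesis
```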

Okay, now that we have a way to test our hypothesis, we decided to enrich these entities. We did some research and understood that the best way to get more context about these low-context organizations is to use their websites as a data source and integrate them into our graph.

We put in place an enrichment process that takes an organization's website URL and outputs an ego graph centered around the organization (related concepts, products, and solutions offered …).
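A rough sketch of what that enrichment step could look like; fetch_text and extract_concepts are hypothetical stand-ins for the real scraping and concept-extraction components, stubbed here so the example runs.

```python
import networkx as nx

def fetch_text(url):
    # Hypothetical stand-in for a real scraper; returns stub page text here.
    return "We build battery recycling solutions for electric vehicles."

def extract_concepts(text):
    # Hypothetical stand-in for a concept extractor (e.g. entity linking);
    # here a naive keyword lookup for illustration only.
    vocabulary = ["battery recycling", "electric vehicles"]
    return [c for c in vocabulary if c in text.lower()]

def build_ego_graph(org_id, website_url):
    """Build a small graph centered on the organization, linked to the concepts found on its website."""
    ego = nx.Graph()
    ego.add_node(org_id, kind="organization", source=website_url)
    for concept in extract_concepts(fetch_text(website_url)):
        ego.add_node(concept, kind="concept")
        ego.add_edge(org_id, concept, relation="mentions")
    return ego

ego = build_ego_graph("startup_y", "https://example.com")
print(list(ego.edges()))
# [('startup_y', 'battery recycling'), ('startup_y', 'electric vehicles')]
```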

We measured the unit cost of this enrichment process, so we can estimate how many entities we can enrich each time given the allocated resources.

Now we can use the discoverability score as an inverse ranking to decide which entities to enrich first.
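A minimal sketch of this greedy selection, with made-up numbers: the measured unit cost turns the resource budget into a quota, and entities are picked in ascending order of their discoverability score.

```python
budget, unit_cost = 10.0, 5.0
quota = int(budget // unit_cost)          # how many entities we can afford -> 2

# Made-up scores pulled from the metadata layer.
scores = {"org_a": 120.0, "org_b": 8.0, "org_c": 45.0, "org_d": 3.0}

to_enrich = sorted(scores, key=scores.get)[:quota]  # lowest scores first
print(to_enrich)  # ['org_d', 'org_b']
```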

Okay, but so far we haven't seen why we should invest in a metadata layer that mirrors our data graph's structure.

Good question!

With limited resources, we are talking about a quota of entities that we can enrich. So far, we have used the discoverability score to decide which entities to enrich first. But is that always a good strategy? Is it efficient? Does it guarantee uniform enrichment?

Good question”s”!

First, it is a good starting point to assume that the best way to reduce discoverability bias is to target entities with low context. But wait! We are dealing with a large knowledge graph where our core entities (organizations) can be partitioned into multiple sets, for example by industry. Using the score alone, we might keep enriching entities within the same industry simply because it contains so many of them: that industry may hold rich organizations, yet its pool of low-context ones can still be larger than other industries'.

Now you can see why the discoverability score is inefficient when used as a standalone number. With a metadata layer that mirrors our graph structure, we have access to other entities' scores via the graph edges. Entities can thus exchange and smooth their scores, so the result reflects not only a node's own need for enrichment but also its community's need, which lets us perform unbiased sampling when scheduling a batch enrichment event.
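Here is a sketch of those two adjustments under assumed formulas (a simple neighbor average for smoothing, and an even per-industry split of the quota), not ALPHA10X's actual scheme:

```python
import networkx as nx

def smooth_scores(G, attr="discoverability", alpha=0.5):
    """Blend each node's score with the average of its neighbors' scores."""
    smoothed = {}
    for n in G.nodes:
        own = G.nodes[n].get(attr, 0.0)
        neighbors = list(G.neighbors(n))
        if neighbors:
            avg = sum(G.nodes[m].get(attr, 0.0) for m in neighbors) / len(neighbors)
            smoothed[n] = alpha * own + (1 - alpha) * avg
        else:
            smoothed[n] = own
    return smoothed

def stratified_batch(scores, industry_of, quota):
    """Split the quota across industries, picking the lowest scores in each."""
    by_industry = {}
    for n, s in scores.items():
        by_industry.setdefault(industry_of[n], []).append((s, n))
    per_industry = max(1, quota // max(1, len(by_industry)))
    batch = []
    for members in by_industry.values():
        batch += [n for _, n in sorted(members)[:per_industry]]
    return batch[:quota]

G = nx.path_graph(["a", "b", "c"])
nx.set_node_attributes(G, {"a": 10.0, "b": 0.0, "c": 50.0}, "discoverability")
scores = smooth_scores(G)
print(stratified_batch(scores, {"a": "biotech", "b": "biotech", "c": "energy"}, quota=2))
# ['a', 'c']: one low-score pick per industry instead of two from the same one.
```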

Making the discoverability score part of the graph makes it dynamic and aware of the "glocal" context. When choosing a node based on its score, we are not looking at its individual need but at an aggregated vote. To put it simply, if we automate this, we have built a graph that is aware of its own richness: a graph that knows what to prioritize when enrichment credits are offered, a graph that aims to increase its nodes' context while at the same time optimizing to reduce bias when enriching!

THE END.


Anas AIT AOMAR

A curious engineering student, obsessed with learning new stuff every day.