Understanding knowledge graphs: A key to effective data governance

The rapid growth of data in today’s digital world has made data governance an increasingly challenging task. According to McKinsey, even the world’s leading firms can waste 5-10% of employee time on non-value-added tasks due to poor data governance, a figure that averages around 29% across enterprises. In response to these challenges, knowledge graphs have emerged as a powerful solution. A knowledge graph serves as a robust framework for integrating data from diverse sources, creating a unified, structured, and interconnected representation of information. This comprehensive view enables organizations to effectively manage their data and derive meaningful insights.

Knowledge graphs offer a wide range of applications, extending well beyond data governance, and are instrumental in unlocking numerous use cases across various domains. For instance, in the healthcare industry, knowledge graphs can enhance clinical decision-making by connecting patient data, medical literature, and treatment guidelines. In the e-commerce sector, knowledge graphs enable personalized recommendations by understanding customer preferences and product relationships.

Moreover, applications of knowledge graphs extend to facilitating advanced search capabilities, natural language processing, and intelligent question-answering systems. They empower businesses to uncover hidden connections, perform complex analyses, and drive innovation through data-driven decision-making. As organizations strive to navigate the data deluge and extract value from their information assets, knowledge graphs present a promising approach to streamline data governance, gain deeper insights, and unlock the full potential of their data resources.

This blog will explain what a knowledge graph is and delve deeper into the components, technologies, applications, and examples of knowledge graphs. We will explore how knowledge graphs have evolved and their impact on various industries. By understanding the significance of knowledge graphs, we can unlock the full potential of data and knowledge in the digital era.

An overview of knowledge graphs

Knowledge is the foundation of human understanding and progress. As our world continues its digital transformation and becomes more interconnected, the amount and intricacy of information have experienced an exponential surge. The need to organize, connect, and make sense of data has become paramount in this information-rich landscape. This is where knowledge graphs come into play.

What is a knowledge graph?

A knowledge graph is a structured representation of knowledge that captures relationships between different entities. It goes beyond traditional databases by not only storing data but also organizing it within a semantic framework. Information is expressed as interconnected nodes and edges in a knowledge graph, forming a graph-like structure. This graph enables the representation of real-world concepts, their attributes, and the relationships between them.


Importance of knowledge graphs

Knowledge graphs have gained immense importance in today’s information age due to their ability to provide context and meaning to data. By organizing information in a graph structure, knowledge graphs allow for a more holistic and interconnected understanding of knowledge. They enable us to uncover hidden patterns, infer new relationships, and derive insights that would otherwise remain hidden in isolated data silos. Moreover, knowledge graphs facilitate data integration from diverse sources, creating comprehensive knowledge bases. This integration enables better data discoverability, enhances data quality and consistency, and supports interoperability across various domains and applications.

Evolution and development of knowledge graphs

The concept of knowledge graphs has evolved over time, building upon earlier approaches to knowledge representation and information organization. Early roots can be traced back to symbolic AI and knowledge-based systems, where attempts were made to capture human expertise in rule-based systems. Semantic networks and frames emerged as graphical representations of knowledge, capturing relationships between concepts.

The onset of the World Wide Web led to the vision of the Semantic Web, as proposed by Tim Berners-Lee. The Semantic Web aimed to enrich web content with machine-readable data, enabling intelligent information retrieval and reasoning. Key technologies such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL) were developed to support the representation and interlinking of data on the web.

The advancement of graph theory and graph databases further contributed to developing knowledge graphs. Graph databases provided efficient storage and querying capabilities for graph-structured data, enabling the realization of large-scale knowledge graphs. The emergence of industry initiatives such as Google Knowledge Graph, Facebook’s Social Graph, and Microsoft’s Academic Knowledge Graph showcased the practical applications and potential impact of knowledge graphs across different domains.

As the field of machine learning progressed, knowledge graphs began to intersect with AI techniques. Embedding techniques, knowledge graph completion methods, and graph neural networks opened up new avenues for leveraging knowledge graphs in machine learning applications.


How do knowledge graphs work?

Knowledge graphs are a powerful tool for organizing and representing structured and connected knowledge. They are designed to capture and represent information in a machine-readable way and can be used for various applications such as search engines, question-answering systems, recommendation systems, and more. Here is an overview of how knowledge graphs work:

Data integration: Knowledge graphs start with integrating data from various sources, including structured databases, unstructured text, and even real-time feeds. The data is processed and transformed into a standardized format to ensure consistency and interoperability.

Entity extraction: The next step involves identifying entities (or objects) within the data. Entities can be people, places, organizations, events, concepts, or any other type of information. For example, in a news article, entities could be the names of people mentioned, the locations discussed, or the topics covered.

Relationship extraction: Once the entities are identified, their relationships are extracted. Relationships describe how the entities are connected or related to each other. For instance, in the case of a person, relationships can include their affiliations, employment history, or social connections.

Structuring: The extracted entities and relationships are then structured into a graph format, where entities are denoted as nodes, and relationships are represented as edges connecting the nodes. This graph structure allows for a flexible and efficient representation of complex knowledge.

Knowledge representation: Knowledge graphs use a standardized language or ontology to represent the semantics of the entities and relationships. Ontologies define the types of entities and relationships and their properties, enabling a common understanding of the knowledge across different applications.

Querying and reasoning: With the knowledge graph in place, users can query it using a query language like SPARQL or GraphQL. These queries can retrieve specific information, perform complex searches, or ask questions about the data. Knowledge graphs can also support reasoning capabilities to infer new knowledge or make deductions based on existing information (see the query sketch after this list).

Continuous updating: Knowledge graphs are dynamic structures that can evolve. New data can be integrated into the graph, and existing information can be updated or modified as needed. This allows knowledge graphs to stay current and reflect the latest information available.
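
To make the querying step concrete, here is a minimal sketch in Python using the rdflib library (chosen here purely for illustration; any triple store with a SPARQL endpoint would work similarly). It builds a tiny graph of triples and runs a SPARQL query against it; all names and URIs are hypothetical.

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")  # hypothetical namespace for this sketch

g = Graph()
# Entities and relationships are expressed as subject-predicate-object triples
g.add((EX.Napoleon, RDF.type, EX.Person))
g.add((EX.Napoleon, EX.participatedIn, EX.FrenchRevolution))
g.add((EX.Napoleon, EX.dateOfBirth, Literal("1769-08-15")))

# SPARQL query: which events did Napoleon participate in?
query = """
SELECT ?event WHERE {
    <http://example.org/Napoleon> <http://example.org/participatedIn> ?event .
}
"""
for row in g.query(query):
    print(row.event)  # -> http://example.org/FrenchRevolution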

Structure and components of a knowledge graph

Knowledge graphs are structured representations of knowledge that capture the relationships and semantics between entities. In this section, we will delve into the key components that make up the structure of a knowledge graph and how they contribute to organizing and representing knowledge.

Nodes and entities

Nodes are the fundamental building blocks of a knowledge graph. They represent entities, such as people, places, concepts, objects, or any other meaningful entity within a specific domain. Each node in the knowledge graph corresponds to a unique entity and is assigned a unique identifier.

Entities in a knowledge graph can have various types and can be organized into hierarchies or categories. For example, in a knowledge graph about movies, entities can include movies, actors, directors, genres, and production companies. Each entity type typically has its own set of properties and relationships.

Edges and relationships

Edges represent the relationships between entities in a knowledge graph. They connect nodes and define the semantic associations between them. Relationships capture the meaningful connections, dependencies, or interactions between entities.

For instance, in a movie knowledge graph, relationships can include “acted in,” “directed by,” “belongs to the genre,” or “produced by.” These relationships establish connections between movies, actors, directors, genres, and production companies, forming a web of interconnected knowledge. Edges in a knowledge graph can be undirected or directed, depending on the nature of the relationship. Directed edges indicate the direction of the relationship, while undirected edges imply a bidirectional or symmetric relationship.

Properties and attributes

Properties are used to describe and provide additional information about entities in a knowledge graph. They represent the characteristics or attributes of an entity. Properties can be simple, such as a movie’s release year or an actor’s birthdate, or they can be more complex, representing structured information like addresses or biographical details.

Attributes associated with entities are stored as key-value pairs. Each property has a name or label and a corresponding value that provides specific information about the entity. Properties enable the knowledge graph to capture rich and detailed information about entities, allowing for more comprehensive analysis and retrieval of knowledge.
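
As a minimal illustration, the movie examples above could be sketched in Python with the networkx library (used here only for demonstration; a production knowledge graph would typically live in a dedicated graph database). Nodes carry key-value properties, and edges carry a relation type:

import networkx as nx

g = nx.MultiDiGraph()  # directed graph that allows multiple relationships between the same nodes

# Nodes (entities) with properties stored as key-value pairs
g.add_node("Inception", entity_type="Movie", release_year=2010)
g.add_node("Leonardo DiCaprio", entity_type="Actor", birthdate="1974-11-11")
g.add_node("Christopher Nolan", entity_type="Director")

# Edges (relationships) labeled with a relation type
g.add_edge("Leonardo DiCaprio", "Inception", relation="acted in")
g.add_edge("Christopher Nolan", "Inception", relation="directed")

print(g.nodes["Inception"])      # {'entity_type': 'Movie', 'release_year': 2010}
print(list(g.edges(data=True)))  # [('Leonardo DiCaprio', 'Inception', {'relation': 'acted in'}), ...]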

Metadata and labels

Metadata and labels play a vital role in knowledge graphs by providing additional contextual information about nodes, edges, and properties. Metadata can include timestamps, data sources, confidence scores, or any other relevant information that helps qualify or annotate the knowledge.

Labels, on the other hand, provide human-readable names or descriptions for nodes, edges, and properties. They serve as meaningful identifiers that make the knowledge graph more understandable and accessible to both humans and machines. Metadata and labels enhance the usability and interpretability of the knowledge graph, facilitating efficient searching, browsing, and interpretation of the underlying knowledge.

These components work together to organize and represent knowledge in a structured and interconnected manner, enabling powerful reasoning, analysis, and retrieval of information within the knowledge graph.

The role of knowledge graphs in augmenting AI and machine learning

AI and machine learning significantly impact various industries, such as healthcare, supply chains, and financial services. Knowledge graphs play a vital role in augmenting machine learning models by providing contextual information, increasing predictive accuracy, and facilitating data lineage tracking.
Machine learning heavily relies on data, and knowledge graphs enhance the process by capturing and persisting contextual information. This contextualization improves AI systems’ reliability, robustness, explainability, and trustworthiness at every stage of the machine-learning process.

In the data sourcing phase, knowledge graphs enable data lineage tracking, ensuring the integrity and trustworthiness of the data used for machine learning. Furthermore, knowledge graphs serve as an audit trail for compliance, particularly in regulated industries, supporting master data management.
Knowledge graphs add crucial context to machine learning models during the training phase, leading to better predictions and broader applicability. By leveraging the power of relationships and connections in the graph, knowledge graphs enhance feature engineering, maximizing the predictive power of models.

Once a machine learning model is developed, knowledge graphs provide the means for graph investigations and counterfactual analysis. Domain experts can explore similar communities, analyze hierarchies, and investigate dependencies, facilitating model evaluation and debugging. These insights gained from knowledge graphs pave the way for graph-native learning, taking the integration of machine learning and knowledge graphs to the next level.


Organizations can learn generalized, predictive features directly from the graph by performing machine learning tasks within the graph. This approach eliminates the need for prior knowledge of the most important features and optimizes the representation of connected data for machine learning models.
Furthermore, knowledge graphs have emerged as a valuable tool in machine learning and artificial intelligence applications. By combining the expressive power of graphs with machine learning algorithms, knowledge graphs enable machines to reason, infer, and learn from structured data.
Graph embedding techniques capture the structural and semantic information present in the knowledge graph and transform it into low-dimensional vector representations. These embeddings preserve the relationships and proximity between entities in the graph, enabling the application of machine learning algorithms.
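
As a rough sketch of this idea, the snippet below implements a simplified TransE-style scorer in PyTorch (illustrative only; it omits training, negative sampling, and evaluation). The intuition is that for a plausible triplet, the head embedding plus the relation embedding should land close to the tail embedding:

import torch
import torch.nn as nn

class TransEScorer(nn.Module):
    """Toy TransE-style scorer: head + relation should be close to tail."""
    def __init__(self, num_entities, num_relations, dim=64):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)
        self.relation_emb = nn.Embedding(num_relations, dim)

    def forward(self, head_ids, relation_ids, tail_ids):
        h = self.entity_emb(head_ids)
        r = self.relation_emb(relation_ids)
        t = self.entity_emb(tail_ids)
        # Smaller distance = more plausible triplet; training would minimize it for true triplets
        return torch.norm(h + r - t, p=1, dim=-1)

scorer = TransEScorer(num_entities=1000, num_relations=50)
score = scorer(torch.tensor([0]), torch.tensor([3]), torch.tensor([42]))
print(score)  # random before training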

Graph-based machine learning models leverage graph embeddings to perform various tasks such as link prediction, entity classification, and recommendation. Knowledge graphs augment these models by incorporating additional features or enriching feature representations, improving interpretability and explainability. Technologies commonly used in machine learning with knowledge graphs include graph neural networks (GNNs), which specialize in learning from graph-structured data and incorporating the relational information present in knowledge graphs. ML frameworks like TensorFlow and PyTorch support the integration of knowledge graphs into pipelines, enabling efficient data manipulation, preprocessing, and training.

Graph database systems equipped with machine learning capabilities, such as Neo4j’s Graph Data Science library, play a crucial role in knowledge graph-based applications. These systems enable efficient graph processing, meaningful insights extraction, and direct training of machine learning models on knowledge graphs. The integration of graph database systems and machine learning empowers scalable applications, from recommendation systems to fraud detection and drug discovery. The convergence of machine learning, knowledge graphs, and graph database systems presents immense potential. Industries are recognizing the power of these integrated technologies, driving demand for solutions that combine machine learning capabilities with the rich information encapsulated in knowledge graphs. Organizations embracing this convergence will unleash unprecedented insights, optimization, and innovation across various sectors.
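
For illustration, a minimal workflow against Neo4j with the Graph Data Science library might look like the sketch below. It assumes a running Neo4j instance with the GDS plugin installed and a hypothetical graph of Person and Movie nodes connected by ACTED_IN relationships; the connection details and property names are placeholders:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholder credentials

with driver.session() as session:
    # Project an in-memory graph for the GDS algorithms
    session.run("CALL gds.graph.project('people_movies', ['Person', 'Movie'], 'ACTED_IN')")

    # Run PageRank over the projected graph and stream the most central nodes
    result = session.run("""
        CALL gds.pageRank.stream('people_movies')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).name AS name, score
        ORDER BY score DESC LIMIT 5
    """)
    for record in result:
        print(record["name"], record["score"])

driver.close()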

How to build a knowledge graph from text?

When building a knowledge graph from text, the process typically involves two main steps:

  1. Identifying entities (Named Entity Recognition or NER): This step involves recognizing specific entities or important pieces of information in the text, such as names of people, organizations, and locations. These identified entities serve as the nodes or key elements in the knowledge graph (see the sketch after this list).
  2. Determining relationships between the entities (Relation Classification or RC): After identifying the entities, the next step is to analyze the text and determine the connections or relationships between these entities. These relationships will form the edges or links in the knowledge graph, depicting how different entities are related to each other in the given text.
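
As a quick illustration of step 1 on its own, here is a minimal NER sketch using spaCy (shown purely as an example; the rest of this walkthrough uses REBEL, which handles both steps jointly):

import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Napoleon Bonaparte rose to prominence during the French Revolution.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Napoleon Bonaparte PERSON
# French Revolution EVENT  (exact labels depend on the model)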

Traditional approaches using multi-step pipelines often propagate errors or are limited to a small set of relation types. Recently, there have been proposals for end-to-end methods that address both tasks simultaneously. This combined task is commonly referred to as Relation Extraction (RE). In this context, a specific end-to-end model called REBEL is employed.


What is REBEL?

REBEL (Relation Extraction By End-to-end Language generation) is an autoregressive seq2seq model based on BART (Bidirectional and AutoRegressive Transformers) designed for end-to-end relation extraction. Relation extraction involves extracting relation triplets from raw text, an important task in information extraction for various applications such as knowledge base population, fact-checking, and other downstream tasks.

The authors constructed a custom dataset for pre-training REBEL using entities and relations extracted from Wikipedia abstracts and Wikidata. To ensure the dataset’s quality, they employed a RoBERTa Natural Language Inference model for filtering. Notably, the model has demonstrated strong performance on various benchmarks for Relation Extraction and Relation Classification tasks.

Implementing the knowledge graph extraction pipeline

Let’s outline the step-by-step approach we will follow, gradually addressing more intricate scenarios:

  • Load the Relation Extraction REBEL model
  • Extract a knowledge base from a short text
  • Extract a knowledge base from a longer document
  • Filter and normalize entities
  • Extract a knowledge base from an article URL
  • Extract a knowledge base from multiple article URLs
  • Visualize the generated knowledge bases

Install and import libraries

First, install the required libraries.

pip install transformers wikipedia newspaper3k GoogleNews pyvis

We need each library for the following:

  • Transformers: Load the REBEL model.
  • Wikipedia: Validate extracted entities by checking whether they have a corresponding Wikipedia page.
  • Newspaper: Parse articles from URLs.
  • GoogleNews: Read the latest Google News articles about a topic.
  • Pyvis: Visualize graphs.

Let’s import all the necessary libraries and classes.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import math
import torch
import wikipedia
from newspaper import Article, ArticleException
from GoogleNews import GoogleNews
import IPython
from pyvis.network import Network

1. Load the Relation Extraction REBEL model

With the help of the transformers library, we can effortlessly load the pre-trained REBEL model and tokenizer using just a few lines of code.

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")

2. Extract a knowledge base from a short text

The next task involves writing a function capable of parsing the strings produced by REBEL and converting them into relation triplets, such as the example triplet “<Fabio, lives in, Italy>”. This function must consider the inclusion of new tokens introduced during the model training process, namely the “<triplet>”, “<subj>”, and “<obj>” tokens.

def extract_relations_from_model_output(text):
    relations = []
    relation, subject, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    text_replaced = text.replace("<s>", "").replace("<pad>", "").replace("</s>", "")
    for token in text_replaced.split():
        if token == "":
            current = 't'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
                relation = ''
            subject = ''
        elif token == "":
            current = 's'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
            object_ = ''
        elif token == "":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        relations.append({
            'head': subject.strip(),
            'type': relation.strip(),
            'tail': object_.strip()
        })
    return relations

The function outputs a list of relations, where each relation is represented as a dictionary with the following keys:

  • head: The subject of the relation (e.g. “Fabio”).
  • type: The relation type (e.g. “lives in”).
  • tail: The object of the relation (e.g. “Italy”).

Next, we will proceed to code implementing a knowledge base class. This KB class consists of a collection of relations and incorporates various methods for handling the addition of new relations to the knowledge base and printing them.

class KB():
    def __init__(self):
        self.relations = []

    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def add_relation(self, r):
        if not self.exists_relation(r):
            self.relations.append(r)

    def print(self):
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")

Lastly, we define a function called from_small_text_to_kb that returns a KB object with relations extracted from a short text. It does the following:

  1. Initialize an empty knowledge base KB object.
  2. Tokenize the input text.
  3. Utilize REBEL to generate relations from the text.
  4. Parse REBEL output and store relation triplets into the knowledge base object.
  5. Return the knowledge base object.
def from_small_text_to_kb(text, verbose=False):
    kb = KB()

    # Tokenize text
    model_inputs = tokenizer(text, max_length=512, padding=True, truncation=True,
                            return_tensors='pt')
    if verbose:
        print(f"Num tokens: {len(model_inputs['input_ids'][0])}")

    # Generate
    gen_kwargs = {
        "max_length": 216,
        "length_penalty": 0,
        "num_beams": 3,
        "num_return_sequences": 3
    }
    generated_tokens = model.generate(
        **model_inputs,
        **gen_kwargs,
    )
    decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)

    # create kb
    for sentence_pred in decoded_preds:
        relations = extract_relations_from_model_output(sentence_pred)
        for r in relations:
            kb.add_relation(r)

    return kb

Let’s try the function with some text about Napoleon Bonaparte from Wikipedia.

text = "Napoleon Bonaparte (born Napoleone di Buonaparte; 15 August 1769 – 5 " \
"May 1821), and later known by his regnal name Napoleon I, was a French military " \
"and political leader who rose to prominence during the French Revolution and led " \
"several successful campaigns during the Revolutionary Wars. He was the de facto " \
"leader of the French Republic as First Consul from 1799 to 1804. As Napoleon I, " \
"he was Emperor of the French from 1804 until 1814 and again in 1815. Napoleon's " \
"political and cultural legacy has endured, and he has been one of the most " \
"celebrated and controversial leaders in world history."

kb = from_small_text_to_kb(text, verbose=True)
kb.print()
# Num tokens: 133
# Relations:
#   {'head': 'Napoleon Bonaparte', 'type': 'date of birth', 'tail': '15 August 1769'}
#   {'head': 'Napoleon Bonaparte', 'type': 'date of death', 'tail': '5 May 1821'}
#   {'head': 'Napoleon Bonaparte', 'type': 'participant in', 'tail': 'French Revolution'}
#   {'head': 'Napoleon Bonaparte', 'type': 'conflict', 'tail': 'Revolutionary Wars'}
#   {'head': 'Revolutionary Wars', 'type': 'part of', 'tail': 'French Revolution'}
#   {'head': 'French Revolution', 'type': 'participant', 'tail': 'Napoleon Bonaparte'}
#   {'head': 'Revolutionary Wars', 'type': 'participant', 'tail': 'Napoleon Bonaparte'}

The model is able to extract several relations, such as Napoleon’s date of birth and date of death and his participation in the French Revolution.


3. Extract a knowledge base from a longer document

Transformer models like REBEL require increasing memory as the input size grows. While REBEL can handle inputs of about 512 tokens (around 380 English words), we often need to extract relations from longer documents. To address this, we can divide a 1000-token input into overlapping spans of 128 tokens and extract relations from each span. Metadata is added to the extracted relations, indicating the span boundaries in the knowledge base. This approach enables efficient extraction and retains contextual information from the input.

Let’s modify the KB methods so that span boundaries are saved as well. The relation dictionary now has the keys:

  • head : The subject of the relation (e.g. “Fabio”).
  • type : The relation type (e.g. “lives in”).
  • tail : The object of the relation (e.g. “Italy”).
  • meta : A dictionary containing meta-information about the relation. This dictionary has a spans key, whose value is the list of span boundaries (e.g. [[0, 128], [119, 247]]) where the relation has been found.
class KB():
    def __init__(self):
        self.relations = []

    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def merge_relations(self, r1):
        r2 = [r for r in self.relations
              if self.are_relations_equal(r1, r)][0]
        spans_to_add = [span for span in r1["meta"]["spans"]
                        if span not in r2["meta"]["spans"]]
        r2["meta"]["spans"] += spans_to_add

    def add_relation(self, r):
        if not self.exists_relation(r):
            self.relations.append(r)
        else:
            self.merge_relations(r)
    def print(self):
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")

Next, we write the from_text_to_kb function, which is similar to the from_small_text_to_kb function but is able to manage longer texts by splitting them into spans. All the new code is about the spanning logic and the management of the spans into the relations.

def from_text_to_kb(text, span_length=128, verbose=False):
    # tokenize whole text
    inputs = tokenizer([text], return_tensors="pt")

    # compute span boundaries
    num_tokens = len(inputs["input_ids"][0])
    if verbose:
        print(f"Input has {num_tokens} tokens")
    num_spans = math.ceil(num_tokens / span_length)
    if verbose:
        print(f"Input has {num_spans} spans")
    overlap = math.ceil((num_spans * span_length - num_tokens) / 
                        max(num_spans - 1, 1))
    spans_boundaries = []
    start = 0
    for i in range(num_spans):
        spans_boundaries.append([start + span_length * i,
                                 start + span_length * (i + 1)])
        start -= overlap
    if verbose:
        print(f"Span boundaries are {spans_boundaries}")

    # transform input with spans
    tensor_ids = [inputs["input_ids"][0][boundary[0]:boundary[1]]
                  for boundary in spans_boundaries]
    tensor_masks = [inputs["attention_mask"][0][boundary[0]:boundary[1]]
                    for boundary in spans_boundaries]
    inputs = {
        "input_ids": torch.stack(tensor_ids),
        "attention_mask": torch.stack(tensor_masks)
    }

    # generate relations
    num_return_sequences = 3
    gen_kwargs = {
        "max_length": 256,
        "length_penalty": 0,
        "num_beams": 3,
        "num_return_sequences": num_return_sequences
    }
    generated_tokens = model.generate(
        **inputs,
        **gen_kwargs,
    )

    # decode relations
    decoded_preds = tokenizer.batch_decode(generated_tokens,
                                           skip_special_tokens=False)

    # create kb
    kb = KB()
    i = 0
    for sentence_pred in decoded_preds:
        current_span_index = i // num_return_sequences
        relations = extract_relations_from_model_output(sentence_pred)
        for relation in relations:
            relation["meta"] = {
                "spans": [spans_boundaries[current_span_index]]
            }
            kb.add_relation(relation)
        i += 1

    return kb

Let’s try it with a longer text of 726 tokens about Napoleon. We are currently splitting the text into spans that are 128 tokens long.

text = """
Napoleon Bonaparte (born Napoleone di Buonaparte; 15 August 1769 – 5 May 1821), and later known by his regnal name Napoleon I, was a French military and political leader who rose to prominence during the French Revolution and led several successful campaigns during the Revolutionary Wars. He was the de facto leader of the French Republic as First Consul from 1799 to 1804. As Napoleon I, he was Emperor of the French from 1804 until 1814 and again in 1815. Napoleon's political and cultural legacy has endured, and he has been one of the most celebrated and controversial leaders in world history. Napoleon was born on the island of Corsica not long after its annexation by the Kingdom of France.[5] He supported the French Revolution in 1789 while serving in the French army, and tried to spread its ideals to his native Corsica. He rose rapidly in the Army after he saved the governing French Directory by firing on royalist insurgents. In 1796, he began a military campaign against the Austrians and their Italian allies, scoring decisive victories and becoming a national hero. Two years later, he led a military expedition to Egypt that served as a springboard to political power. He engineered a coup in November 1799 and became First Consul of the Republic. Differences with the British meant that the French faced the War of the Third Coalition by 1805. Napoleon shattered this coalition with victories in the Ulm Campaign, and at the Battle of Austerlitz, which led to the dissolving of the Holy Roman Empire. In 1806, the Fourth Coalition took up arms against him because Prussia became worried about growing French influence on the continent. Napoleon knocked out Prussia at the battles of Jena and Auerstedt, marched the Grande Armée into Eastern Europe, annihilating the Russians in June 1807 at Friedland, and forcing the defeated nations of the Fourth Coalition to accept the Treaties of Tilsit. Two years later, the Austrians challenged the French again during the War of the Fifth Coalition, but Napoleon solidified his grip over Europe after triumphing at the Battle of Wagram. Hoping to extend the Continental System, his embargo against Britain, Napoleon invaded the Iberian Peninsula and declared his brother Joseph King of Spain in 1808. The Spanish and the Portuguese revolted in the Peninsular War, culminating in defeat for Napoleon's marshals. Napoleon launched an invasion of Russia in the summer of 1812. The resulting campaign witnessed the catastrophic retreat of Napoleon's Grande Armée. In 1813, Prussia and Austria joined Russian forces in a Sixth Coalition against France. A chaotic military campaign resulted in a large coalition army defeating Napoleon at the Battle of Leipzig in October 1813. The coalition invaded France and captured Paris, forcing Napoleon to abdicate in April 1814. He was exiled to the island of Elba, between Corsica and Italy. In France, the Bourbons were restored to power. However, Napoleon escaped Elba in February 1815 and took control of France.[6][7] The Allies responded by forming a Seventh Coalition, which defeated Napoleon at the Battle of Waterloo in June 1815. The British exiled him to the remote island of Saint Helena in the Atlantic, where he died in 1821 at the age of 51. Napoleon had an extensive impact on the modern world, bringing liberal reforms to the many countries he conquered, especially the Low Countries, Switzerland, and parts of modern Italy and Germany. He implemented liberal policies in France and Western Europe.
"""

kb = from_text_to_kb(text, verbose=True)
kb.print()
# Input has 726 tokens
# Input has 6 spans
# Span boundaries are [[0, 128], [119, 247], [238, 366], [357, 485], [476, 604], [595, 723]]
# Relations:
#   {'head': 'Napoleon Bonaparte', 'type': 'date of birth',
#    'tail': '15 August 1769', 'meta': {'spans': [[0, 128]]}}
#   ...
#   {'head': 'Napoleon', 'type': 'place of birth',
#    'tail': 'Corsica', 'meta': {'spans': [[119, 247]]}}
#   ...
#   {'head': 'Fourth Coalition', 'type': 'start time',
#    'tail': '1806', 'meta': {'spans': [[238, 366]]}}
#   ...

The text has been split into six spans, from which 23 relations have been extracted! Note that we also know from which text span each relation comes.

4. Filter and normalize entities

If you look closely at the extracted relations, you can see a relation with the entity “Napoleon Bonaparte” and the entity “Napoleon.” How can we tell our knowledge base that the two entities should be treated as the same?

One way to do this is to use the wikipedia library to check if “Napoleon Bonaparte” and “Napoleon” have the same Wikipedia page. If so, they are normalized to the title of the Wikipedia page. This step is commonly called Entity Linking.

Note that this approach relies on Wikipedia to be constantly updated by people with relevant entities. Therefore, it won’t work if you want to extract entities different from those already in Wikipedia. Moreover, note that we are ignoring “date” (e.g., the 15 August 1769 in <Napoleon, date of birth, 15 August 1769>) entities for simplicity.

Let’s modify our KB code:

  • The KB now stores an entities dictionary with the entities of the stored relations. The keys are the entity identifiers (i.e., the title of the corresponding Wikipedia page), and the values are dictionaries containing the Wikipedia page URL and its summary.
  • When adding a new relation, we now check its entities with the wikipedia library.
class KB():
    def __init__(self):
        self.entities = {}
        self.relations = []

    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def merge_relations(self, r1):
        r2 = [r for r in self.relations
              if self.are_relations_equal(r1, r)][0]
        spans_to_add = [span for span in r1["meta"]["spans"]
                        if span not in r2["meta"]["spans"]]
        r2["meta"]["spans"] += spans_to_add

    def get_wikipedia_data(self, candidate_entity):
        try:
            page = wikipedia.page(candidate_entity, auto_suggest=False)
            entity_data = {
                "title": page.title,
                "url": page.url,
                "summary": page.summary
            }
            return entity_data
        except:
            return None

    def add_entity(self, e):
        self.entities[e["title"]] = {k:v for k,v in e.items() if k != "title"}

    def add_relation(self, r):
        # check on wikipedia
        candidate_entities = [r["head"], r["tail"]]
        entities = [self.get_wikipedia_data(ent) for ent in candidate_entities]

        # if one entity does not exist, stop
        if any(ent is None for ent in entities):
            return

        # manage new entities
        for e in entities:
            self.add_entity(e)

        # rename relation entities with their wikipedia titles
        r["head"] = entities[0]["title"]
        r["tail"] = entities[1]["title"]

        # manage new relation
        if not self.exists_relation(r):
            self.relations.append(r)
        else:
            self.merge_relations(r)

    def print(self):
        print("Entities:")
        for e in self.entities.items():
            print(f"  {e}")
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")

Let’s extract relations and entities from the same text about Napoleon:

text = """
Napoleon Bonaparte (born Napoleone di Buonaparte; 15 August 1769 – 5 May 1821), and later known by his regnal name Napoleon I, was a French military and political leader who rose to prominence during the French Revolution and led several successful campaigns during the Revolutionary Wars. He was the de facto leader of the French Republic as First Consul from 1799 to 1804. As Napoleon I, he was Emperor of the French from 1804 until 1814 and again in 1815. Napoleon's political and cultural legacy has endured, and he has been one of the most celebrated and controversial leaders in world history. Napoleon was born on the island of Corsica not long after its annexation by the Kingdom of France.[5] He supported the French Revolution in 1789 while serving in the French army, and tried to spread its ideals to his native Corsica. He rose rapidly in the Army after he saved the governing French Directory by firing on royalist insurgents. In 1796, he began a military campaign against the Austrians and their Italian allies, scoring decisive victories and becoming a national hero. Two years later, he led a military expedition to Egypt that served as a springboard to political power. He engineered a coup in November 1799 and became First Consul of the Republic. Differences with the British meant that the French faced the War of the Third Coalition by 1805. Napoleon shattered this coalition with victories in the Ulm Campaign, and at the Battle of Austerlitz, which led to the dissolving of the Holy Roman Empire. In 1806, the Fourth Coalition took up arms against him because Prussia became worried about growing French influence on the continent. Napoleon knocked out Prussia at the battles of Jena and Auerstedt, marched the Grande Armée into Eastern Europe, annihilating the Russians in June 1807 at Friedland, and forcing the defeated nations of the Fourth Coalition to accept the Treaties of Tilsit. Two years later, the Austrians challenged the French again during the War of the Fifth Coalition, but Napoleon solidified his grip over Europe after triumphing at the Battle of Wagram. Hoping to extend the Continental System, his embargo against Britain, Napoleon invaded the Iberian Peninsula and declared his brother Joseph King of Spain in 1808. The Spanish and the Portuguese revolted in the Peninsular War, culminating in defeat for Napoleon's marshals. Napoleon launched an invasion of Russia in the summer of 1812. The resulting campaign witnessed the catastrophic retreat of Napoleon's Grande Armée. In 1813, Prussia and Austria joined Russian forces in a Sixth Coalition against France. A chaotic military campaign resulted in a large coalition army defeating Napoleon at the Battle of Leipzig in October 1813. The coalition invaded France and captured Paris, forcing Napoleon to abdicate in April 1814. He was exiled to the island of Elba, between Corsica and Italy. In France, the Bourbons were restored to power. However, Napoleon escaped Elba in February 1815 and took control of France.[6][7] The Allies responded by forming a Seventh Coalition, which defeated Napoleon at the Battle of Waterloo in June 1815. The British exiled him to the remote island of Saint Helena in the Atlantic, where he died in 1821 at the age of 51. Napoleon had an extensive impact on the modern world, bringing liberal reforms to the many countries he conquered, especially the Low Countries, Switzerland, and parts of modern Italy and Germany. He implemented liberal policies in France and Western Europe.
"""

kb = from_text_to_kb(text)
kb.print()
# Entities:
#  ('Napoleon', {'url': 'https://en.wikipedia.org/wiki/Napoleon',
#   'summary': "Napoleon Bonaparte (born Napoleone di Buonaparte; 15 August ..."})
#  ('French Revolution', {'url': 'https://en.wikipedia.org/wiki/French_Revolution',
#   'summary': 'The French Revolution (French: Révolution française..."})
#  ...
# Relations:
#  {'head': 'Napoleon', 'type': 'participant in', 'tail': 'French Revolution',
#   'meta': {'spans': [[0, 128], [119, 247]]}}
#  {'head': 'French Revolution', 'type': 'participant', 'tail': 'Napoleon',
#   'meta': {'spans': [[0, 128]]}}
#  ...

All the extracted entities are linked to Wikipedia pages and normalized with their titles. “Napoleon Bonaparte” and “Napoleon” are now both referred to as “Napoleon”!

5. Extract a knowledge base from an article URL

Taking it a step further, our knowledge base should have the capability to incorporate relations and entities extracted from web articles while also preserving the information about the specific source of each relation.

To do this, we need to modify our KB class so that:

  • Along with relations and entities, sources (i.e., articles from around the web) are stored as well. Each article has its URL as a key and a dictionary with keys article_title and article_publish_date as values.
  • When we add a new relation to our knowledge base, the relation meta field is now a dictionary with article URLs as keys and another dictionary containing the spans as values. In this way, the knowledge base keeps track of all the articles from which a specific relation has been extracted.
class KB():
    def __init__(self):
        self.entities = {} # { entity_title: {...} }
        self.relations = [] # [ head: entity_title, type: ..., tail: entity_title,
          # meta: { article_url: { spans: [...] } } ]
        self.sources = {} # { article_url: {...} }

    ...

    def merge_relations(self, r2):
        r1 = [r for r in self.relations
              if self.are_relations_equal(r2, r)][0]

        # if different article
        article_url = list(r2["meta"].keys())[0]
        if article_url not in r1["meta"]:
            r1["meta"][article_url] = r2["meta"][article_url]

        # if existing article
        else:
            spans_to_add = [span for span in r2["meta"][article_url]["spans"]
                            if span not in r1["meta"][article_url]["spans"]]
            r1["meta"][article_url]["spans"] += spans_to_add

    ...

    def add_relation(self, r, article_title, article_publish_date):
        # check on wikipedia
        candidate_entities = [r["head"], r["tail"]]
        entities = [self.get_wikipedia_data(ent) for ent in candidate_entities]

        # if one entity does not exist, stop
        if any(ent is None for ent in entities):
            return

        # manage new entities
        for e in entities:
            self.add_entity(e)

        # rename relation entities with their wikipedia titles
        r["head"] = entities[0]["title"]
        r["tail"] = entities[1]["title"]

        # add source if not in kb
        article_url = list(r["meta"].keys())[0]
        if article_url not in self.sources:
            self.sources[article_url] = {
                "article_title": article_title,
                "article_publish_date": article_publish_date
            }

        # manage new relation
        if not self.exists_relation(r):
            self.relations.append(r)
        else:
            self.merge_relations(r)

    def print(self):
        print("Entities:")
        for e in self.entities.items():
            print(f"  {e}")
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")
        print("Sources:")
        for s in self.sources.items():
            print(f"  {s}")

Next, we modify the from_text_to_kb function so that it prepares the relation meta field taking into account article URLs as well.

def from_text_to_kb(text, article_url, span_length=128, article_title=None,
                    article_publish_date=None, verbose=False):
    # tokenize whole text
    inputs = tokenizer([text], return_tensors="pt")

    # compute span boundaries
    num_tokens = len(inputs["input_ids"][0])
    if verbose:
        print(f"Input has {num_tokens} tokens")
    num_spans = math.ceil(num_tokens / span_length)
    if verbose:
        print(f"Input has {num_spans} spans")
    overlap = math.ceil((num_spans * span_length - num_tokens) /
                        max(num_spans - 1, 1))
    spans_boundaries = []
    start = 0
    for i in range(num_spans):
        spans_boundaries.append([start + span_length * i,
                                 start + span_length * (i + 1)])
        start -= overlap
    if verbose:
        print(f"Span boundaries are {spans_boundaries}")

    # transform input with spans
    tensor_ids = [inputs["input_ids"][0][boundary[0]:boundary[1]]
                  for boundary in spans_boundaries]
    tensor_masks = [inputs["attention_mask"][0][boundary[0]:boundary[1]]
                    for boundary in spans_boundaries]
    inputs = {
        "input_ids": torch.stack(tensor_ids),
        "attention_mask": torch.stack(tensor_masks)
    }

    # generate relations
    num_return_sequences = 3
    gen_kwargs = {
        "max_length": 256,
        "length_penalty": 0,
        "num_beams": 3,
        "num_return_sequences": num_return_sequences
    }
    generated_tokens = model.generate(
        **inputs,
        **gen_kwargs,
    )

    # decode relations
    decoded_preds = tokenizer.batch_decode(generated_tokens,
                                           skip_special_tokens=False)
    # create kb
    kb = KB()
    i = 0
    for sentence_pred in decoded_preds:
        current_span_index = i // num_return_sequences
        relations = extract_relations_from_model_output(sentence_pred)
        for relation in relations:
            relation["meta"] = {
                article_url: {
                    "spans": [spans_boundaries[current_span_index]]
                }
            }
            kb.add_relation(relation, article_title, article_publish_date)
        i += 1

    return kb

Last, we use the newspaper library to download and parse articles from URLs and define a from_url_to_kb function. The library automatically extracts the article text, title, and publish date (if present).

def get_article(url):
    article = Article(url)
    article.download()
    article.parse()
    return article

def from_url_to_kb(url):
    article = get_article(url)
    config = {
        "article_title": article.title,
        "article_publish_date": article.publish_date
    }
    kb = from_text_to_kb(article.text, article.url, **config)
    return kb

Let’s try to extract a knowledge base from the article Microstrategy chief: ‘Bitcoin is going to go into the millions.’

url = "https://finance.yahoo.com/news/microstrategy-bitcoin-millions-142143795.html"
kb = from_url_to_kb(url)
kb.print()
# Entities:
#   ('MicroStrategy', {'url': 'https://en.wikipedia.org/wiki/MicroStrategy',
#     'summary': "MicroStrategy Incorporated is an American company that ..."})
#   ('Michael J. Saylor', {'url': 'https://en.wikipedia.org/wiki/Michael_J._Saylor',
#     'summary': 'Michael J. Saylor (born February 4, 1965) is an American ..."})
#   ...
# Relations:
#   {'head': 'MicroStrategy', 'type': 'founded by', 'tail': 'Michael J. Saylor',
#    'meta': {'https://finance.yahoo.com/news/microstrategy-bitcoin-millions-142143795.html': 
#      {'spans': [[0, 128]]}}}
#   {'head': 'Michael J. Saylor', 'type': 'employer', 'tail': 'MicroStrategy',
#    'meta': {'https://finance.yahoo.com/news/microstrategy-bitcoin-millions-142143795.html':
#      {'spans': [[0, 128]]}}}
#   ...
# Sources:
#   ('https://finance.yahoo.com/news/microstrategy-bitcoin-millions-142143795.html',
#     {'article_title': "Microstrategy chief: 'Bitcoin is going to go into the millions'",
#      'article_publish_date': None})

The KB is showing a lot of information!

  • From the entities list, we see that MicroStrategy is an American company.
  • From the relations list, we see that Michael J. Saylor is a founder of MicroStrategy, along with where that relation was extracted from (i.e., the article URL and the text span).
  • From the sources list, we see the title and publish date of the aforementioned article.

6. Extract a knowledge base from multiple article URLs

In the case of creating a knowledge base from multiple articles, we can manage it by extracting individual knowledge bases from each article and subsequently merging them together. Let’s add a merge_with_kb method to our KB class.

class KB():
    def __init__(self):
        self.entities = {} # { entity_title: {...} }
        self.relations = [] # [ head: entity_title, type: ..., tail: entity_title,
          # meta: { article_url: { spans: [...] } } ]
        self.sources = {} # { article_url: {...} }

    def merge_with_kb(self, kb2):
        for r in kb2.relations:
            article_url = list(r["meta"].keys())[0]
            source_data = kb2.sources[article_url]
            self.add_relation(r, source_data["article_title"],
                              source_data["article_publish_date"])
            
    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def merge_relations(self, r2):
        r1 = [r for r in self.relations
              if self.are_relations_equal(r2, r)][0]

        # if different article
        article_url = list(r2["meta"].keys())[0]
        if article_url not in r1["meta"]:
            r1["meta"][article_url] = r2["meta"][article_url]

        # if existing article
        else:
            spans_to_add = [span for span in r2["meta"][article_url]["spans"]
                            if span not in r1["meta"][article_url]["spans"]]
            r1["meta"][article_url]["spans"] += spans_to_add

    def get_wikipedia_data(self, candidate_entity):
        try:
            page = wikipedia.page(candidate_entity, auto_suggest=False)
            entity_data = {
                "title": page.title,
                "url": page.url,
                "summary": page.summary
            }
            return entity_data
        except:
            return None

    def add_entity(self, e):
        self.entities[e["title"]] = {k:v for k,v in e.items() if k != "title"}

    def add_relation(self, r, article_title, article_publish_date):
        # check on wikipedia
        candidate_entities = [r["head"], r["tail"]]
        entities = [self.get_wikipedia_data(ent) for ent in candidate_entities]

        # if one entity does not exist, stop
        if any(ent is None for ent in entities):
            return

        # manage new entities
        for e in entities:
            self.add_entity(e)

        # rename relation entities with their wikipedia titles
        r["head"] = entities[0]["title"]
        r["tail"] = entities[1]["title"]

        # add source if not in kb
        article_url = list(r["meta"].keys())[0]
        if article_url not in self.sources:
            self.sources[article_url] = {
                "article_title": article_title,
                "article_publish_date": article_publish_date
            }

        # manage new relation
        if not self.exists_relation(r):
            self.relations.append(r)
        else:
            self.merge_relations(r)

    def print(self):
        print("Entities:")
        for e in self.entities.items():
            print(f"  {e}")
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")
        print("Sources:")
        for s in self.sources.items():
            print(f"  {s}")
  
def get_article(url):
    article = Article(url)
    article.download()
    article.parse()
    return article

def from_url_to_kb(url):
    article = get_article(url)
    config = {
        "article_title": article.title,
        "article_publish_date": article.publish_date
    }
    kb = from_text_to_kb(article.text, article.url, **config)
    return kb
    

Then, we use the GoogleNews library to get the URLs of recent news articles about a specific topic. Once we have multiple URLs, we feed them to the from_urls_to_kb function, which extracts a knowledge base from each article and then merges them together.

def get_news_links(query, lang="en", region="US", pages=1, max_links=100000):
    googlenews = GoogleNews(lang=lang, region=region)
    googlenews.search(query)
    all_urls = []
    for page in range(pages):
        googlenews.get_page(page)
        all_urls += googlenews.get_links()
    return list(set(all_urls))[:max_links]

def from_urls_to_kb(urls, verbose=False):
    kb = KB()
    if verbose:
        print(f"{len(urls)} links to visit")
    for url in urls:
        if verbose:
            print(f"Visiting {url}...")
        try:
            kb_url = from_url_to_kb(url)
            kb.merge_with_kb(kb_url)
        except ArticleException:
            if verbose:
                print(f"  Couldn't download article at url {url}")
    return kb
import pickle

def save_kb(kb, filename):
    with open(filename, "wb") as f:
        pickle.dump(kb, f)

def load_kb(filename):
    res = None
    with open(filename, "rb") as f:
        res = pickle.load(f)
    return res
    

Let’s try extracting a knowledge base from three articles from Google News about “Google.”

news_links = get_news_links("Google", pages=1, max_links=3)
kb = from_urls_to_kb(news_links, verbose=True)
kb.print()
# 3 links to visit
# Visiting https://www.hindustantimes.com/india-news/google-doodle-celebrates-india-s-gama-pehlwan-the-undefeated-wrestling-champion-101653180853982.html...
# Visiting https://tech.hindustantimes.com/tech/news/google-doodle-today-celebrates-gama-pehlwan-s-144th-birth-anniversary-know-who-he-is-71653191916538.html...
# Visiting https://www.moneycontrol.com/news/trends/current-affairs-trends/google-doodle-celebrates-gama-pehlwan-the-amritsar-born-wrestling-champ-who-inspired-bruce-lee-8552171.html...
# Entities:
#   ('Google', {'url': 'https://en.wikipedia.org/wiki/Google',
#     'summary': 'Google LLC is an American ...'})
#   ...
# Relations:
#   {'head': 'Google', 'type': 'owner of', 'tail': 'Google Doodle',
#     'meta': {'https://tech.hindustantimes.com/tech/news/google-doodle-today-celebrates-gama-pehlwan-s-144th-birth-anniversary-know-who-he-is-71653191916538.html':
#       {'spans': [[0, 128]]}}}
#   ...
# Sources:
#   ('https://www.hindustantimes.com/india-news/google-doodle-celebrates-india-s-gama-pehlwan-the-undefeated-wrestling-champion-101653180853982.html',
#     {'article_title': "Google Doodle celebrates India's Gama Pehlwan, the undefeated wrestling champion",
#     'article_publish_date': datetime.datetime(2022, 5, 22, 6, 59, 56, tzinfo=tzoffset(None, 19800))})
#   ('https://tech.hindustantimes.com/tech/news/google-doodle-today-celebrates-gama-pehlwan-s-144th-birth-anniversary-know-who-he-is-71653191916538.html',
#     {'article_title': "Google Doodle today celebrates Gama Pehlwan's 144th birth anniversary; know who he is",
#     'article_publish_date': datetime.datetime(2022, 5, 22, 9, 32, 38, tzinfo=tzoffset(None, 19800))})
#   ('https://www.moneycontrol.com/news/trends/current-affairs-trends/google-doodle-celebrates-gama-pehlwan-the-amritsar-born-wrestling-champ-who-inspired-bruce-lee-8552171.html',
#     {'article_title': 'Google Doodle celebrates Gama Pehlwan, the Amritsar-born wrestling champ who inspired Bruce Lee',
#     'article_publish_date': None})
    

With the knowledge bases from the three articles merged, we now have 10 entities, 10 relations, and 3 distinct sources. Importantly, each relation also records the source article(s) it was extracted from.

7. Visualize the generated knowledge bases

Let’s visualize the output of our work by plotting the knowledge bases. As our knowledge bases are graphs, we can use the pyvis library to create interactive network visualizations.

We define a save_network_html function that:

  1. Initializes an empty directed pyvis network.
  2. Adds the knowledge base entities as nodes.
  3. Adds the knowledge base relations as edges.
  4. Saves the network in an HTML file.

def save_network_html(kb, filename="network.html"):
    # create network
    net = Network(directed=True, width="700px", height="700px", bgcolor="#eeeeee")

    # nodes
    color_entity = "#00FF00"
    for e in kb.entities:
        net.add_node(e, shape="circle", color=color_entity)

    # edges
    for r in kb.relations:
        net.add_edge(r["head"], r["tail"],
                    title=r["type"], label=r["type"])
        
    # configure network
    net.repulsion(
        node_distance=200,
        central_gravity=0.2,
        spring_length=200,
        spring_strength=0.05,
        damping=0.09
    )
    net.set_edge_smooth('dynamic')

    # save network HTML
    net.show(filename, notebook=False)

Let’s try the save_network_html function with a knowledge base built from 20 news articles about “Google.”

news_links = get_news_links("Google", pages=5, max_links=20)
kb = from_urls_to_kb(news_links, verbose=True)
save_network_html(kb, filename="network_3_google.html")
IPython.display.HTML(filename="network_3_google.html")

This is the resulting graph:

Examining the graph closely, we can see the many relations that have been extracted around the Google entity.

It’s worth noting that while not visually represented, the knowledge graph retains valuable information regarding the source and metadata of each relation, such as the articles from which they were extracted. Additionally, the knowledge graph incorporates Wikipedia data for each entity. While visualizing knowledge graphs is useful for debugging, their true benefits shine when utilized for inference purposes.
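For example, once the knowledge base object is in memory, it can be queried directly in Python for downstream use. The snippet below is a minimal sketch, assuming the kb object produced above (with its entities dictionary and relations list), that lists every relation involving a given entity together with its source URLs:

def relations_for_entity(kb, entity_name):
    # Collect every relation in which the entity appears as head or tail.
    return [r for r in kb.relations
            if r["head"] == entity_name or r["tail"] == entity_name]

for r in relations_for_entity(kb, "Google"):
    print(f'{r["head"]} --[{r["type"]}]--> {r["tail"]}')
    # Each relation keeps the URLs of the source articles it was extracted from.
    print("  sources:", list(r["meta"].keys()))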

Launch your project with LeewayHertz!

With a deep understanding of and expertise in knowledge graphs, we build highly intelligent and context-aware AI systems suited to your business needs.

Applications of knowledge graphs

Knowledge graphs have become increasingly popular and are being applied in various domains due to their ability to represent, integrate, and leverage structured data. This section explores some key applications of knowledge graphs and their impact across different fields.

Search and recommendation systems

In search engines, traditional keyword-based matching often fails to provide accurate and relevant results. Knowledge graphs address this limitation by incorporating semantic information and contextual relationships. By representing knowledge in a graph structure, search engines can capture the intricate connections between entities and their attributes, leading to more precise search results.

One of the key technical aspects of using knowledge graphs in search is entity extraction and disambiguation. Entities are identified and linked to the corresponding nodes in the knowledge graph, enabling a deeper understanding of the user’s query and its context. This process involves natural language processing techniques like named entity recognition and entity linking.
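As a simple illustration, the snippet below sketches the entity recognition step with spaCy. It assumes the small English model en_core_web_sm has been downloaded; a full pipeline would follow this with entity linking, mapping each mention to a node in the knowledge graph:

import spacy

# Assumes the model has been installed with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Google opened a new data center in Hamina, Finland in 2011.")
for ent in doc.ents:
    # Each recognized mention would then be linked to a knowledge graph node.
    print(ent.text, ent.label_)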

Knowledge graphs in recommendation systems are vital in capturing user preferences, item characteristics, and relationships between entities. Collaborative filtering techniques and knowledge graph embeddings allow for more accurate and personalized recommendations. Graph-based recommendation models leverage the rich contextual information present in the knowledge graph to identify similar items, recommend relevant content, and discover hidden connections between users and items.

Various technologies and frameworks can be used to implement search and recommendation systems with knowledge graphs. Graph databases like Neo4j and RDF stores like Virtuoso provide efficient storage and querying capabilities for knowledge graph data. Natural language processing libraries such as NLTK and spaCy help with entity recognition and text processing. Additionally, machine learning frameworks like TensorFlow and PyTorch enable the development of recommendation models that leverage graph-based features.
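For instance, a graph-based recommendation in Neo4j often reduces to a short Cypher traversal. The sketch below is purely illustrative; the node labels, relationship types, and connection details are assumptions rather than a real schema:

from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# "Users who bought what you bought also bought..." style recommendation.
CYPHER = """
MATCH (u:User {id: $user_id})-[:BOUGHT]->(:Product)<-[:BOUGHT]-(other:User),
      (other)-[:BOUGHT]->(rec:Product)
WHERE NOT (u)-[:BOUGHT]->(rec)
RETURN rec.name AS recommendation, count(*) AS score
ORDER BY score DESC LIMIT 5
"""

with driver.session() as session:
    for record in session.run(CYPHER, user_id="u42"):
        print(record["recommendation"], record["score"])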

Question answering and natural language processing

Knowledge graphs are valuable for natural language processing tasks, including question-answering and information retrieval. By leveraging structured knowledge, they enable machines to understand and process natural language queries more effectively.

One technical aspect of question-answering systems is query understanding and semantic parsing. Knowledge graphs provide a rich source of structured data that helps interpret the user’s query and identify the relevant entities, relationships, and attributes within the graph. This process involves techniques such as semantic parsing, entity extraction, and relation extraction.

Once the query is understood, knowledge graphs facilitate efficient information retrieval by traversing relationships and associations within the graph. Graph-based algorithms like random walk and graph neural networks enable the exploration of the graph structure to retrieve relevant information and generate accurate answers.
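As a rough illustration, a small knowledge graph can be loaded into a graph library such as networkx and traversed to connect the entities mentioned in a question. The example below uses a few hard-coded toy triples (one of them taken from the knowledge base built earlier) and a plain shortest-path traversal rather than a learned model:

import networkx as nx

# Toy triples standing in for a real knowledge graph.
triples = [
    ("Google", "owner of", "Google Doodle"),
    ("Google", "headquartered in", "Mountain View"),
    ("Mountain View", "located in", "California"),
]

G = nx.DiGraph()
for head, relation, tail in triples:
    G.add_edge(head, tail, relation=relation)

# Traverse the graph to connect two entities mentioned in a question.
path = nx.shortest_path(G, "Google", "California")
for head, tail in zip(path, path[1:]):
    print(f'{head} --[{G.edges[head, tail]["relation"]}]--> {tail}')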

Semantic similarity measures and graph-based algorithms also aid in answer ranking and selection. Question-answering systems can identify the most relevant answer candidates by comparing the semantic similarity between the query and potential answers within the knowledge graph.

Technologies and frameworks commonly used in question answering and NLP tasks with knowledge graphs include graph query languages like SPARQL and Cypher, natural language processing libraries like Transformers and AllenNLP, and graph-based algorithms implemented using graph analytics platforms like Apache Giraph or GraphX.
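For instance, a SPARQL query can be run over an in-memory RDF graph with rdflib. The snippet below is a self-contained sketch using a few made-up triples under an example.org namespace; in practice the graph would be loaded from a triple store or an RDF file:

from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

# A few illustrative triples.
g.add((EX.Google, RDF.type, EX.Company))
g.add((EX.Google, EX.headquarteredIn, EX.MountainView))
g.add((EX.MountainView, EX.locatedIn, EX.California))

# Ask which companies are headquartered where.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?company ?city WHERE {
        ?company a ex:Company ;
                 ex:headquarteredIn ?city .
    }
""")
for company, city in results:
    print(company, city)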

Data integration and knowledge management

Knowledge graphs play a crucial role in data integration and knowledge management, particularly in enterprises dealing with diverse and large volumes of data. By representing data as interconnected entities and relationships, knowledge graphs facilitate the integration of data from multiple sources, ensuring a unified view of the information.

One of the key technical aspects of data integration using knowledge graphs is data extraction and transformation. Raw data from various sources need to be extracted, cleaned, and transformed into a standardized format compatible with the knowledge graph schema. This process often involves techniques such as data wrangling, data cleaning, and schema mapping.

Knowledge graphs employ entity resolution and data linking techniques to achieve data integration. Entity resolution involves identifying and resolving duplicate or similar entities across different data sources, ensuring a consistent representation within the knowledge graph. Data linking establishes connections and relationships between entities based on shared attributes or related information.
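As a highly simplified illustration, the sketch below resolves duplicate company records by normalizing names and comparing them with a string-similarity score from Python's standard library; production systems rely on richer blocking, attribute comparison, and machine-learning-based matching:

from difflib import SequenceMatcher

def normalize(name):
    # Lowercase and strip common legal suffixes before comparing.
    name = name.lower().strip()
    for suffix in (" inc.", " inc", " llc", " ltd"):
        name = name.removesuffix(suffix)
    return name

def same_entity(a, b, threshold=0.9):
    # Treat two records as the same entity if their normalized names are nearly identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(same_entity("Google LLC", "Google"))         # True
print(same_entity("Google LLC", "Alphabet Inc."))  # False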

Ontology and schema design are essential technical considerations in knowledge graph-based data integration. A well-defined ontology or schema provides a structured framework for organizing entities, relationships, and attributes within the knowledge graph, ensuring semantic consistency and interoperability.

Technologies used in data integration and knowledge management with knowledge graphs include ETL (Extract, Transform, Load) tools like Apache NiFi and Talend, graph data integration platforms like Virtuoso and AllegroGraph, and ontology modeling languages like OWL and RDF Schema.
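To make the ontology point concrete, here is a minimal sketch of a schema defined with rdflib using OWL and RDF Schema terms; the classes and property under the example.org namespace are invented for illustration:

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/schema/")
g = Graph()

# Two classes and one property linking them.
g.add((EX.Company, RDF.type, OWL.Class))
g.add((EX.City, RDF.type, OWL.Class))
g.add((EX.headquarteredIn, RDF.type, OWL.ObjectProperty))
g.add((EX.headquarteredIn, RDFS.domain, EX.Company))
g.add((EX.headquarteredIn, RDFS.range, EX.City))

# Print the schema as Turtle.
print(g.serialize(format="turtle"))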

Examples of knowledge graphs

Many tech companies and other service businesses now use knowledge graphs. The following are some of the most well-known consumer-facing knowledge graphs.

  1. Google knowledge graph: Google uses a knowledge graph to enhance its search results by providing relevant information about entities such as people, places, and things. For example, a user searching for a famous person may see a sidebar with key details, related topics, and connections to other entities.
  2. Wikidata knowledge graph: Wikidata is a community-driven knowledge graph that provides structured data for Wikipedia and other Wikimedia projects. It contains information about various topics, including people, places, events, and concepts. Wikidata allows users to query and explore interconnected data in various ways (see the query sketch after this list).
  3. DBpedia knowledge graph: DBpedia is an effort to extract structured information from Wikipedia and create a knowledge graph based on that data. It represents Wikipedia’s content in a structured format, allowing users to explore and query information about entities mentioned in Wikipedia articles.
  4. Microsoft academic graph: Microsoft academic graph is a knowledge graph that focuses on academic research. It contains information about scholarly publications, authors, institutions, conferences, and their relationships. It is used to power various academic search and recommendation systems.
  5. WordNet knowledge graph: WordNet is one of the most widely embraced and comprehensive lexical knowledge graphs available, encompassing words from over 200 languages. Its primary purpose is to offer users definitions and synonyms, aiding in exploring semantic connections between words. Frequently leveraged in natural language processing (NLP) and search applications, the WordNet knowledge graph is crucial in enhancing their performance.
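As referenced in the Wikidata example above, Wikidata exposes a public SPARQL endpoint at https://query.wikidata.org/sparql. The minimal sketch below fetches a handful of items that are instances of "human" (Q5); the User-Agent string is a placeholder you should replace with your own identifier:

import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"
query = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P31 wd:Q5 .    # P31 = instance of, Q5 = human
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

response = requests.get(
    WIKIDATA_SPARQL,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "knowledge-graph-demo/0.1"},  # placeholder identifier
)
for row in response.json()["results"]["bindings"]:
    print(row["personLabel"]["value"])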

Impact of knowledge graphs on various industries

Knowledge graphs have applications across diverse industries, significantly changing how organizations manage, analyze, and derive insights from their data. Here are some notable use cases of knowledge graphs across major industries:

  1. Healthcare and life sciences:
    • Patient care and treatment: Knowledge graphs help understand patient medical histories, identify potential drug interactions, and enable personalized treatment plans.
    • Drug discovery and development: Knowledge graphs aid in integrating and analyzing vast amounts of biological and chemical data to identify potential drug targets, predict drug efficacy, and support research and development efforts.
    • Clinical research and trials: Knowledge graphs facilitate the integration of clinical trial data, patient records, and medical literature, enabling researchers to uncover patterns, identify suitable patient cohorts, and accelerate the discovery of new treatments.
  2. Finance and banking:
    • Fraud detection: Knowledge graphs assist in detecting and preventing financial fraud by connecting and analyzing data from multiple sources, identifying patterns, and detecting anomalies in transaction data.
    • Risk management: Knowledge graphs help assess and manage risks by integrating and analyzing data from various domains, such as market data, customer profiles, and regulatory information, to provide a holistic view of potential risks.
    • Compliance and regulatory reporting: Knowledge graphs aid in compliance management by integrating regulatory requirements, policies, and customer data to ensure regulation adherence and simplify reporting processes.
  3. E-commerce and retail:
    • Personalized recommendations: Knowledge graphs enable personalized product recommendations by analyzing customer behavior, preferences, and historical data to provide tailored suggestions and improve customer engagement and conversion rates.
    • Supply chain optimization: Knowledge graphs facilitate supply chain optimization by integrating and analyzing data from various sources, enabling better inventory management, demand forecasting, and logistics planning.
    • Customer service and support: Knowledge graphs help improve customer service and support by integrating customer data, product information, and support documentation to provide accurate and context-aware assistance to customers.
  4. Media and entertainment:
    • Content recommendations: Knowledge graphs enable personalized content recommendations by analyzing user preferences, viewing history, and social interactions to deliver relevant and engaging content across various media platforms.
    • Content discovery and metadata management: Knowledge graphs assist in organizing and managing content metadata, including information about genres, actors, directors, and related entities, thereby improving content discovery and enabling efficient content indexing and search.
    • Rights management and licensing: Knowledge graphs help manage and track media rights, licensing agreements, and content distribution by connecting entities such as content creators, distributors, and rights holders.
  5. Manufacturing and industrial operations:
    • Supply chain visibility: Knowledge graphs provide a holistic view of the supply chain by integrating data from multiple sources, enabling better tracking, monitoring, and optimizing inventory, production, and logistics.
    • Predictive maintenance: Knowledge graphs aid in predicting equipment failures and optimizing maintenance schedules by integrating data from sensors, maintenance logs, and historical records to identify patterns and anomalies.
    • Quality control and defect detection: Knowledge graphs assist in quality control by integrating data from production processes, sensor data, and inspection records to identify potential defects, patterns, and root causes of quality issues.
  6. Logistics and supply chain management:
    • Freight routing and optimization: Knowledge graphs assist in optimizing freight transportation by considering factors such as delivery deadlines, cargo types, carrier availability, and transportation regulations, leading to efficient and cost-effective logistics operations.
    • Fleet management: Knowledge graphs help manage and track fleet vehicles by integrating data on vehicle performance, maintenance schedules, fuel consumption, and driver behavior, enabling better decision-making and resource allocation.
    • Last-mile delivery: Knowledge graphs aid in optimizing last-mile delivery by considering factors such as customer location, inventory availability, traffic conditions, and delivery preferences, resulting in timely and efficient deliveries.
  7. Intelligent transportation systems:
    • Traffic management: Knowledge graphs help analyze real-time traffic data, historical patterns, and weather information to optimize traffic flow, reduce congestion, and improve overall transportation efficiency.
    • Route planning and optimization: Knowledge graphs aid in providing optimal routes for vehicles by considering factors such as traffic conditions, road restrictions, and user preferences, resulting in reduced travel time and fuel consumption.
    • Public transportation planning: Knowledge graphs enable the integration of public transportation schedules, routes, and passenger data to optimize public transit services, improve connectivity, and enhance the overall passenger experience.

Future implications of knowledge graphs

The future implications of knowledge graphs are significant and far-reaching. As knowledge graphs continue to evolve and advance, they will have a transformative impact on various aspects of our lives. Here are some key future implications of knowledge graphs:

Knowledge graph interoperability: As knowledge graphs continue to evolve and expand across different domains and organizations, there is a need for improved interoperability. Efforts are underway to develop standards and frameworks enabling seamless integration and exchange of knowledge across multiple graphs. This interoperability will enhance collaboration, data sharing, and cross-domain insights.

Hybrid knowledge graphs: The future of knowledge graphs lies in integrating structured and unstructured data sources. Hybrid knowledge graphs combine structured knowledge graphs with unstructured data sources such as text, images, and videos. Advanced techniques, including natural language processing and computer vision, will be employed to extract and incorporate information from unstructured sources, enriching the knowledge graph with a broader range of data.

Dynamic and real-time knowledge graphs: Knowledge graphs are evolving from static representations into dynamic, real-time systems. Real-time data streaming and continuous updates will enable knowledge graphs to capture the most up-to-date information and reflect changing relationships and contexts. This real-time aspect will be crucial for applications such as fraud detection, supply chain optimization, and personalized recommendations.

Explainable and transparent knowledge graphs: With the increasing adoption of knowledge graphs in critical domains, there is a growing demand for explainability and transparency. Future knowledge graphs will focus on providing explanations and justifications for the relationships and recommendations they make. Techniques such as rule-based reasoning and provenance tracking will enhance knowledge graphs’ transparency and trustworthiness.

Federated knowledge graphs: Federated knowledge graphs involve the integration of multiple knowledge graphs across different organizations or domains. These interconnected graphs allow for the seamless sharing and discovery of knowledge while respecting privacy and data ownership. Federated knowledge graphs will enable collaborative research, data integration, and insights generation on a larger scale.

Machine learning integration: The integration of machine learning techniques within knowledge graphs will continue to advance. Knowledge graphs can serve as a valuable source of training data for machine learning models, enhancing their performance in various tasks. On the other hand, ML algorithms can be used to derive insights, predict missing relationships, and enhance the quality of knowledge graphs.

Contextual and personalized knowledge graphs: The future of knowledge graphs lies in providing more contextual and personalized experiences. Knowledge graphs will incorporate user-specific preferences, contexts, and historical data to deliver personalized recommendations, tailored search results, and customized insights. This personalized approach will enhance user engagement and satisfaction.

Ethical considerations: As knowledge graphs become more pervasive and influential, ethical considerations will play a crucial role. Future directions of knowledge graphs will focus on addressing ethical challenges such as data privacy, bias mitigation, and fair representation. Efforts will be made to ensure knowledge graphs are designed and deployed ethically and responsibly.

Endnote

In an era dominated by vast amounts of data, knowledge graphs have emerged as a powerful tool for effective data governance. They provide a structured and interconnected representation of knowledge, enabling us to make sense of complex data landscapes and derive meaningful insights.

Throughout this article, we have explored the evolution and development of knowledge graphs. From their early roots in symbolic AI and knowledge-based systems to the emergence of the semantic web and the advancements in graph theory and databases, knowledge graphs have evolved as a comprehensive framework for organizing and connecting information. The importance of knowledge graphs in today’s information age cannot be overstated. By capturing relationships between entities, knowledge graphs provide context and meaning to data. They facilitate data integration, enabling us to break down data silos and create comprehensive knowledge bases that support data discoverability, consistency, and interoperability.

Knowledge graphs also have significant implications for various industries and applications. They enhance search engines and information retrieval systems, enabling more accurate and relevant results. They power personalized recommendations and intelligent assistants, enhancing user experiences and driving engagement. In domain-specific contexts, such as healthcare and finance, knowledge graphs enable advanced semantic reasoning and decision-support systems. Understanding knowledge graphs is vital for effective data governance. By embracing knowledge graphs, organizations can harness the power of interconnected knowledge, make informed decisions, and unlock new opportunities for innovation. As we continue to navigate the vast seas of data, knowledge graphs will serve as our compass, guiding us toward a deeper understanding of the world around us.

Knowledge graphs are not just a technological concept but a key enabler for effective data governance. By harnessing their potential, we can navigate the complexities of data, unlock valuable insights, and pave the way for a more intelligent and connected future.

With our deep expertise in knowledge graphs, we create insightful and contextually aware AI solutions for your business. Start your AI journey with LeewayHertz today!


Author’s Bio

 

Akash Takyar

CEO, LeewayHertz
Akash Takyar is the founder and CEO of LeewayHertz. With a proven track record of conceptualizing and architecting 100+ user-centric and scalable solutions for startups and enterprises, he brings a deep understanding of both technical and user experience aspects.
Akash's ability to build enterprise-grade technology solutions has garnered the trust of over 30 Fortune 500 companies, including Siemens, 3M, P&G, and Hershey's. Akash is an early adopter of new technology, a passionate technology enthusiast, and an investor in AI and IoT startups.
