Sometime in 2014, Bastian Obermayer, a reporter working for the Süddeutsche Zeitung newspaper, was sent 11.5 million leaked documents about offshore financial operations, later known as the Panama Papers. It took over a year of analysis by journalists in 80 countries to dig into this immense corpus.

Making sense of the contents of a large set of unknown documents is relevant to many industry applications. In investigative journalism or intelligence operations, quickly identifying the most important documents could be a make-or-break effort. Other times, understanding an emerging or evolving domain can be valuable. For example, imagine mapping new debates on teenage use of digital devices and mental health, or the evolution of topics covered over the decades by popular newspapers.

We developed a prototype pipeline to help address this broad use case as part of our work with Vibrant Data Labs. The pipeline is designed to ingest an arbitrary set of documents, produce a hierarchical visualization of the contents, and finally make the corpus searchable by tagging each document with both specific and broad keywords. 

In this post, we’ll present the pipeline’s methodological design. As we’ll see, harnessing the power of deep NLP models, both open-source ones and those available on Primer Engines, opens up new ways of tackling text analysis tasks.

Task overview

The task requires carrying out three main steps.

First, we need to understand what each document is about. Next, we want to look across the corpus to get a big-picture view of what it covers. Importantly, we’ll want to determine which domains are broad and which are more specific. Finally, we want to use this hierarchical lens to tag the documents in a consistent manner, thereby exposing them for search.

Let’s look at each step in turn.

Document representation

Document representation means converting a text document to a vector representing what the document is about. Commonly used approaches include applying a TF-IDF transformation or calculating document embeddings.

Both approaches, however, have some severe shortcomings for our purpose. TF-IDF offers interpretable representations but ignores semantic similarity. Document embeddings understand such semantics, but come at the price of opaque, and therefore not searchable, document representations. 
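To make TF-IDF’s blind spot concrete, here is a minimal sketch (plain term counts standing in for TF-IDF weights): two descriptions of closely related appliances share no tokens, so their bag-of-words similarity is exactly zero despite their obvious semantic kinship.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Two documents about closely related appliances, with no tokens in common.
doc1 = Counter("washing machine for clothes".split())
doc2 = Counter("dishwasher cleans dishes".split())

print(cosine(doc1, doc2))  # 0.0 -- bag-of-words sees no relation at all
```

An embedding model would score these two sentences as highly similar, which is exactly the semantic signal TF-IDF cannot see.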

Our solution uses Primer’s Abstractive Topic Generation Engine. Its power comes from its understanding of both semantics and context, offered by its underlying deep language model, combined with the generation of plain-language outputs.

With this Engine, we are able to dramatically cut through the complexity of free text and reduce documents to a handful of selected, highly-relevant, and intelligible topic labels.

Get a high-level view across documents 

The next challenge is going from the individual document representations to a zoomed-out view of the corpus content. 

Traditionally, one could tackle this by clustering similar documents together or using topic modelling techniques like Latent Dirichlet Allocation.

We take a different approach. Instead of grouping the documents, we work on the extracted topic terms and learn the relations in that set. To do so, we carry out two simple steps using off-the-shelf tools:

  1. To measure semantic distance between terms, we project these into a vector embedding space using SentenceBERT, an open-source sentence embedding model. 
  2. We use agglomerative clustering – a bottom-up hierarchical clustering method – to extract a tree data structure connecting related terms into ever broader groups sharing semantic proximity. 
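These two steps can be sketched with off-the-shelf scipy; the hand-made 2-D vectors below are illustrative stand-ins for the SentenceBERT embeddings of each term.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D vectors standing in for sentence embeddings of the topic terms.
terms = ["washing machine", "dishwasher", "oven", "microwave"]
vectors = np.array([
    [0.0, 1.0],   # washing machine
    [0.1, 0.9],   # dishwasher
    [1.0, 0.1],   # oven
    [0.9, 0.0],   # microwave
])

# Bottom-up (agglomerative) clustering: repeatedly merge the two closest
# groups, producing a tree encoded in the linkage matrix.
tree = linkage(vectors, method="average", metric="cosine")

# Cutting the tree at a small distance yields narrow groups...
narrow = fcluster(tree, t=0.1, criterion="distance")
# ...while a wider semantic radius collapses everything into one broad group.
broad = fcluster(tree, t=1.0, criterion="distance")
print(narrow, broad)
```

With the narrow cut, ‘washing machine’/‘dishwasher’ and ‘oven’/‘microwave’ form separate clusters; with the broad cut, all four terms fall into a single group.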

This is how the pipeline can learn that ‘washing machine’ and ‘dishwasher’ are related, as are ‘oven’ and ‘microwave’, and as we look across a wider semantic radius, these will eventually fall in a common group. 

Understanding the hierarchy across domains is key to making the corpus searchable, as user queries will range from specific to broad. 

One crucial step is still missing though: how do we label this broader group? The richness of deep language models comes to our aid here. We’ve found that simply selecting the term that is most similar – based on embedding similarity – to the other terms in the group yields a pretty good representative item. 

Let’s look at the concrete example below. Strikingly, the term ‘saucepan’, which combines elements of both pots and pans, indeed emerges as the most central term. On the other hand, ‘wok’ and ‘teflon pan’, which can be thought of as specific types of pans, sit at the bottom of the ranking as representative terms.

Rank  Term         Centrality Score
0     saucepan     3.180589
1     pan          3.028817
2     pot          2.984785
3     wok          2.769623
4     teflon pan   2.728998
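The centrality score can be reproduced in spirit with a few lines of numpy. The vectors below are illustrative stand-ins for the real sentence embeddings, so the exact scores differ from the table, but the most central term comes out the same.

```python
import numpy as np

terms = ["saucepan", "pan", "pot", "wok", "teflon pan"]
# Illustrative vectors: 'saucepan' deliberately mixes the features of
# pots and pans, while 'wok' and 'teflon pan' are pan-like specialisations.
emb = np.array([
    [0.7, 0.7, 0.2],   # saucepan
    [1.0, 0.1, 0.1],   # pan
    [0.1, 1.0, 0.1],   # pot
    [0.9, 0.0, 0.5],   # wok
    [0.9, 0.1, 0.6],   # teflon pan
])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Centrality of a term = sum of its cosine similarities to the other terms.
sims = emb @ emb.T
centrality = sims.sum(axis=1) - 1.0  # subtract each term's self-similarity
print(terms[int(np.argmax(centrality))])  # saucepan
```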

Moreover, the selected representative terms become more conceptually broad and abstract as we seek to label more diverse groups of terms. We can see this behavior in the examples below, wherein the more abstract term is chosen as the most representative when set alongside two related but semantically distinct terms.

Rank  Term       Centrality Score
0     kitchen    1.787934
1     microwave  1.749099
2     pan        1.659810

Rank  Term         Centrality Score
0     dining       1.893527
1     plates       1.890686
2     table cloth  1.887701

By virtue of this feature, these two simple steps allow building a structured view over topic space in the corpus, offering both narrow and broad perspectives.

Search documents using hierarchical relationships

Finally, we can proceed to tag the original documents and power the search functionality. Starting from the original document topic labels, we use the relationships in the term tree to ensure each document is also linked to corresponding, more abstract, domains. For example, depending on the extracted hierarchy, ‘microwave’ could also be tagged as ‘cooking’, ‘household items’, and ‘consumer durables’, ideally with a declining relatedness score as we move further up these abstractions. Notably, this means that microwave products would now be picked up in searches both for microwaves specifically and for household appliances in general.
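A rough sketch of this propagation step (the parent map and decay factor below are made up for illustration; the real pipeline derives relatedness scores from embedding similarity):

```python
# A hypothetical fragment of the extracted term hierarchy: child -> parent.
parents = {
    "microwave": "cooking",
    "cooking": "household items",
    "household items": "consumer durables",
}

def propagate(tag: str, score: float = 1.0, decay: float = 0.6):
    """Walk up the hierarchy, discounting the relatedness score at each hop."""
    tags = [(tag, round(score, 3))]
    while tag in parents:
        tag, score = parents[tag], score * decay
        tags.append((tag, round(score, 3)))
    return tags

print(propagate("microwave"))
# [('microwave', 1.0), ('cooking', 0.6),
#  ('household items', 0.36), ('consumer durables', 0.216)]
```

A query for any tag in this chain now retrieves the document, with the score reflecting how far up the abstraction ladder the match occurred.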

Augmenting insights into your data with deep NLP

We’ve tried this tool on different types of datasets and found it can provide valuable initial insights out-of-the-box. We’ve played around with book blurbs, news documents, academic papers on COVID-19, and documents made public by the CIA. In all cases, we were able, at a minimum, to get an immediate understanding of what the documents were about and to have a way of searching the documents we were most interested in. 

Clustering documents or extracting topics are not new tasks in the domain of unsupervised learning. However, the above workflow differs from these traditional approaches by drawing on the vast additional insight offered by deep language models, such as the Abstractive Topics Engine and SentenceBERT.  Without language models, one would be limited to making sense of documents only based on the distribution of features in the local corpus. Instead, modern NLP can interpret documents, even in small datasets, using the understanding gained over the vast training corpus that is embedded in the language model itself. 

This is the design choice of the Primer Engines, where powerful NLP models are exposed via an API to support the creation of composable NLP pipelines on customer documents. 

Are you curious to see how our hierarchical topic prototype works in practice? Have a look at our tutorial where we create a product inventory from item descriptions from Amazon.

Further reading

SentenceTransformers Documentation

Agglomerative Clustering example, Wikipedia

Agglomerative Clustering, Scikit-learn User Guide

AgglomerativeClustering implementation, Scikit-learn Documentation

Original paper introducing Hierarchical Latent Dirichlet Allocation

We create the tools behind the decisions that change the world. ©2022 Primer

Distilling the contents of a large set of unknown free text documents and understanding how they relate to each other is relevant to many industry applications. This tutorial will take you through the HierarchicalTagger, a pipeline of NLP models created to address this broad use case. 

Combining the power of Primer Engines with a custom prototype built on top of deep NLP models, the pipeline is designed to ingest an arbitrary set of documents, produce a hierarchical visualization of their contents, and finally make the corpus searchable by tagging each document with both specific and broad keywords. Check out this post if you would like to know more about how it works.

In this tutorial, we’re going to run the pipeline on a dataset from Amazon to create a product inventory from raw item descriptions. To run this tutorial, you’ll need to sign up for free for Primer Engines, our pre-trained models built for developers and data scientists.

At the end of this process, we’ll have a visualization like the one above, showing the hierarchy of topics covered by the product descriptions. We’ll also have a tagged and searchable document set containing both narrow and wide keywords. Finally, we will run a simple web app exposing both the visualization and search functionality to the user via a UI. 

Here we go!

Set-up

We’ll start by getting access to the relevant code, data and additional packages. In a terminal, clone our GitHub repository and navigate to its root directory:

$ git clone https://github.com/PrimerAI/primer-hierarchical-tagger.git
$ cd primer-hierarchical-tagger

The full code for the workflow we’ll be going through here can be found in the examples folder, and we’ll use the webapp folder to spin up the app. If you want to have a look at the internals of the pipeline, checkout the code and comments in the HierarchicalTagger class.

Next, let’s create a separate environment for our code to run in. We used virtualenvwrapper, but you can use your favorite method. We ran these commands to create the environment and install the required packages:

$ mkvirtualenv ht-repo
$ pip install --upgrade pip
$ pip install -r requirements.txt

Installing from requirements.txt will pull in all package dependencies, including the hierarchical_tagger module itself. 

Finally, download the Amazon Product Dataset 2020 and save it in the examples/data/ folder. We renamed the file to amazon_products_2020.csv.

We are now ready to launch a Jupyter notebook. We do so from the root folder of the repository and add its path to the PYTHONPATH. This will allow us to call any Python modules found in the root of the repository from inside the notebook.

$ PYTHONPATH=$(pwd) jupyter notebook

Open up the amazon-product-descriptions.ipynb notebook in the examples folder. Run the cells under the Set-up section to make sure all required packages are imported and paths are set up correctly. 

That’s it, we’re good to go!

Generate abstractive topics via Engines

We are ready to tackle the first substantive step in the pipeline: understanding what each document is about. Primer’s Abstractive Topic Generation Engine is very well suited for this step. Given a raw text document, the engine generates a handful of selected, highly-relevant and intelligible topic labels.

In practice, we would now hit the Primer APIs with batches of documents for processing and receive the desired results back. So that you can proceed directly to the next steps, we’ve done this for you, and included the processed results for a random sample of 3,000 products in this file. Feel free to save your Engines credits and proceed to the next section.

However, running the pipeline on your own data is easy and we’ve included everything you need to get going. 

First, sign up for an Engines free trial here. You get 2,000 free credits, which will cover processing for up to 2,000 short documents. If you would like to try the pipeline on a larger document set, please email community@primer.ai to request additional credits.

Once you obtain an API key, save it in credentials.py as ENGINES_API_KEY="YOUR_ENGINES_API_KEY". This file is kept out of version control, so your key won’t be revealed to others. This way you can also import the key into the notebook, instead of hard-coding it.

from credentials import ENGINES_API_KEY

Next, you’ll need to massage your documents into the standard format expected by the Abstractive Topic Generation Engine: a list of dictionaries with an id and a text key. This is how we did this for the Amazon Product Dataset. You would need to edit this line according to the format of your own data.

documents = [{"id": r["Uniq Id"], "text": r["About Product"]} for i, r in sampled_items.iterrows()]

We created the infer_model_on_docs helper function to take care of the communication with the Engines API for you. There is a LOT going on under the hood in this function. API calls are asynchronous, which means the ‘waiting time’ spent expecting a result from the API servers can be used productively for other operations, for example triggering other concurrent requests or processing the results of previous ones. This increases document processing throughput. The helper also supports batching of documents, so that a single request can return results for multiple documents. Finally, we use the tenacity module to carry out automatic retries when facing transient errors. 
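In simplified form, the helper’s core pattern looks like the sketch below. It stubs out the actual API call and uses a plain retry loop in place of tenacity, so it is an illustration of the concurrency pattern rather than the repository code itself.

```python
import asyncio

# Make the stub fail the first two calls, to exercise the retry path.
failures = {"remaining": 2}

async def call_api(batch):
    """Stand-in for a real Engines request; a real implementation would
    POST the batch and poll for results."""
    await asyncio.sleep(0.01)  # simulated network latency
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise ConnectionError("transient error")
    return {doc["id"]: ["topic"] for doc in batch}

async def call_with_retries(batch, attempts=5):
    """Simple retry loop; the real helper uses the tenacity module."""
    for attempt in range(attempts):
        try:
            return await call_api(batch)
        except ConnectionError:
            await asyncio.sleep(0.01 * (attempt + 1))  # back off, then retry
    raise RuntimeError("batch failed after retries")

async def infer(documents, batch_size=10):
    """Fire off one request per batch concurrently and merge the results."""
    batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]
    results = await asyncio.gather(*(call_with_retries(b) for b in batches))
    merged = {}
    for result in results:
        merged.update(result)
    return merged

docs = [{"id": str(i), "text": "..."} for i in range(25)]
all_topics = asyncio.run(infer(docs))
print(len(all_topics))  # 25
```

While one batch is waiting on the network, asyncio hands control to the other in-flight batches, which is where the throughput gain comes from.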

You can find the code to send the documents to the Abstractive Topics Engine below. The cell processes a chunk of documents at a time and saves the results to a file at each iteration (just to be extra safe!). While the helper hides away all the internal complexity of the API calls for your convenience, it’s always a good idea to test the call on one or two documents to check everything is in shape before triggering a job on a large list of documents. For a test run, simply replace the documents list with a small slice of the document set:

test_documents = documents[:2]

If all is in good shape, you can kick off the document processing:

ITEM_TOPICS = os.path.join(ROOT_DIR, "./examples/data/amazon_products.json")

topics = {}

# Infer topics from Engines
for doc_chunk in chunked(documents, 100):
    topics_results = await infer_model_on_docs(doc_chunk, 
                                               model_name="abstractive_topics", 
                                               api_key=ENGINES_API_KEY, 
                                               batch_size=10,
                                               **{"segmented": False})
    topics.update(topics_results)
    print(f"Collected topics for {len(topics)} documents")
    # Save
    with open(ITEM_TOPICS, "w") as f:
        json.dump(topics, f)

It’s probably time to make a coffee. The Abstractive Topics Model would not get along well with the Macintosh you had as a kid: it’s one of those heavyweight NLP models with over 100 million parameters that need GPUs to run efficiently at scale. But after some minutes of waiting, you’ll be able to inspect the topic labels by product id in the topics dictionary.

Ingest the processed docs into the HierarchicalTagger pipeline

Whether you ran Engines on your own data or used our precomputed dataset, you’ll be able to access the document topic representation like this:

topics["96d96237978ba26bbc6baa437372527a"]

OUT: {'topics': ['T6', 'Hover Board', 'Hover Scooter', 'Off Road'],
 'segmented_topics': [['T6', 'Hover Scooter', 'Hover Board', 'Off Road']],
 'segments': ['Make sure this fits by entering your model number. | FOR ALL RIDERS – The T6 can handle up to 420 lbs., making it the best choice for riders of all shapes and sizes! | ALL TERRAIN - Roll over bumps and inclines up to 30° as you travel through mud, grass, rain, and even gravel. | 12 MILE RANGE - The T6 off road hover board has a 12-mile range, and the capability to reach powered speeds of up to 12 MPH. | 10" RUGGED TIRES - Dual rugged, 10" tubeless tires designed for all terrain exploration. | ROCK WHILE YOU RIDE –The self-balancing hover scooter uses Bluetooth to play music directly from your phone.']}

It’s time to start-up the pipeline and load in our processed topics. The code below creates a HierarchicalTagger instance. It might take some moments the first time you run it, as it will download the SentenceBERT language model.

from hierarchical_tagger.hierarchical_tagger import HierarchicalTagger
hierarchical_tagger = HierarchicalTagger()

Next, we send the documents and their corresponding topic labels for ingest:

document_topics = {document_id: topics_entry['topics'] for document_id, topics_entry in topics.items()}
hierarchical_tagger.ingest(document_terms=document_topics)

This step is the most computationally demanding as it involves transforming all topic terms into a vector embedding space using the SentenceBERT language model. This is a fundamental step as it will allow the pipeline to measure semantic distance between terms. To avoid having to repeat this, we can save our HierarchicalTagger instance to a json file, using the .to_json() helper method. This file will also be the input data to our web app, so let’s save it in webapp/data/:

SERIALIZED_INSTANCE_PATH = os.path.join(ROOT_DIR, "./webapp/data/amazon_products.json")
with open(SERIALIZED_INSTANCE_PATH, "w") as f:
    f.write(hierarchical_tagger.to_json())

If we ever want to load up our instance again at a later date, we can simply run:

with open(SERIALIZED_INSTANCE_PATH, "r") as f:
    reloaded_serialized = json.load(f)
hierarchical_tagger = HierarchicalTagger.from_dict(reloaded_serialized)

Build the topic tree and tag the documents

Next, we want to look across the corpus to get a big-picture view of the topics it spans and how these relate to each other. In particular, we want to learn the hierarchical relationships between the topics. With that goal, we use agglomerative clustering – a bottom-up hierarchical clustering method – to extract a tree data structure connecting related terms into ever broader groups sharing semantic proximity. 

The simplest way to try this is by calling the .fit_tag_tree() method. This populates the .tree attribute with a treelib object representing the extracted term tree. This can be manipulated and explored with all the treelib methods, for example .show() to print out a text representation of the tree.

hierarchical_tagger.fit_tag_tree()
hierarchical_tagger.tree.show()
toy
├── crafts
│   └── vehicle
│       ├── cars
│       │   ├── 4wd monster truck
│       │   │   └── monster truck
│       │   ├── automotive industry
│       │   │   ├── automotive design
│       │   │   └── ford mustang
│       │   ├── car racing
│       │   ├── cars cars 3
│       │   │   └── toy cars
│       │   ├── hover board
│       │   │   ├── skateboarding
│       │   │   │   ├── chalkboard
│       │   │   │   └── skates

The final step is tagging the original documents based on the hierarchy we found in the tree, and exposing them for search. Once again, a default call to .tag_documents() will do the trick. The results will be in the .document_tags attribute: a dictionary mapping each document id to a list of (term, score, node_id) tuples sorted by descending score. The score measures how close in meaning the term is to the document; we would expect higher-level abstractions to have lower scores. The node_id loosely indicates how high the node sits in the tree: it’s not a perfect measure, but more abstract terms will generally have higher node ids.

hierarchical_tagger.tag_documents()
hierarchical_tagger.document_tags # {doc_id : [(tag, score, approximate hierarchy level), ...]}

Here’s how the pipeline performed for the ‘hover board’ item we saw above.

hierarchical_tagger.document_tags["96d96237978ba26bbc6baa437372527a"]

OUT: [('hover scooter', 0.5336780615328567, 1221),
 ('skateboarding', 0.5263492531558106, 1654),
 ('electric scooter', 0.4128446044700507, 1766),
 ('skates', 0.3843635235742917, 1824),
 ('sports', 0.2278761348235242, 1933),
 ('car', 0.17362573115847066, 1981)]

As you can see, the top-scoring tag is spot-on to the specific item description. Beyond that, although not perfect, the pipeline partially succeeds in mapping the item to higher level concepts that, importantly, were not present in the original document representation. It assigns a medium-strength (0.41) tag of ‘electric scooter’, linking the document with other items related to the broader concept of electric mobility. Similarly, the ‘skates’ tag establishes a link with other highly-related sporting equipment. The pipeline also makes a very accurate link to the much broader domain of ‘sports’, with a low score correctly measuring the notable degree of abstraction between the specific item and this high-level concept. Having this sort of tag immediately translates into an improved search experience.

Tuning and human-in-the-loop

Of course, some things are off, with the low-scoring tag of ‘car’ not really fitting in with the hoverboard example. It would be great if the AI ‘just got it’ out-of-the-box, but that’s generally not how it happens. Instead, our aim is for the tool to kick-start, and then significantly enhance, the investigation efforts of the human-in-the-loop. 

To this end, we expose several tuning parameters that the investigator can tweak to guide the extraction of the term tree and the logic applied when tagging the documents. The analyst can also feed their own domain knowledge by suggesting additional terms to be included in the tree. This input could even be estimated from a different corpus; what we might think of as a loose form of transfer learning. 

For example, one could run the pipeline on descriptions of toys from Amazon to extract an initial term tree, then feed that set of terms as suggestions to the pipeline when analyzing a corpus of letters to Santa. 

The point here is that the exploration workflow is likely to be iterative: starting from a set of unknown documents, the investigator can run the pipeline repeatedly, using its options as levers to steer each iteration toward better results.

Additionally, the tagging from this tool can be used to generate pseudo-labels to train custom models. For example, say you need a classifier to identify sports items in a larger set of product descriptions. Just using a few sports-related tags from the pipeline will immediately give a pseudo-labelled training dataset. Uploading that data into Primer Automate, you could have a trained model in just a few clicks.
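A minimal sketch of this pseudo-labelling idea, using made-up tags in the pipeline’s (term, score, node_id) output format; the tag set and score threshold are illustrative choices an analyst would make:

```python
# Hypothetical tagging output in the pipeline's (tag, score, node_id) format.
document_tags = {
    "doc1": [("skates", 0.52, 1824), ("sports", 0.31, 1933)],
    "doc2": [("sports", 0.45, 1933), ("outdoors", 0.22, 1940)],
    "doc3": [("cookware", 0.61, 1700), ("kitchen", 0.33, 1910)],
}

SPORTS_TAGS = {"sports", "skates", "fitness"}  # tags picked by the analyst
MIN_SCORE = 0.3                                # ignore weak, peripheral tags

def pseudo_label(tags):
    """Label a document 1 ('sports') if any strong-enough tag is sports-related."""
    return int(any(tag in SPORTS_TAGS and score >= MIN_SCORE
                   for tag, score, _node_id in tags))

training_set = {doc_id: pseudo_label(tags) for doc_id, tags in document_tags.items()}
print(training_set)  # {'doc1': 1, 'doc2': 1, 'doc3': 0}
```

The resulting labels are noisy, but often good enough to bootstrap a first classifier.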

Exploring the corpus with the web app

We have created a simple web app to facilitate this iterative exploration. Running the following command will launch the app at http://localhost:8501/.

$ workon ht-repo # Or alternative command to activate your virtual environment
$ streamlit run webapp/app.py

If you saved the HierarchicalTagger instance in the previous step, you will find the amazon_products.json option in the Datasets drop-down in the left sidebar. Use the sidebar to make sure you are in the Tag Tree view, and you will see a sunburst visualization of the topics in the corpus and how they have been grouped hierarchically. The chart is interactive, so you can click on a node to zoom into its descendants. Using the sliders in the sidebar, you can change the parameters for the fitting of the tree and see the effects in the visualization immediately. Increase the minimum document frequency to prune the smaller leaves in the chart; increase the minimum similarity threshold to push the tree to split branches more easily. 

Once you are satisfied with the structure of the tag tree, switch the view to Document Search to use the tags to search the corpus. After choosing a tag in the dropdown, the page will return the most relevant documents from the corpus and display their raw topic labels. Here too there are some parameters one can tweak to guide how documents are tagged. You can increase the minimum abstraction similarity if you notice that documents are being assigned too generously to tags, especially broad ones. Similarly, if you notice that documents are being assigned to tags that are peripheral to the document focus, try increasing the minimum tag score.

Let your creativity loose!

The initial iteration of this tool came out of our work with Vibrant Data Labs to create a searchable map of companies and organizations working on solutions to the challenges posed by climate change.

As we reach the end of this walk-through, we hope we’ve managed to trigger your curiosity to try the pipeline on some other data that is important to you. Indeed, we are releasing the pipeline and example code as we are confident it will be useful across a variety of domains. 

Of course, the pipeline can be improved in many ways. One improvement could be functionality for the user to edit the tag tree after fitting (like moving a branch onto another) or to impose constraints on how the tree can grow (for example, imposing that ‘board games’ should be a sub node of ‘toys’). 

That said, we’ve played around with book blurbs, news documents, academic papers on COVID-19, and documents made public by the CIA and found that, although far from perfect, the pipeline can deliver a lot of insight out-of-the-box. Can you think of another dataset where the tool could help? Sign up for an Engines free trial here and try it out!
