Distilling the contents of a large set of unknown free text documents and understanding how they relate to each other is relevant to many industry applications. This tutorial will take you through the HierarchicalTagger, a pipeline of NLP models created to address this broad use case.
Combining the power of Primer Engines with a custom prototype built on top of deep NLP models, the pipeline is designed to ingest an arbitrary set of documents, produce a hierarchical visualization of their contents, and finally make the corpus searchable by tagging each document with both specific and broad keywords. Check out this post if you would like to know more about how it works.
In this tutorial, we’re going to run the pipeline on a dataset from Amazon to create a product inventory from raw item descriptions. To run this tutorial, you’ll need to sign up for free for Primer Engines, our pre-trained models built for developers and data scientists.
At the end of this process, we’ll have a visualization like the one above, showing the hierarchy of topics covered by the product descriptions. We’ll also have a tagged and searchable document set containing both narrow and wide keywords. Finally, we will run a simple web app exposing both the visualization and search functionality to the user via a UI.
Here we go!
Set-up
We’ll start by getting access to the relevant code, data and additional packages. In a terminal, clone our GitHub repository and navigate to its root directory:
$ git clone https://github.com/PrimerAI/primer-hierarchical-tagger.git
$ cd primer-hierarchical-tagger
The full code for the workflow we’ll be going through here can be found in the examples folder, and we’ll use the webapp folder to spin up the app. If you want to have a look at the internals of the pipeline, check out the code and comments in the HierarchicalTagger class.
Next, let’s create a separate environment for our code to run in. We used virtualenvwrapper, but you can use your favorite method instead. We ran these commands to create the environment and install the required packages:
$ mkvirtualenv ht-repo
$ pip install --upgrade pip
$ pip install -r requirements.txt
Installing from requirements.txt will pull in all package dependencies, including the hierarchical_tagger module itself.
Finally, download the Amazon Product Dataset 2020 and save it in the examples/data/ folder. We renamed the file to amazon_products_2020.csv.
We are now ready to launch a Jupyter notebook. We do so from the root folder of the repository and add its path to the PYTHONPATH. This will allow us to call any Python modules found in the root of the repository from inside the notebook.
$ PYTHONPATH=$(pwd) jupyter notebook
Open up amazon-product-descriptions.ipynb in the examples folder. Run the cells under the Set-up section to make sure all required packages are imported and paths are set up correctly.
That’s it, we’re good to go!
Generate abstractive topics via Engines
We are ready to tackle the first substantive step in the pipeline: understanding what each document is about. Primer’s Abstractive Topic Generation Engine is very well suited for this step. Given a raw text document, the engine generates a handful of selected, highly-relevant and intelligible topic labels.
In practice, we would now hit the Primer APIs with batches of documents for processing and receive the desired results back. So that you can proceed directly to the next steps, we’ve done this for you, and included the processed results for a random sample of 3,000 products in this file. Feel free to save your Engines credits and proceed to the next section.
However, running the pipeline on your own data is easy and we’ve included everything you need to get going.
First, sign up for an Engines free trial here. You get 2,000 free credits, which will cover processing for up to 2,000 short documents. If you would like to try the pipeline on a larger document set, please email community@primer.ai to request additional credits.
Once you obtain an API key, save it in credentials.py as ENGINES_API_KEY="YOUR_ENGINES_API_KEY". This file is outside of version control, so it won’t be revealed to others. This way you can also import the key into the notebook instead of hard-coding it.
from credentials import ENGINES_API_KEY
Next, you’ll need to massage your documents into the standard format expected by the Abstractive Topic Generation Engine: a list of dictionaries, each with an id and a text key. This is how we did it for the Amazon Product Dataset; you would need to edit this line according to the format of your own data.
documents = [{"id": r["Uniq Id"], "text": r["About Product"]} for i, r in sampled_items.iterrows()]
We created the infer_model_on_docs helper function to take care of the communication with the Engines API for you. There is a LOT going on under the hood in this function. API calls are asynchronous, which means the ‘waiting time’ spent expecting a result from the API servers can be used productively to carry out other operations in the program, such as triggering other concurrent requests or processing the results of previous ones. This increases the document processing throughput. The function also allows batching of documents, so that a single request can return results for multiple documents. Finally, we use the tenacity module to carry out automatic retries when facing transient errors.
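For a flavor of that pattern, here is a minimal sketch of concurrent, batched requests with automatic retries. This is not the repo’s actual implementation: the endpoint URL and the payload/response shapes are hypothetical placeholders.

# Minimal sketch of the async + batching + retries pattern.
# The URL and payload/response shapes below are hypothetical placeholders.
import asyncio
import aiohttp
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5), wait=wait_exponential(max=30))
async def post_batch(session, api_key, batch):
    async with session.post("https://example.com/v1/topics",  # placeholder URL
                            json={"documents": batch},
                            headers={"Authorization": f"Bearer {api_key}"}) as resp:
        resp.raise_for_status()  # transient HTTP errors trigger a tenacity retry
        return await resp.json()

async def infer_on_batches(api_key, batches):
    async with aiohttp.ClientSession() as session:
        # Fire all batch requests concurrently; gather preserves input order
        return await asyncio.gather(*(post_batch(session, api_key, b) for b in batches))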
You can find the code to send the documents to the Abstractive Topics Engine below. The cell processes a chunk of documents at a time and saves the results to a file at each iteration (just to be extra safe!). While the helper function hides away all the internal complexity of the API calls for the user’s convenience, it’s always a good idea to test the API call on one or two documents to check everything is in shape before triggering a job on a large list of documents. For a test run, simply replace the documents list with a small slice of the document set:
test_documents = documents[:2]
If all is in good shape, you can kick off the document processing:
ITEM_TOPICS = os.path.join(ROOT_DIR, "./examples/data/amazon_products.json")

topics = {}
# Infer topics from Engines
for doc_chunk in chunked(documents, 100):
    topics_results = await infer_model_on_docs(doc_chunk,
                                               model_name="abstractive_topics",
                                               api_key=ENGINES_API_KEY,
                                               batch_size=10,
                                               **{"segmented": False})
    topics.update(topics_results)
    print(f"Collected topics for {len(topics)} documents")

    # Save intermediate results after every chunk
    with open(ITEM_TOPICS, "w") as f:
        json.dump(topics, f)
It’s probably time to make a coffee. The Abstractive Topics Model would not get along well with the Macintosh you had as a kid: it’s one of those heavyweight NLP models with over 100 million parameters that need GPUs to run efficiently at scale. But after a few minutes of waiting, you’ll be able to inspect the topic labels by product id in the topics dictionary.
Ingest the processed docs into the HierarchicalTagger pipeline
Whether you ran Engines on your own data or used our precomputed dataset, you’ll be able to access the document topic representation like this:
topics["96d96237978ba26bbc6baa437372527a"]
OUT: {'topics': ['T6', 'Hover Board', 'Hover Scooter', 'Off Road'],
'segmented_topics': [['T6', 'Hover Scooter', 'Hover Board', 'Off Road']],
'segments': ['Make sure this fits by entering your model number. | FOR ALL RIDERS – The T6 can handle up to 420 lbs., making it the best choice for riders of all shapes and sizes! | ALL TERRAIN - Roll over bumps and inclines up to 30° as you travel through mud, grass, rain, and even gravel. | 12 MILE RANGE - The T6 off road hover board has a 12-mile range, and the capability to reach powered speeds of up to 12 MPH. | 10" RUGGED TIRES - Dual rugged, 10" tubeless tires designed for all terrain exploration. | ROCK WHILE YOU RIDE –The self-balancing hover scooter uses Bluetooth to play music directly from your phone.']}
It’s time to start up the pipeline and load in our processed topics. The code below creates a HierarchicalTagger instance. It might take a few moments the first time you run it, as it downloads the SentenceBERT language model.
from hierarchical_tagger.hierarchical_tagger import HierarchicalTagger
hierarchical_tagger = HierarchicalTagger()
Next, we send the documents and their corresponding topic labels for ingest:
document_topics = {document_id: topics_entry['topics'] for document_id, topics_entry in topics.items()}
hierarchical_tagger.ingest(document_terms=document_topics)
This step is the most computationally demanding, as it involves transforming all topic terms into a vector embedding space using the SentenceBERT language model. This is a fundamental step, as it allows the pipeline to measure semantic distance between terms. To avoid having to repeat it, we can save our HierarchicalTagger instance to a JSON file using the .to_json() helper method. This file will also be the input data to our web app, so let’s save it in webapp/data/:
SERIALIZED_INSTANCE_PATH = os.path.join(ROOT_DIR, "./webapp/data/amazon_products.json")
with open(SERIALIZED_INSTANCE_PATH, "w") as f:
f.write(hierarchical_tagger.to_json())
If we ever want to load up our instance again at a later date, we can simply run:
with open(SERIALIZED_INSTANCE_PATH, "r") as f:
reloaded_serialized = json.load(f)
hierarchical_tagger = HierarchicalTagger.from_dict(reloaded_serialized)
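Before moving on, here is a standalone sketch of what the embedding step buys us: semantic distance between terms, measured with SentenceBERT embeddings. The model name and example terms are illustrative choices of ours, not the pipeline’s.

# Standalone sketch: semantic distance between terms via SentenceBERT
# embeddings. The model name and example terms are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["hover board", "electric scooter", "board games"])

# Cosine similarity is high for semantically close terms
print(util.cos_sim(emb[0], emb[1]))  # hover board vs electric scooter
print(util.cos_sim(emb[0], emb[2]))  # hover board vs board games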
Build the topic tree and tag the documents
Next, we want to look across the corpus to get a big-picture view of the topics it spans and how these relate to each other. In particular, we want to learn the hierarchical relationships between the topics. With that goal, we use agglomerative clustering – a bottom-up hierarchical clustering method – to extract a tree data structure connecting related terms into ever broader groups sharing semantic proximity.
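For intuition on how such a tree emerges, here is a minimal, standalone illustration of bottom-up clustering over term embeddings using scipy. This is not the HierarchicalTagger’s internal code; the terms and model are just examples.

# Standalone illustration of agglomerative (bottom-up) clustering of terms.
# Not the pipeline's internals; terms and model choice are examples.
from scipy.cluster.hierarchy import dendrogram, linkage
from sentence_transformers import SentenceTransformer

terms = ["toy cars", "monster truck", "hover board", "skates", "crafts"]
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(terms)

# Average-linkage clustering on cosine distances merges the closest
# terms first, yielding a dendrogram (tree) over the whole vocabulary.
tree = linkage(embeddings, method="average", metric="cosine")
print(dendrogram(tree, labels=terms, no_plot=True)["ivl"])  # leaf order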
The simplest way to try this on our corpus is by calling the .fit_tag_tree() method. This populates the .tree attribute with a treelib object representing the extracted term tree. This can be manipulated and explored with all the treelib methods, for example .show() to print out a text representation of the tree.
hierarchical_tagger.fit_tag_tree()
hierarchical_tagger.tree.show()
toy
├── crafts
│ └── vehicle
│ ├── cars
│ │ ├── 4wd monster truck
│ │ │ └── monster truck
│ │ ├── automotive industry
│ │ │ ├── automotive design
│ │ │ └── ford mustang
│ │ ├── car racing
│ │ ├── cars cars 3
│ │ │ └── toy cars
│ │ ├── hover board
│ │ │ ├── skateboarding
│ │ │ │ ├── chalkboard
│ │ │ │ └── skates
The final step is tagging the original documents based on the hierarchy we found in the tree, and exposing them for search. Once again, a default call to .tag_documents() will do the trick. The results will be in the .document_tags attribute: a dictionary mapping document id to a list of tuples of the form (term, score, node_id), sorted by descending score. score measures how close in meaning the term is to the document; we would expect higher-level abstractions to have lower scores. node_id loosely indicates how high the node is in the tree: it’s not a perfect measure, but more abstract terms will generally have higher node ids.
hierarchical_tagger.tag_documents()
hierarchical_tagger.document_tags # {doc_id : [(tag, score, approximate hierarchy level), ...]}
Here’s how the pipeline performed for the ‘hover board’ item we saw above.
hierarchical_tagger.document_tags["96d96237978ba26bbc6baa437372527a"]
OUT: [('hover scooter', 0.5336780615328567, 1221),
('skateboarding', 0.5263492531558106, 1654),
('electric scooter', 0.4128446044700507, 1766),
('skates', 0.3843635235742917, 1824),
('sports', 0.2278761348235242, 1933),
('car', 0.17362573115847066, 1981)]
As you can see, the top-scoring tag is spot-on to the specific item description. Beyond that, although not perfect, the pipeline partially succeeds in mapping the item to higher level concepts that, importantly, were not present in the original document representation. It assigns a medium-strength (0.41) tag of ‘electric scooter’, linking the document with other items related to the broader concept of electric mobility. Similarly, the ‘skates’ tag establishes a link with other highly-related sporting equipment. The pipeline also makes a very accurate link to the much broader domain of ‘sports’, with a low score correctly measuring the notable degree of abstraction between the specific item and this high-level concept. Having this sort of tag immediately translates into an improved search experience.
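To make that concrete, here is a hypothetical few-line helper (ours, not part of the repo) that inverts .document_tags into a tag-to-documents search index:

# Hypothetical helper (not part of the repo): invert document_tags into
# a tag -> documents index, so any tag, narrow or broad, supports lookup.
from collections import defaultdict

def build_tag_index(document_tags, min_score=0.2):  # threshold is illustrative
    index = defaultdict(list)
    for doc_id, tags in document_tags.items():
        for term, score, _node_id in tags:
            if score >= min_score:
                index[term].append((doc_id, score))
    # Rank each tag's documents by descending relevance score
    return {term: sorted(docs, key=lambda d: -d[1]) for term, docs in index.items()}

tag_index = build_tag_index(hierarchical_tagger.document_tags)
tag_index.get("sports", [])[:5]  # top documents tagged 'sports'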
Tuning and human-in-the-loop
Of course, some things are off, with the low-scoring tag of ‘car’ not really fitting in with the hoverboard example. It would be great if the AI ‘just got it’ out-of-the-box, but that’s generally not how it happens. Instead, our aim is for the tool to kick-start, and then significantly enhance, the investigation efforts of the human-in-the-loop.
To this end, we expose several tuning parameters that the investigator can tweak to guide the extraction of the term tree and the logic applied when tagging the documents. The analyst can also feed in their own domain knowledge by suggesting additional terms to be included in the tree. These suggestions could even be estimated from a different corpus – what we might think of as a loose form of transfer learning.
For example, one could run the pipeline on descriptions of toys from Amazon to extract an initial term tree, then feed that set of terms as suggestions to the pipeline when analyzing a corpus of letters to Santa.
The point here is that the exploration workflow is likely to be iterative: starting from a set of unknown documents, the investigator can repeatedly run the pipeline, using the pipeline options as levers to guide each run toward better results.
Additionally, the tagging from this tool can be used to generate pseudo-labels to train custom models. For example, say you need a classifier to identify sports items in a larger set of product descriptions. Just using a few sports-related tags from the pipeline will immediately give a pseudo-labelled training dataset. Uploading that data into Primer Automate, you could have a trained model in just a few clicks.
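As a rough sketch of that idea (the tag set and score threshold below are illustrative assumptions, not part of the pipeline):

# Illustrative sketch: pseudo-label products as sports items based on
# pipeline tags. SPORTS_TAGS and min_score are assumptions, not repo code.
SPORTS_TAGS = {"sports", "skateboarding", "skates", "electric scooter"}

def is_sports_item(tags, min_score=0.3):
    # Positive if any sports-related tag clears the score threshold
    return any(term in SPORTS_TAGS and score >= min_score
               for term, score, _node_id in tags)

pseudo_labels = {doc_id: is_sports_item(tags)
                 for doc_id, tags in hierarchical_tagger.document_tags.items()}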
Exploring the corpus with the web app
We have created a simple web app to facilitate this iterative exploration. Running the following command will launch the app at http://localhost:8501/.
$ workon ht-repo # Or alternative command to activate your virtual environment
$ streamlit run webapp/app.py
If you saved the HierarchicalTagger instance in the previous step, you will find the amazon_products.json option in the Datasets drop-down in the left sidebar. Use the sidebar to make sure you are in the Tag Tree view, and you will see a sunburst visualization of the topics in the corpus and how they have been grouped hierarchically. The chart is interactive, so you can click on a node to zoom into its descendants. Using the sliders in the sidebar, you can change the parameters for the fitting of the tree and see the effects in the visualization immediately. Increase the minimum document frequency to prune the smaller leaves in the chart; increase the minimum similarity threshold to push the tree to split branches more easily.
Once you are satisfied with the structure of the tag tree, switch the view to Document Search to use the tags to search the corpus. After choosing a tag in the dropdown, the page will return the most relevant documents from the corpus and display their raw topic labels. Here too there are some parameters one can tweak to guide how documents are tagged. You can increase the minimum abstraction similarity if you notice that documents are being assigned too generously to tags, especially broad ones. Similarly, if you notice that documents are being assigned to tags that are peripheral to the document focus, try increasing the minimum tag score.
Let your creativity loose!
The initial iteration of this tool came out of our work with Vibrant Data Labs to create a searchable map of companies and organizations working on solutions to the challenges posed by climate change.
As we reach the end of this walk-through, we hope we’ve piqued your curiosity to try the pipeline on some other data that is important to you. Indeed, we are releasing the pipeline and example code because we are confident it will be useful across a variety of domains.
Of course, the pipeline can be improved in many ways. One improvement could be functionality for the user to edit the tag tree after fitting (like moving a branch onto another) or to impose constraints on how the tree can grow (for example, requiring that ‘board games’ be a sub-node of ‘toys’).
That said, we’ve played around with book blurbs, news documents, academic papers on COVID-19, and documents made public by the CIA and found that, although far from perfect, it can deliver a lot of insight just out-of-the-box. Can you think of another dataset where the tool could help?