Primer’s R&D and Engineering teams are working to solve some of the most complex and challenging problems in machine learning. Here’s where Primer stands apart—and what’s coming next in a rapidly advancing field.

For machine learning engineering teams trying to implement NLP, selecting the right tools is a challenge. What does it look like on the ground to solve problems at a deep learning NLP company, one where you're sometimes building your own tools? Here's how we solve things at Primer, as well as where we are focusing our research and development next.

Read More: The 5 Predictions for NLP

1. Zero shot entity recognition 

Natural language processing is based on models that recognize entities – for example, people's names, or specific classes like 'tanks,' 'trees,' or 'fruits.' The classic ML process is to get a dataset containing those entities and then train the model to recognize them. A zero shot entity recognition model, by contrast, can recognize arbitrary entities without further training. In other words, with a zero shot model you essentially skip training: give users the model, and it works – in "zero shots." The closest thing to zero shot models in machine learning currently are "few shot models," which do the same thing with limited training.

Still in R&D at Primer, zero shot entity recognition is something that has not been pulled off publicly by any company to date. Primer’s R&D engineers recently created a model that can theoretically perform zero shot, essentially a “Version 0” zero shot model. In this model, the parameters are there, but performance isn’t high enough yet to deploy, prompting additional R&D. The current zero shot model is trained on thousands of entity classes and documents. 

Much like a human might be able to read new words through context and have a decent understanding of the meaning, a zero shot model should, in theory, be able to generally recognize a word if it exists in the English language, even if it wasn’t trained on that word. In short, the zero shot model is trained to recognize so many entities that it’s able to generalize to entities it hasn’t seen yet. As for the impact? Months of work building a model could be reduced down to just a few hours. 

2. Inference triage

GPU usage is a major part of a company’s spend when investing in ML and NLP. But what if that cost could be reduced by up to 85% simply by changing the way a model is deployed? 

To solve this problem of expensive GPU run time, Primer created inference triage – reducing computational load by training a cheaper and faster ML model that outsources tasks only when necessary to a larger model which requires more compute power. But that's the key – tasks are only outsourced when necessary. Let's break down how this works.

At Primer, we like to have fun along the way, so we named this project "BabyBear" – that's the name of the cheaper and faster model, or the "younger" model. In this project, BabyBear looks to another model, named "MamaBear," to answer the questions that it cannot answer on its own. Our current version of the inference triage algorithm works on classification and entity recognition tasks. In this algorithm, predictions from the MamaBear model are treated as the gold-labeled training dataset for the BabyBear model. For each input, if the BabyBear model is confident in its prediction, that prediction is considered final—otherwise MamaBear is called.
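The gating logic at the heart of this is simple. Here's a minimal sketch in Python with toy stand-in models – the keyword-based "models," names, and threshold are purely illustrative, not Primer's actual implementation:

```python
# Minimal sketch of confidence-based inference triage (illustrative only).
# "babybear" is a cheap model returning (label, confidence); "mamabear"
# stands in for the expensive transformer we only call when needed.

def babybear(text):
    # Toy stand-in for a fast classifier: confident only on clear cues.
    label = "positive" if "good" in text else "negative"
    confidence = 0.95 if ("good" in text or "bad" in text) else 0.40
    return label, confidence

def mamabear(text):
    # Toy stand-in for the large, expensive model.
    return "positive" if "good" in text else "negative"

def triage(text, threshold=0.9):
    """Return (label, which_model), calling mamabear only when babybear
    is not confident enough."""
    label, confidence = babybear(text)
    if confidence >= threshold:
        return label, "babybear"
    return mamabear(text), "mamabear"

triage("a good result")       # handled by the cheap model
triage("an ambiguous one")    # escalated to the expensive model
```

In a real deployment, babybear would be trained on mamabear's own predictions during a warm-up period, so the two models agree wherever babybear is confident.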

Depending on the task, GPU run time was reduced by up to 85% in Primer’s model testing. 

Reducing GPU needs has profound effects on cost in ML, and it also increases the hardware flexibility for on-prem deployments.  Having models that are more efficient means AI can be run on a wider range of hardware with a smaller footprint.

Primer has already started to deploy these models for Named Entity Recognition (NER), but they could be deployed on a wider variety of text data tasks. Even better? Reducing GPU usage not only cuts run time and cost, but significantly reduces the carbon footprint of running the models—a significant need in NLP.

3. Custom summarization  

The ability of ML models to not just read and write, but truly understand and synthesize information, is one of the most desired aspects of NLP. 

In partnership with Summari, Primer’s forward-deploy engineering team used Primer’s Platform to train and deploy a customized text-to-text summarization model. Coupled with Primer’s data ingestion pipeline, the custom model delivers human-quality summaries of any long-form article on the internet instantly. Faster delivery times allow Summari to expand their offerings from a few dozen publications to the entire internet. The market’s reception has been strong, earning Summari the top spot on Product Hunt. It’s a process that could be repeated with other models. 

It’s not just a breakthrough for Summari, but for NLP technology itself. An instant summary, in human quality, of a massive amount of information displays the power of NLP.

4. Topic modeling 

Topic modeling is a feature that Primer developed for a leading business publication. The publication has a large volume of content that its customers struggle to search and find, and it wanted to explore how ML pipelines could help.

Primer's team built a custom text-to-text model to tag content against the publication's enterprise taxonomy of 700 terms. The result was a model that helped automate the tagging of articles and cases, yielding a better and cleaner taxonomy with clear hierarchy and categorization. The same solution could apply to any large business with a corpus of data: it saves the labor time and cost of manual curation, and provides better metadata for a better search and user experience for customers.

5. Synthetic text detection

Detecting the spread of disinformation is a top priority for many NLP providers. How can analysts make sense of their world at scale without accurate detection of the networks and bots that seek to manipulate data? The implications are immense. Bots are often detectable because they post the same message over and over, and a bot's profile often has 'tells': imperfect use of the language, a singular theme across posts, and numerous bot followers. But these 'tells' are getting increasingly difficult to identify with recent advancements in synthetic text generation. Primer Command can already detect synthetic text and disinformation within specific data sources. Even more valuable, Primer's R&D engineers are developing an advanced version of this technology that could be deployed on any available data.

6. An end-to-end full stack platform for natural language processing 

Doing all of this R&D is a lot of work. The last thing data scientists want to do is set up the infrastructure around it. Primer is building infrastructure for its data scientists and forward-deployed team to build tools, models, and systems faster. That means instead of using one vendor to ingest data, another to label data, and yet another to build and deploy models, each of these steps lives together on a unified platform.

We also built Data Map to empower data scientists to find mislabeled examples in an automated way. Spending less time labeling and wrangling data means focusing on creating ML models. 

It also means ML teams don't have to build and maintain this complex toolchain—one so complex that most projects fail. Now engineering and data science teams have the full set of tools to do this for any use case or data source.

For more on Primer’s products and infrastructure, visit the resource library.

BabyBear cuts GPU time and money on large transformer models

When you’re processing millions of documents with dozens of deep learning models, things add up fast. There’s the environmental cost of electricity to run those hungry models. There’s the latency cost as your customers wait for results. And of course there’s the bottom line: the immense computational cost of the GPU machines on premises or rented in the cloud. 

We figured out a trick here at Primer that cuts those costs way down. We’re sharing it now (paper, code) for others to use. It is an algorithmic framework for natural language processing (NLP) that we call BabyBear. For most deep learning NLP tasks, it reduces GPU costs by a third to a half. And for some tasks, the savings are over 90%. 

The basic idea is simple: The best way to make a deep learning model cheaper is to not use it at all, when you don’t need it. The trick is figuring out when you don’t need it. And that’s what BabyBear does. 

In our algorithm, the expensive deep learning model is called mamabear. As the documents stream in to be processed by mamabear, we place another model upstream: babybear. This model is smaller, faster, and cheaper than mamabear, but also less accurate. For example, babybear might be a classical machine learning model like XGBoost or Random Forest. 

The babybear model can be whatever you want—as long as it produces confidence scores along with its predictions. We use the confidence of babybear to determine whether an incoming document requires mamabear. If it's an easy one, babybear handles it and lets mamabear sleep. If it's a difficult one, babybear passes it to mamabear.

How does babybear learn this task? During a “warm up” period, we send the stream of documents to mamabear, just as we would in production without the BabyBear framework. The babybear model directly observes mamabear as an oracle, making its own predictions to learn the skill. Once it has sufficient skill, it gets to work. 

For example, we took this open source sentiment analysis model as a mamabear. For the babybear warm up we trained an XGBoost model using 10,000 inferences from mamabear as gold data. With hyperparameter optimization it took 113 minutes. But the babybear had already learned the task sufficiently to get to work after about 2000 examples—less than half an hour of training.

Every NLP practitioner has created keyword filters upstream of expensive models to save unnecessary processing. BabyBear just replaces that human data science work with machine learning.

All that you need to do is tweak one parameter: the performance threshold. That determines how much loss in overall F1 score you're willing to pay in return for computational savings. Whatever you set it to—10%, 5%, or 0%—BabyBear will save as much compute as possible by adjusting its own confidence threshold to meet that target.
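To illustrate how a confidence threshold could be calibrated against a performance budget, here's a hedged sketch. It uses accuracy on warm-up examples as a stand-in for F1, and the function name and numbers are illustrative, not BabyBear's actual code:

```python
# Sketch: pick the lowest babybear confidence threshold whose error rate on
# the examples babybear keeps stays within a loss budget. Lower thresholds
# keep more examples away from mamabear, i.e. more compute savings.

def pick_threshold(confidences, babybear_correct, max_loss=0.05):
    """confidences: babybear's confidence per warm-up example.
    babybear_correct: whether babybear matched mamabear on each example.
    Returns the most permissive threshold within budget, else None."""
    for t in sorted(set(confidences)):  # lowest threshold first
        kept = [ok for c, ok in zip(confidences, babybear_correct) if c >= t]
        if kept and (1 - sum(kept) / len(kept)) <= max_loss:
            return t
    return None  # no threshold meets the budget; always call mamabear

confs   = [0.99, 0.95, 0.90, 0.60, 0.55]
correct = [True, True, True, False, False]
pick_threshold(confs, correct, max_loss=0.05)
```

With this toy data, the two low-confidence examples were wrong, so the calibration settles on a threshold that routes them to mamabear while letting babybear handle the rest.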

What I’ve described so far is the simplest version of BabyBear. It applies to document classification, for example, where babybear is a single model upstream of mamabear and is learning to perform the same task. In our paper, we describe more complicated versions of BabyBear. For NLP tasks such as entity recognition, we achieve greater savings using a circuit of multiple babybear models. And some of those babybear models can be distilled (cheaper, faster) versions of mamabear. But the same algorithm applies. 

BabyBear is the framework that we use here at Primer to automate some of the data science we once laboriously did ourselves. For the vast majority of NLP tasks—basically everything not requiring text generation—BabyBear can help prevent needless draining of your wallet and the electrical grid.

Try it out on your data.

Since the development of the Transformer model, finding and re-labeling mislabeled examples has become a much more important factor in model performance. Because Transformer models let ML engineers create production-quality models with less and less data, mislabeled data has a greater and greater impact on model quality.

As an example of how this plays out: for a simple classification task, a Transformer-based model like XLNet may only require a couple hundred examples in order to perform well. When your dataset only has 200 examples, and 10% of them are mislabeled, those mislabeled examples can have a big impact, dragging performance down and making it harder to improve the model even when more examples are added. As a result, nowadays it's more important for model creators to focus on data quality than data quantity.

This is because Transformers can be "pre-trained" on natural language (such as random internet text), learning the rules of language simply by reading lots of documents. They can then be "fine-tuned" to perform a specific task, such as classification or named-entity recognition. Since we can start with a pre-trained model, which has already learned from millions of documents, fine-tuning for a more specific use case requires orders of magnitude less data. As an aside, speeding up model development is the rationale behind Primer's pre-trained, domain-specific models.

Let’s say you’re doing a classification between two classes, Class A and B. When your model sees an example that clearly looks like Class B, and instead it’s labeled Class A, it now has to adjust its idea of what Class A looks like to accommodate this weird Class-B-like example, which is likely an outlier relative to the other Class A data points. If this happens too many times, it can distort the model’s overall understanding of the task, causing it to underperform on both Class A and Class B.

Data Map simplifies identifying mislabeled data

Data Map is a Primer feature that makes it easy to visualize the outliers so that data scientists and labelers can quickly and easily spot and correct mislabeled examples.

Most data science teams leverage some form of external help with annotating examples, as businesses generally want to minimize the time data scientists or business subject matter experts spend on data labeling. As a result, the annotators' understanding of the task often differs from that of the SMEs, and examples can get labeled incorrectly.

We built Data Map to empower data scientists to find these mislabeled examples in an automated way. On top of that, Data Map has the added benefit of giving us additional information about how the model perceived the examples in the dataset during training. (We’ll get into that during the technical explanation of Data Map.)

Our internal experiments using Data Map to find errors showed that, for small datasets, Data Map can identify up to 80% of the errors in the data by examining only 5% of the dataset. 

More directly, instead of re-reviewing the entire labeled dataset, data scientists can simply look at the 5% “most likely mislabeled” examples and re-label those, and rest assured that the remainder of the dataset is likely correct.

How does Data Map work?

Data Map is based on the paper Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics. The idea is to track three relevant metrics for each example as the model is being trained on those examples. 

The metrics are:

  • Correctness: How often the model got this example correct during training; or, more technically, the proportion of training epochs during which the model predicted the example correctly.
  • Confidence: The model’s average level of confidence when making predictions for this example during training.
  • Variability: How often the model changed its mind about the example’s predicted label during training. An example that oscillates between being classified as Class A and Class B repeatedly will have high variability; an example consistently classified as Class A will have low variability.

Each metric falls in the range [0, 1]. Usefully, these metrics also allow us to group data points into descriptive categories: easy-to-learn, ambiguous, and hard-to-learn. 
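As a rough sketch, these metrics can be computed from per-epoch training records. The data layout below is hypothetical, and variability here follows the Dataset Cartography paper's definition as a standard deviation of the gold-label probability:

```python
# Sketch of computing Data Map training dynamics for a single example from
# per-epoch records: the predicted label each epoch, and the probability the
# model assigned to the example's gold label each epoch.

def data_map_metrics(gold_label, epoch_preds, epoch_gold_probs):
    n = len(epoch_preds)
    # Correctness: fraction of epochs the model predicted the gold label.
    correctness = sum(p == gold_label for p in epoch_preds) / n
    # Confidence: mean probability assigned to the gold label.
    confidence = sum(epoch_gold_probs) / n
    # Variability: std dev of the gold-label probability across epochs.
    variability = (sum((p - confidence) ** 2 for p in epoch_gold_probs) / n) ** 0.5
    return correctness, confidence, variability

# An example the model got right in 2 of 3 epochs:
data_map_metrics("A", ["A", "B", "A"], [0.9, 0.4, 0.8])
```

All three values land in [0, 1], which is what lets us bin examples into the easy-to-learn, ambiguous, and hard-to-learn groups described below.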

An easy-to-learn example will typically have high correctness, high confidence, and low variability.

An ambiguous example will typically have high variability, because the model was indecisive about where to put it. There won't necessarily be a trend with respect to correctness or confidence.

A hard-to-learn example will most importantly have low correctness. There won’t necessarily be a trend with respect to variability or confidence.

These three metrics, taken together, give us a better picture of how challenging the model found those examples over the course of the training. 

The authors of the original paper found many interesting correlations with respect to the role that each group of data points plays in helping a model learn a task, which we would encourage you to check out if you’re interested. Here, we will focus on the hard-to-learn group.

Hard-to-learn examples are the examples we single out as being “possibly mislabeled.” The authors of the original paper found that hard-to-learn examples were frequently mislabeled, and our internal research has replicated this finding. Hard-to-learn examples which also have high confidence and low variability are particularly likely to be mislabeled (rather than just challenging), because it means the model was very confident about its answer, but still got the example consistently wrong.
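That flagging rule can be sketched in a few lines. The cutoffs below are illustrative, not the values Data Map actually uses, and "confidence" here follows the framing above (confidence in the model's own answer):

```python
# Hedged sketch of flagging "possibly mislabeled" examples: low correctness
# is the main signal, and high confidence plus low variability (confidently,
# consistently wrong) makes an example especially suspicious.

def possibly_mislabeled(metrics, correctness_max=0.2,
                        confidence_min=0.7, variability_max=0.2):
    flagged = []
    for example_id, (correctness, confidence, variability) in metrics.items():
        if correctness <= correctness_max:
            suspicious = (confidence >= confidence_min
                          and variability <= variability_max)
            flagged.append((example_id, "high" if suspicious else "medium"))
    return flagged

metrics = {
    "doc1": (0.0, 0.9, 0.05),  # confidently, consistently wrong
    "doc2": (0.1, 0.4, 0.30),  # hard, but not confidently wrong
    "doc3": (0.9, 0.8, 0.10),  # learned fine; not flagged
}
possibly_mislabeled(metrics)
```

Surfacing just the flagged subset is what lets reviewers examine ~5% of a dataset instead of all of it.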

Technical implementation

First, a research engineer on our Applied Research team had to implement the tracking of Data Map metrics into our internal deep learning library’s classifier training module. This required digging into the training loop for the model and inserting additional logic to store data about the relevant concepts after each epoch, then calculate them at the end of the training session.

Then, we had to spec out an ideal way to track and store the Data Map information in our Postgres database. This required the creation of a dedicated "cartography" table, and updates to business logic throughout the application to introduce the concept of "possibly mislabeled" documents, as well as to keep track of whether or not users have "reviewed" their mislabeled documents. This was important because we didn't want to constantly resurface the same documents to users if they had already addressed them.

Part of the data modeling work was to implement additional functionality into the deep learning module to track training examples by ID from beginning to end, so we could link them back to their original document IDs – something we hadn’t needed to worry about up to this point.

Next, our ML platform's training service needed the requisite updates to pass the Data Map information through and save it to S3 whenever Automate sends it a new training job. We also needed to update the number of epochs we were using to get more granular values from the data map calculations (particularly for correctness: it wouldn't be ideal if an example could only ever have 0.33, 0.67, or 1.0 correctness).

Finally, our design and frontend team created an informative, user-friendly interface to communicate this information, and provide multiple entry points from Data Map to our document labeling experience. Customers can see Data Map results for individual documents while labeling, filter their list of labeled documents by whether they’re mislabeled, and can preview and jump directly to a mislabeled document by clicking on its data point in the graph. 


Data Map is a key utility for quality assurance of datasets in Primer, saving customers time by allowing them to review a subset of examples for errors rather than the entire dataset. Techniques like Data Map are particularly powerful because they help customers get more performance out of existing data rather than label new data, and they are a clear example of Primer's focus on data efficiency. We're determined to keep time-to-value short so customers can hit the ground running on their projects with a performant model.

For more information about Primer and to access product demos, contact Primer here.

Analysts in financial organizations are faced with the task of analyzing massive amounts of data in order to make critical decisions involving risk, compliance, customer experience, and investments. The problem is that the data is unstructured – it doesn’t fit into a tabular format – which means (until recently), humans needed to process it manually.

Today, analysts spend roughly two-thirds of their time, on average, collecting and understanding this data before knowing whether the information is material (Deloitte Insights). Meanwhile, regulatory documentation is ballooning in volume, making it even harder for humans to keep up and spurring the need to automate this analysis.

Primer has created two ready-to-use NLP models that structure information from financial documents, making it simple to extract insights: Finance Entity Recognition and Finance Relation Extraction (FER Relex).

One of the core tasks of NLP is Named Entity Recognition (NER), which extracts entities in text and classifies them into predefined categories (companies, products/services, people, etc). When tasked with extracting insights from huge amounts of financial data, automatic entity recognition is an important first step, and then adding relationship extraction on top takes your analysis to the next level.  

FER Relex, for example, can run through a document and 

  1. identify all products that a company has, and then
  2. identify all revenue and financial metrics for an organization that an analyst might be looking for

FER Relex is great for investor portfolio analysis, and it also supports use cases outside of finance, like market research, brand awareness, competitive analysis, and risk and compliance work. 

Ok, let’s put these two models together in a finance use case.

Building a financial entity extraction NLP model

The most commonly used entities in NER are people, organizations, and locations, but these categories are often insufficient for industries such as finance, medicine, or national security. 

Financial analysts can't identify entities of interest using off-the-shelf NER; recognizing finance-specific entities requires training a specialized model from scratch, which can take months, if the effort succeeds at all.

To create our Finance Entity Recognition ("FER") model, we arrived at six relevant entity types – Product/Service, Industry/Sector, Regulation, Financial Metric, Ticker, and Currency Amount – and then trained the model accordingly.

Training an FER model: Labeling data

The most important step in building most machine learning models is acquiring a high-quality labeled dataset, and that was certainly the case here. One hundred great labels can mean the difference between a model changing the business outcome, or it just being another underperforming prototype, so we invest a lot here. 

We acquired a diverse set of over 500 documents consisting of SEC filings, earnings call transcripts, and various financial news publications. The next step was to have our labeling team label occurrences of our financial entities in the text. As an aside, if you want help labeling a model, keep in mind that the same team we use to label these models is available to Primer customers!

Here are a few labeling examples:

Labeling challenges

After labeling some documents, we realized that this task, like many labeling tasks, contains inherent ambiguities. For example, we knew that “financial metrics” would cover common accounting measures like EBITDA, COGS, and EPS. However, we hadn’t considered how to parse something specific like “Joint Venture amortization of acquired above and below-market leases, net”. What should the metric be here? Just “amortization”, the whole phrase, or some other part of the phrase? 

Additionally, we had to decide whether to consider generic products like “medicine” and “toothpaste”, or to include only named products. We encountered many such edge cases that experts might not agree on. We iterated on the definitions with our labeling team, and relied on inter-annotator agreement to help us identify which edge cases needed tightening.

As organizations apply NLP to their particular use cases, SMEs and data scientists need to make similar decisions in their labeling processes to ensure that models capture the most relevant information. 

Training a token classification model

Entity extraction tasks are typically framed as token classification tasks. Given a text document, we first split it into tokens, which are typically words or subword strings. We then assign a label to each token. The most common token labeling scheme is called BIO (beginning, inside, outside) tagging. A token gets a “B-” prefix if it’s the first token of an entity, an “I-” prefix if it’s inside an entity, or an “O” label if it’s not part of any entity. The suffix for B and I tokens is the entity type. For example, the entity-tagged sentence

might get transformed into the following tagged token sequence

We trained our token classification model using a pre-trained XLNet transformer. XLNet was pre-trained using an improved methodology compared to BERT and performs better on a variety of tasks. It comes with an accompanying tokenizer which tokenizes text according to a fixed set of rules. The English words in our above example got a single token each, as did punctuation marks, but the ticker symbol “CMCSA” got split up into two tokens. The ▁ characters indicate whitespace before the token. 
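For illustration, here's a minimal sketch of converting character-level entity spans into BIO tags. The sentence, offsets, and entity type names are made up, and real tokenizers like XLNet's produce subword tokens rather than the whole words used here:

```python
# Sketch: assign a BIO tag to each token given character-level entity spans.
# tokens: list of (token_text, start_char, end_char)
# entities: list of (start_char, end_char, entity_type)

def bio_tags(tokens, entities):
    tags = []
    for token, t_start, t_end in tokens:
        tag = "O"  # default: outside any entity
        for e_start, e_end, e_type in entities:
            if t_start >= e_start and t_end <= e_end:
                # "B-" for the entity's first token, "I-" for the rest.
                tag = ("B-" if t_start == e_start else "I-") + e_type
                break
        tags.append(tag)
    return tags

text = "net expense ratio rose"
tokens = [("net", 0, 3), ("expense", 4, 11), ("ratio", 12, 17), ("rose", 18, 22)]
entities = [(0, 17, "METRIC")]  # "net expense ratio" is one entity
bio_tags(tokens, entities)
```

A multi-token entity like this one gets a single `B-METRIC` followed by `I-METRIC` tags, which is how the model learns entity boundaries rather than just entity membership.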

Dataset cartography

Given how crucial high quality labels are for creating a good machine learning model, we decided to see if we could improve label quality even more before training our final model. To do this, we used a technique inspired by the paper Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics.

In the paper, Swayamdipta et al. calculated training dynamics during model training, named confidence, variability, and correctness. For a model trained over multiple epochs, confidence for a single example is the mean probability of true prediction across the epochs, variability is the standard deviation of the probability of a true prediction across the epochs, and correctness is the proportion of time the example was predicted correctly across the epochs. 

The intuition is that low-confidence examples are likely to be mislabeled, and high-variability examples are often ambiguous (see Figs. 1 and 2 in the paper). If you wanted to correct mislabeled examples in your dataset, you would be best served to check the lowest-confidence examples first. 

We adapted the dataset cartography technique to token classification by focusing on confidence only, and imposing a shading scheme to show which tokens the model is less confident about. In the following example, the text color indicates the label given by the annotator, and low-confidence tokens are indicated by a darker blue highlight.

Here we see that net expense ratio and distribution or service fees are correctly labeled as metrics, while the model has very low confidence that management fees and sales charge are not entities – i.e., they should probably be labeled as metrics too. This shading scheme allowed our annotators to quickly scan the documents for false negative labels, false positive labels, or labels with the wrong entity type.

Relation extraction

Extracting entities is useful, but that is just where the fun begins. 

The next Engine we built using our financial entities is called Financial Relation Extraction. Relation extraction is the task of deciding what relationship, if any, exists between two given entities in a text. We incorporated the traditional Organization and Location entities of NER in order to train a relation extraction model that can answer the following questions: 

  • Product/Service to organization: Which company owns the product or service?
  • Industry to organization: What industry does this company or organization belong to?
  • Organization to location: Which location is this company targeting, breaking into, or located in?
  • Currency amount to financial metric: Connect a dollar amount to the financial metric it refers to.
  • Ticker to organization: Connect a stock ticker to a company name.

We labeled pairs of entities as either having one of these relations, or having no relation. The approach we used to train the relation extraction model is called Typed Entity Markers. Conceptually, the goal is to activate associations learned by a language model during pre-training by highlighting entities of interest and the concepts they represent. We focused the model’s attention on the entities of interest by wrapping each entity in the text with the “@” symbol (the “marker”) and we prepended the assigned label from the FER model (the “type”) to provide additional contextual clues. For example, consider this sentence with tags output from the FER model:

For each pair of entities in the sentence, we apply typed entity markers to obtain inputs

  1. @ *organization* Exxon Mobil Corporation@ is an American multinational @ *industry* oil and gas@ corporation headquartered in Irving, Texas.
  2. @ *organization* Exxon Mobil Corporation@ is an American multinational oil and gas corporation headquartered in @ *location* Irving, Texas@.
  3. Exxon Mobil Corporation is an American multinational @ *industry* oil and gas@ corporation headquartered in @ *location* Irving, Texas@.

While texts that contained many entities resulted in a large number of entity pairs to consider, we reduced the number of samples run through the model by recognizing that many entity pairs have types that are incompatible with any relation of interest. For example, sample (3) above has entity types “industry” and “location,” a pair we can label as “no relation” without any additional computation.  
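A minimal sketch of constructing these typed-entity-marker inputs, including the type-compatibility shortcut, might look like the following. The compatibility table, offsets, and sentence are illustrative, not the production model's configuration:

```python
# Sketch: wrap each entity of a candidate pair in "@" markers, prepending
# its FER type, and skip pairs whose types can't hold any relation.

COMPATIBLE = {("organization", "industry"), ("organization", "location"),
              ("organization", "product/service"), ("organization", "ticker"),
              ("financial metric", "currency amount")}

def mark_pair(text, ent1, ent2):
    """Each ent is (start_char, end_char, type). Returns the marked text,
    or None when the type pair can be labeled 'no relation' for free."""
    t1, t2 = ent1[2], ent2[2]
    if (t1, t2) not in COMPATIBLE and (t2, t1) not in COMPATIBLE:
        return None  # no relation possible; skip the model entirely
    marked = text
    # Insert markers right-to-left so earlier offsets stay valid.
    for start, end, etype in sorted([ent1, ent2], key=lambda e: -e[0]):
        marked = (marked[:start] + f"@ *{etype}* " + marked[start:end]
                  + "@" + marked[end:])
    return marked

text = "Exxon Mobil is an oil and gas corporation."
org = (0, 11, "organization")
ind = (18, 29, "industry")
mark_pair(text, org, ind)
```

Only the marked strings for compatible pairs ever reach the transformer, which is where the compute savings come from.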

Once the input data was constructed for each entity pair of interest, we tokenized the text for use with a RoBERTa pre-trained transformer encoder. RoBERTa outputs vector representations for each input token, but we selected just the representations for the first “@” marker as an embedding for each entity. Those embedding vectors were passed through a classification neural network, which either assigned one of the defined relations above or the label “no relation.” Since we considered a custom list of relationships, the whole model was fine-tuned on a domain-relevant dataset.

Putting it all together, here are some examples of extracting financial entities and relationships from text. Given an input like

“As we continue to navigate the ongoing global pandemic, we continue to see a slower rate of commercialization of ANJESO than we would have expected without COVID-19, but feedback from users and our third-party market research is encouraging,” said Gerri Henwood, President and CEO of Baudax Bio.

Our FER model extracts ANJESO as a product/service, and the NER model extracts Baudax Bio as an organization. Then, the relation extractor determines that the relation between these two entities is Product/Service to Organization.

Here’s an example with multiple relationships:

First, in October, we amended and restated our bank term loan, increasing the outstanding balance to $350 million from $325 million, extending the maturity to April 2024 from April 2021, and reducing the interest rate spread to 90 basis points, down from 95 basis points over LIBOR.

The model extracts three financial metrics: outstanding balance, interest rates, and LIBOR. It also extracts $350 million and $325 million as currency amounts, and identifies that these both refer to the same metric (outstanding balance). 

These models show just a few ways that structure can be extracted from unstructured text. Many more entity types and relationships are possible. What relationships are hiding in your data?

Applied natural language processing (NLP) is all about putting your domain expertise into an infinitely scalable box. Expertise is expensive, and engaging a fleet of experts to trawl through the deluge of text that comes into your systems every day is cost prohibitive. Even if money were no object, it would be challenging to find, recruit, and manage the sheer number of experts needed to effectively deal with an ongoing torrent of new information. 

What if you could bring your expertise to bear on only a few examples and then hand them off to a machine to analyze? That’s the promise of NLP. You put your domain expertise in a machine readable format by annotating data, and the NLP algorithms will turn that data into a box that can understand not only what you’ve shown it, but also generalize to understand previously unseen words, phrases, and concepts. Because it’s a machine, not a human, you can scale your new box infinitely. 

Labeling is laborious

The one snag is that you’ll need to label data to get your expertise into a machine readable format. No one likes to label data. It costs money and takes time. It’s hard to do. It’s tedious and monotonous. Unfortunately, when most people realize how laborious annotation is, they try to minimize the amount of annotation they need to do instead of minimizing the time they spend labeling. Not only do most “cost saving” workarounds end up costing more, they also reduce the quality of the data that is produced.  

Multitasking adds exponential risk

The number one failure point in annotation projects is asking annotators to do multiple things at once. The logic behind this mistake seems sound at first glance. “I’m paying these folks by the hour, so why would I have them look at the same document three times, once for each task I need them to do? I can have them look at the document once and do three things instead.”

But, this unfortunate line of thinking overlooks a whole host of subtleties. Annotation throughput is only important when annotation quality is satisfactorily high. Remember the adage: “Garbage in, garbage out.” You want your annotated examples to reflect the full extent of your expertise—gold, not garbage, in that scalable box.

The x-factor of cognitive load

Annotation quality is heavily influenced by the cognitive load placed on the annotator. If you ask someone to make three decisions instead of one, that’s 3x the cognitive load on that person. Not only does this make every decision slower, but it also increases the probability of error with every decision, thus lowering the throughput and quality of the entire annotation endeavor. 

Touch time vs. lead time

Another source of this misguided tendency towards multitasking stems from a project manager’s inability to distinguish between lead time and touch time. Touch time measures the time it takes to make a single annotation. For example: how long does it take to find a word and double click on it? Maybe 500 milliseconds. 

Lead time, on the other hand, measures the time it takes for the annotator to be ready to begin annotation work. For example, how long does it take for the source document to load? It could take several more minutes if the annotator needs to first call their manager and ask what to work on. Stepping back even further, lead time also includes the time it takes for a project manager to plan the work for a team of annotators and build a work schedule. This could be days or weeks if the manager is busy with other higher priority tasks, out sick, or on vacation. 

Goodbye lead times

Long lead times might be something your organization is used to (and thus expects), but unlike death and taxes, you can actually avoid them. In fact, LightTag makes lead times go away. In annotation workflows, the primary source of lead time is coordinating work: the time it takes for an annotator to discover the next unit of work they need to complete, fetch it, and read any necessary instructions. LightTag brings lead time down from minutes to milliseconds by fully automating the allocation of work amongst a pool of annotators and embedding instructions within the labeling interface. 

Case study: Drug-drug interaction annotation 

Let’s look at a case study. A major pharmaceutical company wanted to build a corpus of drug–drug interactions (DDIs). Given a paragraph mentioning two or more drugs, the task was to extract the pairs of drugs that were said to be interacting (e.g., don’t mix Xanax and alcohol). 

Annotating DDI references would seem to require a team of highly paid pharmacologists to pore through thousands of medical articles, highlight all mentions of drugs, and then somehow connect each pair of interacting drugs — a slow and expensive process indeed. 

LightTag’s UX and project management capabilities reduced both lead time and touch time to the point where our customer was able to outsource the annotation work and use their internal domain experts only for spot checking the final result. 

How did it work? LightTag lowered the cognitive load on annotators by pre-annotating individual drugs and displaying one pair of highlighted drugs at a time, allowing them to work faster and with less risk of error. 

Traditionally, relationship annotation is done via a drag-and-drop interface where annotators manually form connections between entities. LightTag further reduced touch time by changing that paradigm. Since pairs of entities were pre-annotated, the annotator needed to only make a single click to classify if and how a pair was related, instead of the multiple point-and-click steps they’d need to take in a traditional setup. 

LightTag also minimized lead time by allowing our customer to outsource the annotation work. They could define the work to be done in bulk and automatically distribute the work to annotators in small batches. This meant that the project management team did not need to spend time during the course of the project on work allocation, nor did annotators have to wait for work to be assigned to them. The end result: what used to take months, took only a few days. 

Annotation is a core part of delivering customized and targeted NLP solutions. Primer had been a LightTag customer for years before acquiring the company and integrating those capabilities into Primer’s own NLP platform. 

Today, LightTag is available both as a standalone SaaS or on-premises offering for those seeking an annotation solution only. LightTag capabilities are also available as part of the Primer platform where it integrates with model training, data ingestion, and production QA for NLP workflows. 

To get hands on with LightTag, try it for free today.  

Copy a simple Apps Script into any Google spreadsheet to quickly run multiple Primer Models on text in any data cell

Understanding a large corpus of text-based documents is getting a lot easier with NLP. Where previously we might need a human to perform a manual review of social media posts, news articles, or financial and legal documents, now we can process them using NLP models to extract entities and key phrases, understand sentiment, and classify documents.

This tutorial shows you how to run a sentiment model and a key phrase extractor with Primer Engines to analyze customer reviews. We’ll get our reviews from Amazon and then use Primer’s pre-trained models (we call them Engines) to analyze sentiment and extract phrases from each review to see what patterns we can detect. To make things even simpler, this tutorial lets you play around with Engines inside of Google Sheets.

To get your Engines API key, sign up here:

Once you have an account, you’ll be able to create your API key within the app:

All you’ll need to run this tutorial is a Google spreadsheet and a free Primer Engines account. Let’s get started. 


To start, let’s collect the top 10 Amazon customer reviews for this bubble machine, as well as a few one-star reviews. You can do this yourself, but we’ve gone ahead and collected them in this spreadsheet that you can clone.

Once we have built our dataset, we want to empower the spreadsheet with NLP. To do this, copy the following snippet of code into the Apps Script module under the Extensions tab.

var api_key_row = 2;
var api_key_col = 2;
var API_KEY = SpreadsheetApp.getActiveSheet().getRange(api_key_row, api_key_col).getValue();

// Flag for editing without triggering requests.
let calls_disabled = false;

// Default Engines request wrapper.
class EnginesRequest {
  constructor(endpoint) {
    this.base_url = "";
    this.endpoint = endpoint;
    this.headers = {"Authorization": "Bearer " + API_KEY, "Content-Type": "application/json"};
  }

  post(json_body_as_string) {
    // Fastest option, but subject to 30/minute Engines rate limiting.
    if (calls_disabled) {
      return "Engine Offline";
    }
    let request_url = this.base_url + this.endpoint;
    let options = {"method": "post", "payload": json_body_as_string, "headers": this.headers};
    let response = UrlFetchApp.fetch(request_url, options);
    return [[this.endpoint, response.getContentText()]];
  }

  post_async(json_body_as_string) {
    if (calls_disabled) {
      return "Engine Offline";
    }
    let request_url = this.base_url + this.endpoint;
    let options = {"method": "post", "payload": json_body_as_string, "headers": this.headers};
    let response = UrlFetchApp.fetch(request_url, options);
    if (response.getResponseCode() !== 200) {
      return [[this.endpoint, response.getContentText()]];
    }
    // Poll the result endpoint until the async job completes.
    let result_id = response.getContentText();
    options = {"method": "get", "headers": this.headers};
    let get_url = this.base_url + "v1/result/" + result_id.replace(/['"]+/g, "");
    let get_response = UrlFetchApp.fetch(get_url, options);
    let count = 0;
    while (get_response.getResponseCode() == 202 && count < 1000) {
      count += 1;
      get_response = UrlFetchApp.fetch(get_url, options);
    }
    return [[this.endpoint, get_response.getContentText()]];
  }
}

function genericEngineSync(endpoint, text) {
  if (text === "" || text === null || text === undefined) {
    return "Engine Disabled - Text Field Required";
  }
  let req = new EnginesRequest(endpoint);
  let body = JSON.stringify({ text });
  return req.post(body);
}

function genericEngine(endpoint, text) {
  // Async by default, as synchronous requests are rate limited.
  if (typeof API_KEY === 'undefined') {
    return "Engine Disabled - API Key Required";
  }
  if (text === "" || text === null || text === undefined) {
    return "Engine Disabled - Text Field Required";
  }
  let req = new EnginesRequest(endpoint);
  let body = JSON.stringify({'text': [text]});
  return req.post_async(body);
}
Once you’ve pasted in the code, be sure to save the script.

Extracting sentiment with Primer’s NLP models

Now that we have the logic in place that we’ll call from our spreadsheet, let’s give it a try with Primer’s Sentiment model. For a full list of available models, check out the API documents here:

In our sheet, let’s put a cell next to the first review and use the following function call:


Our cell will have an error stating we need to include the API key.

You can update the script to define the key directly, or you can update the location where the script looks for the key: column B, row 2.

To get your Engines API key, if you haven’t already, sign up here: Once you have an account, you’ll be able to create your API key within the app:

With the key added in – and partially obscured 🙂 – you should now see the results displayed for your input data. The model’s magnitude, 0.97 in this case, shows its confidence in the sentiment label: the closer to 1, the higher the confidence. For full details on our Sentiment model, please see the documentation here:
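If you later export these JSON results from the sheet for post-processing, a small Python sketch of parsing one response (assuming the shape shown by the model output, e.g. {"sentiment": "negative", "magnitude": 0.973}) might look like:

```python
import json

def parse_sentiment(response_text, threshold=0.5):
    """Parse a sentiment response and keep only confident labels.
    The response shape is an assumption based on the example output."""
    result = json.loads(response_text)
    label = result.get("sentiment", "unknown")
    magnitude = float(result.get("magnitude", 0.0))
    # Treat low-magnitude labels as unreliable.
    if magnitude < threshold:
        return "uncertain", magnitude
    return label, magnitude

print(parse_sentiment('{"sentiment": "negative", "magnitude": 0.973}'))
```

The threshold is a hypothetical post-processing choice, not part of the model itself.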

You can now get a quick look at the data by dragging the function across our other data cells.

Looking at the data manually, it looks like the bubble maker wasn’t a positive experience for one cat owner. We can run the Primer Key Phrase extractor model on the data to get additional details that may help us understand why. 

Just add the following code next to that cell and let’s take a look at phrases in this negative review.


Some keywords come to the surface – “motor”, “noisy”, “scares” – that give us an idea of where this negativity is coming from. The span field shows us the exact character index where the phrase was found in the text. Let’s apply this to the whole sheet so we’re both determining sentiment for every review and extracting phrases, giving us a dataset that we can parse and display for some high-level insight.

OK, we’ve extracted the sentiment and key phrases of each document. To make the data easier to parse, we can add the following bit of JavaScript to the Apps Script we copied over earlier. It adds some formatting logic.

class DefaultDict {
  constructor(defaultVal) {
    // Proxy returns defaultVal for any key that hasn't been set yet.
    return new Proxy({}, {
      get: (target, name) => target[name] || defaultVal
    });
  }
}

function aggregatePhraseSentiment(sentiment_range, phrase_range) {
  if (typeof API_KEY === 'undefined') {
    return "aggregatePhraseSentiment script";
  }
  let positive_sentiment_phrases = new DefaultDict(0);
  let negative_sentiment_phrases = new DefaultDict(0);
  let phrase_seen_count = new DefaultDict(0); // for sorting by value descending to show the most popular phrases
  let phrase_name = "";
  for (const phrase_list_index in phrase_range) {
    let phrase_dict = JSON.parse(phrase_range[phrase_list_index]);
    if ("phrases" in phrase_dict) {
      for (const phrase_index in phrase_dict["phrases"]) {
        phrase_name = phrase_dict["phrases"][phrase_index]["phrase"];
        phrase_seen_count[phrase_name] += 1;
        // Check the corresponding sentiment, e.g. {"sentiment": "negative", "magnitude": 0.973}
        let sentiment_dict = JSON.parse(sentiment_range[phrase_list_index]);
        if ("negative" === sentiment_dict["sentiment"]) {
          negative_sentiment_phrases[phrase_name] += 1;
        } else if ("positive" === sentiment_dict["sentiment"]) {
          positive_sentiment_phrases[phrase_name] += 1;
        }
      }
    }
  }
  // Sort phrases by how often they were seen, descending.
  var popular_phrases_descending = Object.keys(phrase_seen_count).map(function(key) {
    return [key, phrase_seen_count[key]];
  });
  popular_phrases_descending.sort(function(first, second) {
    return second[1] - first[1];
  });
  let results = [["aggregatePhraseSentiment script", "Phrase", "Count Positive", "Count Negative"]];
  for (const phrase in popular_phrases_descending) {
    let key = popular_phrases_descending[phrase][0];
    let count_positive = positive_sentiment_phrases[key];
    let count_negative = negative_sentiment_phrases[key];
    results.push(["", key, count_positive, count_negative]);
  }
  return results;
}

Now from our sample set reviews, we can see at a glance which features were positive and which were negative about the bubble maker. You can add as many reviews into this dataset as you’d like to understand the customer experience of a particular product. Or you could do this with any unstructured dataset for your organization. 

All Primer Engines can be called using the Google Sheet Apps Script function calls. Here’s a collection of a few Primer Engines being run on sample data.


For more information about Primer and to access product demos, contact Primer here.

Sometime in 2014, Bastian Obermayer, a reporter working for the Süddeutsche Zeitung newspaper, was sent 11.5 million leaked documents about off-shore financial operations, later known as the Panama Papers. It took over a year of analysis by journalists in 80 countries to dig into this immense corpus.

Making sense of the contents of a large set of unknown documents is relevant to many industry applications. In investigative journalism or intelligence operations, quickly identifying the most important documents could be a make-or-break effort. Other times, understanding an emerging or evolving domain can be valuable. For example, imagine mapping new debates on teenage use of digital devices and mental health, or the evolution of topics covered over the decades by popular newspapers.

We developed a prototype pipeline to help address this broad use case as part of our work with Vibrant Data Labs. The pipeline is designed to ingest an arbitrary set of documents, produce a hierarchical visualization of the contents, and finally make the corpus searchable by tagging each document with both specific and broad keywords. 

In this post, we’ll present the pipeline’s methodological design. As we’ll see, its harnessing of the power of deep NLP models, both open-source as well as available on Primer Engines, opens up new ways of tackling text analysis tasks.

Task overview

The task requires carrying out three main steps.

First, we need to understand what each document is about. Next, we want to look across the corpus to get a big-picture view of what it covers. Importantly, we’ll want to determine which domains are broad and which are more specific. Finally, we want to use this hierarchical lens to tag the documents in a consistent manner, thereby exposing them for search.

Let’s look at each step in turn.

Document representation

Document representation means converting a text document to a vector representing what the document is about. Commonly used approaches include applying a TF-IDF transformation or calculating document embeddings.

Both approaches, however, have some severe shortcomings for our purpose. TF-IDF offers interpretable representations but ignores semantic similarity. Document embeddings understand such semantics, but come at the price of opaque, and therefore not searchable, document representations. 
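To make the TF-IDF shortcoming concrete, here is a minimal, self-contained sketch (not the pipeline’s actual code) showing that two documents about nearly the same thing, but with no shared tokens, get a similarity of exactly zero:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal TF-IDF: term frequency weighted by inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values())) or 1.0
    return dot / (norm(u) * norm(v))

docs = ["the dishwasher is quiet", "a silent washing machine"]
vecs = tfidf_vectors(docs)
# No shared tokens, so TF-IDF similarity is exactly zero,
# even though the documents are semantically very close.
print(cosine(vecs[0], vecs[1]))  # 0.0
```

A semantic embedding model would score these two documents as highly similar, which is exactly the gap the pipeline needs to close while keeping the representation interpretable.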

Our solution uses Primer’s Abstractive Topic Generation Engine. Its power comes from its understanding of both semantics and context, offered by its underlying deep language model, combined with the generation of plain-language outputs.

With this Engine, we are able to dramatically cut through the complexity of free text and reduce documents to a handful of selected, highly-relevant, and intelligible topic labels.

Get a high-level view across documents 

The next challenge is going from the individual document representations to a zoomed out view of the corpus content. 

Traditionally, one could tackle this by clustering similar documents together or using topic modelling techniques like Latent Dirichlet Allocation.

We take a different approach. Instead of grouping the documents, we work on the extracted topic terms and learn the relations in that set. To do so, we carry out two simple steps using off-the-shelf tools:

  1. To measure semantic distance between terms, we project these into a vector embedding space using SentenceBERT, an open-source sentence embedding model. 
  2. We use agglomerative clustering – a bottom-up hierarchical clustering method – to extract a tree data structure connecting related terms into ever broader groups sharing semantic proximity. 
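The agglomerative idea behind step 2 can be sketched in a few lines. This toy version merges the two closest clusters (single linkage) on made-up 1-D “embeddings”; the real pipeline uses SentenceBERT vectors and off-the-shelf clustering tools:

```python
def agglomerative_merges(points, labels):
    """Toy bottom-up clustering: repeatedly merge the two closest
    clusters and record each merge, from narrow groups to broad ones."""
    clusters = {i: {i} for i in range(len(points))}
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        ids = list(clusters)
        a, b = min(
            ((p, q) for i, p in enumerate(ids) for q in ids[i + 1:]),
            key=lambda pair: min(dist(x, y) for x in clusters[pair[0]] for y in clusters[pair[1]]),
        )
        clusters[a] |= clusters.pop(b)
        merges.append(sorted(labels[i] for i in clusters[a]))
    return merges

# Toy 1-D "embeddings": nearby appliances merge first.
labels = ["washing machine", "dishwasher", "oven", "microwave"]
points = [(0.0,), (0.4,), (3.0,), (3.3,)]
print(agglomerative_merges(points, labels))
```

The recorded merges form exactly the kind of tree described above: tight groups first, then ever broader ones.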

This is how the pipeline can learn that ‘washing machine’ and ‘dishwasher’ are related, as are ‘oven’ and ‘microwave’, and as we look across a wider semantic radius, these will eventually fall in a common group. 

Understanding the hierarchy across domains is key to making the corpus searchable, as user queries will range from specific to broad. 

One crucial step is still missing though: how do we label this broader group? The richness of deep language models comes to our aid here. We’ve found that simply selecting the term that is most similar – based on embedding similarity – to the other terms in the group yields a pretty good representative item. 
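As a toy illustration of this centrality-based selection (the vectors here are fabricated stand-ins, not real SentenceBERT embeddings), picking the term with the highest summed cosine similarity to the rest of its group might look like:

```python
import math

def most_central_term(embeddings):
    """Pick the term whose summed cosine similarity to all other
    terms in the group is highest."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = lambda w: math.sqrt(sum(a * a for a in w))
        return dot / (norm(u) * norm(v))
    scores = {
        term: sum(cos(vec, other) for t, other in embeddings.items() if t != term)
        for term, vec in embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Toy 2-D vectors: 'saucepan' shares features of both the pot and pan groups.
embeddings = {
    "saucepan": (1.0, 1.0),
    "pot": (1.0, 0.2),
    "pan": (0.2, 1.0),
    "wok": (0.1, 1.0),
}
term, _ = most_central_term(embeddings)
print(term)  # saucepan
```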

Let’s look at the concrete example below. Strikingly, the term ‘saucepan’, which combines elements of both pots and pans, indeed emerges as the most central term. On the other hand, ‘wok’ and ‘teflon pan’, which can be thought of as specific types of pans, are at the bottom of the ranking as representative terms.

Rank | Term       | Centrality Score
4    | teflon pan | 2.728998

Moreover, the selected representative terms become more conceptually broad and abstract as we seek to label more diverse groups of terms. We can see this behavior in the examples below, wherein the more abstract term is chosen as the most representative when set alongside two related but semantically distinct terms.

Rank | Term        | Centrality Score
2    | table cloth | 1.887701

By virtue of this feature, these two simple steps allow building a structured view over topic space in the corpus, offering both narrow and broad perspectives.

Search documents using hierarchical relationships

Finally, we can proceed to tag the original documents and power the search functionality. Starting from the original document topic labels, we use the relationships in the term tree to ensure each document is also linked to corresponding, more abstract, domains. For example, depending on the extracted hierarchy, ‘microwave’ could also be tagged as ‘cooking’, ‘household items’, ‘consumer durables’, ideally with a declining relatedness score as we move further up these abstractions. Notably, this means that microwave products would now be picked-up in searches for both microwaves specifically as well as household appliances in general.
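The propagation step above can be sketched as follows. The parent map and the multiplicative decay are illustrative assumptions, not the pipeline’s exact scoring:

```python
def propagate_tags(doc_tags, parent, decay=0.6):
    """Extend a document's specific tags with their ancestors in the
    term tree, decaying the relatedness score at each step up."""
    tagged = dict(doc_tags)
    for tag, score in doc_tags.items():
        node, s = tag, score
        while node in parent:
            node = parent[node]
            s *= decay  # broader concepts get weaker scores
            tagged[node] = max(tagged.get(node, 0.0), s)
    return tagged

# Hypothetical slice of an extracted hierarchy.
parent = {"microwave": "cooking", "cooking": "household items"}
tags = propagate_tags({"microwave": 1.0}, parent)
print(tags)
```

With this, a document tagged only with ‘microwave’ also surfaces in searches for ‘cooking’ or ‘household items’, at progressively lower scores.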

Augmenting insights into your data with deep NLP

We’ve tried this tool on different types of datasets and found it can provide valuable initial insights out-of-the-box. We’ve played around with book blurbs, news documents, academic papers on COVID-19, and documents made public by the CIA. In all cases, we were able, at a minimum, to get an immediate understanding of what the documents were about and to have a way of searching the documents we were most interested in. 

Clustering documents or extracting topics are not new tasks in the domain of unsupervised learning. However, the above workflow differs from these traditional approaches by drawing on the vast additional insight offered by deep language models, such as the Abstractive Topics Engine and SentenceBERT.  Without language models, one would be limited to making sense of documents only based on the distribution of features in the local corpus. Instead, modern NLP can interpret documents, even in small datasets, using the understanding gained over the vast training corpus that is embedded in the language model itself. 

This is the design choice of the Primer Engines, where powerful NLP models are exposed via an API to support the creation of composable NLP pipelines on customer documents. 

Are you curious to see how our hierarchical topic prototype works in practice? Have a look at our tutorial where we create a product inventory from item descriptions from Amazon.

Further reading

SentenceTransformers Documentation

Agglomerative Clustering example, Wikipedia

Agglomerative Clustering, Scikit-learn User Guide

AgglomerativeClustering implementation, Scikit-learn Documentation

Original paper introducing Hierarchical Latent Dirichlet Allocation


Distilling the contents of a large set of unknown free text documents and understanding how they relate to each other is relevant to many industry applications. This tutorial will take you through the HierarchicalTagger, a pipeline of NLP models created to address this broad use case. 

Combining the power of Primer Engines with a custom prototype built on top of deep NLP models, the pipeline is designed to ingest an arbitrary set of documents, produce a hierarchical visualization of their contents, and finally make the corpus searchable by tagging each document with both specific and broad keywords. Check out this post if you would like to know more about how it works.

In this tutorial, we’re going to run the pipeline on a dataset from Amazon to create a product inventory from raw item descriptions. To run this tutorial, you’ll need to sign up for free for Primer Engines, our pre-trained models built for developers and data scientists.

At the end of this process, we’ll have a visualization like the one above, showing the hierarchy of topics covered by the product descriptions. We’ll also have a tagged and searchable document set containing both narrow and wide keywords. Finally, we will run a simple web app exposing both the visualization and search functionality to the user via a UI. 

Here we go!


We’ll start by getting access to the relevant code, data and additional packages. In a terminal, clone our GitHub repository and navigate to its root directory:

$ git clone
$ cd primer-hierarchical-tagger

The full code for the workflow we’ll be going through here can be found in the examples folder, and we’ll use the webapp folder to spin up the app. If you want to have a look at the internals of the pipeline, check out the code and comments in the HierarchicalTagger class.

Next, let’s create a separate environment for our code to run in. We used virtualenvwrapper, but you can use your favorite method. We ran these commands to create the environment and install the required packages:

$ mkvirtualenv ht-repo
$ pip install --upgrade pip
$ pip install -r requirements.txt

The instructions in the requirements.txt will install all package dependencies, including the hierarchical_tagger module itself. 

Finally, download the Amazon Product Dataset 2020 and save it in the examples/data/ folder. We renamed the file to amazon_products_2020.csv.

We are now ready to launch a Jupyter notebook. We do so from the root folder of the repository and add its path to the PYTHONPATH. This will allow us to call any Python modules found in the root of the repository from inside the notebook.

$ PYTHONPATH=$(pwd) jupyter notebook

Open up the amazon-product-descriptions.ipynb in the examples folder. Run the cells under the Set-up section to make sure all required packages are imported and paths are set up correctly. 

That’s it, we’re good to go!

Generate abstractive topics via Engines

We are ready to tackle the first substantive step in the pipeline: understanding what each document is about. Primer’s Abstractive Topic Generation Engine is very well suited for this step. Given a raw text document, the engine generates a handful of selected, highly-relevant and intelligible topic labels.

In practice, we would now hit the Primer APIs with batches of documents for processing and receive the desired results back. So that you can proceed directly to the next steps, we’ve done this for you, and included the processed results for a random sample of 3,000 products in this file. Feel free to save your Engines credits and proceed to the next section.

However, running the pipeline on your own data is easy and we’ve included everything you need to get going. 

First, sign up for an Engines free trial here. You get 2,000 free credits, which will cover processing for up to 2,000 short documents. If you would like to try the pipeline on a larger document set, please email to request additional credits.

Once you obtain an API key, save it in credentials.py as ENGINES_API_KEY="YOUR_ENGINES_API_KEY". This file is outside of version control, so it won’t be revealed to others. This way, you can also import the key into the notebook instead of hard-coding it.

from credentials import ENGINES_API_KEY

Next, you’ll need to massage your documents into the standard format expected by the Abstractive Topic Generation Engine: a list of dictionaries with an id and a text key. This is how we did this for the Amazon Product Dataset. You would need to edit this line according to the format of your own data.

documents = [{"id": r["Uniq Id"], "text": r["About Product"]} for i, r in sampled_items.iterrows()]

We created the infer_model_on_docs helper function that will take care of the communication with the Engines API for you. There is a LOT going on under the hood in this function. API calls are asynchronous, which means ‘waiting time’ while expecting a result from the API servers can be used productively by carrying out other operations in the program, for example to trigger other concurrent requests, or process the results from previous requests made. This increases the document processing throughput. Additionally, the functions below also allow batching of documents, so that a single request can return results on multiple documents. Finally, we use the tenacity module to carry out automatic retries when facing transient errors. 
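The concurrency pattern behind that helper can be sketched as below, with a fake stand-in for the actual API request (the real infer_model_on_docs adds authentication and tenacity-based retries):

```python
import asyncio

async def fake_engine_call(batch):
    """Stand-in for one batched API request; sleeps to simulate latency."""
    await asyncio.sleep(0.01)
    return {doc["id"]: ["topic for " + doc["id"]] for doc in batch}

async def infer_concurrently(documents, batch_size=2):
    # Fire all batch requests concurrently instead of one at a time,
    # so waiting on one response overlaps with the others.
    batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]
    results = await asyncio.gather(*(fake_engine_call(b) for b in batches))
    merged = {}
    for r in results:
        merged.update(r)
    return merged

docs = [{"id": str(i), "text": "..."} for i in range(5)]
topics = asyncio.run(infer_concurrently(docs))
```

The total wall time is roughly one request’s latency rather than the sum over all batches, which is where the throughput gain comes from.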

You can find the code to send the documents to the Abstractive Topics Engine below. The cell processes a chunk of documents at a time and saves the results to a file at each iteration (just to be extra safe!). While these helpers hide away all the internal complexity of the API calls for the user’s convenience, it’s always a good idea to test the API call on one or two documents to check everything is in shape before triggering a job on a large list of documents. For a test run, simply replace the documents list with a small slice of the document set:

test_documents = documents[:2]

If all is in good shape, you can kick off the document processing:

ITEM_TOPICS = os.path.join(ROOT_DIR, "./examples/data/amazon_products.json")

topics = {}

# Infer topics from Engines
for doc_chunk in chunked(documents, 100):
    topics_results = await infer_model_on_docs(doc_chunk, 
                                               **{"segmented": False})
    topics.update(topics_results)
    print(f"Collected topics for {len(topics)} documents")
    # Save progress after each chunk
    with open(ITEM_TOPICS, "w") as f:
        json.dump(topics, f)

It’s probably time to make a coffee. The Abstractive Topics Model would not get along well with the Macintosh you had as a kid: it’s one of those heavyweight NLP models with over 100 million parameters that need GPUs to run efficiently at scale. But after some minutes of waiting, you’ll be able to inspect the topic labels by product id in the topics dictionary.

Ingest the processed docs into the HierarchicalTagger pipeline

Whether you ran Engines on your own data or used our precomputed dataset, you’ll be able to access the document topic representation like this:


OUT: {'topics': ['T6', 'Hover Board', 'Hover Scooter', 'Off Road'],
 'segmented_topics': [['T6', 'Hover Scooter', 'Hover Board', 'Off Road']],
 'segments': ['Make sure this fits by entering your model number. | FOR ALL RIDERS – The T6 can handle up to 420 lbs., making it the best choice for riders of all shapes and sizes! | ALL TERRAIN - Roll over bumps and inclines up to 30° as you travel through mud, grass, rain, and even gravel. | 12 MILE RANGE - The T6 off road hover board has a 12-mile range, and the capability to reach powered speeds of up to 12 MPH. | 10" RUGGED TIRES - Dual rugged, 10" tubeless tires designed for all terrain exploration. | ROCK WHILE YOU RIDE –The self-balancing hover scooter uses Bluetooth to play music directly from your phone.']}

It’s time to start-up the pipeline and load in our processed topics. The code below creates a HierarchicalTagger instance. It might take some moments the first time you run it, as it will download the SentenceBERT language model.

from hierarchical_tagger.hierarchical_tagger import HierarchicalTagger
hierarchical_tagger = HierarchicalTagger()

Next, we send the documents and their corresponding topic labels for ingest:

document_topics = {document_id: topics_entry['topics'] for document_id, topics_entry in topics.items()}

This step is the most computationally demanding as it involves transforming all topic terms into a vector embedding space using the SentenceBERT language model. This is a fundamental step as it will allow the pipeline to measure semantic distance between terms. To avoid having to repeat this, we can save our HierarchicalTagger instance to a json file, using the .to_json() helper method. This file will also be the input data to our web app, so let’s save it in webapp/data/:

SERIALIZED_INSTANCE_PATH = os.path.join(ROOT_DIR, "./webapp/data/amazon_products.json")
with open(SERIALIZED_INSTANCE_PATH, "w") as f:

If we ever want to load up our instance again at a later date, we can simply run:

with open(SERIALIZED_INSTANCE_PATH, "r") as f:
    reloaded_serialized =  json.load(f)
hierarchical_tagger = HierarchicalTagger.from_dict(reloaded_serialized)

Build the topic tree and tag the documents

Next, we want to look across the corpus to get a big-picture view of the topics it spans and how these relate to each other. In particular, we want to learn the hierarchical relationships between the topics. With that goal, we use agglomerative clustering – a bottom-up hierarchical clustering method – to extract a tree data structure connecting related terms into ever broader groups sharing semantic proximity. 

The simplest way to try this is by calling the .fit_tag_tree() method. This populates the .tree attribute with a treelib object representing the extracted term tree. This can be manipulated and explored with all the treelib methods, for example .show() to print out a text representation of the tree.

├── crafts
│   └── vehicle
│       ├── cars
│       │   ├── 4wd monster truck
│       │   │   └── monster truck
│       │   ├── automotive industry
│       │   │   ├── automotive design
│       │   │   └── ford mustang
│       │   ├── car racing
│       │   ├── cars cars 3
│       │   │   └── toy cars
│       │   ├── hover board
│       │   │   ├── skateboarding
│       │   │   │   ├── chalkboard
│       │   │   │   └── skates

The final step is tagging the original documents based on the hierarchy we found in the tree, and exposing them for search. Once again, a default call to .tag_documents() will do the trick. The results will be in the .document_tags attribute: a dictionary mapping document id to a list of tuples of the form (term, score, node_id) sorted by descending score. score measures how close in meaning the term is to the document. We would expect higher level abstractions to have lower scores. node_id loosely indicates how high the node is in the tree: it’s not a perfect measure, but more abstract terms will generally have higher node ids.

hierarchical_tagger.document_tags # {doc_id : [(tag, score, approximate hierarchy level), ...]}

Here’s how the pipeline performed for the ‘hover board’ item we saw above.


OUT: [('hover scooter', 0.5336780615328567, 1221),
 ('skateboarding', 0.5263492531558106, 1654),
 ('electric scooter', 0.4128446044700507, 1766),
 ('skates', 0.3843635235742917, 1824),
 ('sports', 0.2278761348235242, 1933),
 ('car', 0.17362573115847066, 1981)]

As you can see, the top-scoring tag is spot-on to the specific item description. Beyond that, although not perfect, the pipeline partially succeeds in mapping the item to higher level concepts that, importantly, were not present in the original document representation. It assigns a medium-strength (0.41) tag of ‘electric scooter’, linking the document with other items related to the broader concept of electric mobility. Similarly, the ‘skates’ tag establishes a link with other highly-related sporting equipment. The pipeline also makes a very accurate link to the much broader domain of ‘sports’, with a low score correctly measuring the notable degree of abstraction between the specific item and this high-level concept. Having this sort of tag immediately translates into an improved search experience.

Tuning and human-in-the-loop

Of course, some things are off, with the low-scoring tag of ‘car’ not really fitting in with the hoverboard example. It would be great if the AI ‘just got it’ out-of-the-box, but that’s generally not how it happens. Instead, our aim is for the tool to kick-start, and then significantly enhance, the investigation efforts of the human-in-the-loop. 

To this end, we expose several tuning parameters that the investigator can tweak to guide the extraction of the term tree and the logic applied when tagging the documents. The analyst can also feed their own domain knowledge by suggesting additional terms to be included in the tree. This input could even be estimated from a different corpus; what we might think of as a loose form of transfer learning. 

For example, one could run the pipeline on descriptions of toys from Amazon to extract an initial term tree, then feed that set of terms as suggestions to the pipeline when analyzing a corpus of letters to Santa. 

The point here is that the exploration workflow is likely to be iterative: starting from a set of unknown documents, the investigator can repeatedly run the pipeline, using the pipeline options as levers to steer each iteration toward better results.

Additionally, the tagging from this tool can be used to generate pseudo-labels to train custom models. For example, say you need a classifier to identify sports items in a larger set of product descriptions. Just using a few sports-related tags from the pipeline will immediately give a pseudo-labelled training dataset. Uploading that data into Primer Automate, you could have a trained model in just a few clicks.
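
The shape of .document_tags makes this step easy to sketch. In the toy version below, the tag entries, the sports tag set, and the 0.2 score threshold are all illustrative assumptions, not values from the pipeline:

```python
# Hypothetical output shaped like .document_tags: {doc_id: [(tag, score, node_id), ...]}
document_tags = {
    "doc-1": [("hover scooter", 0.53, 1221), ("sports", 0.23, 1933)],
    "doc-2": [("kitchen mixer", 0.61, 1410), ("appliances", 0.30, 1902)],
}

# Any sports-related tag above a score threshold marks the document as a positive
SPORTS_TAGS = {"sports", "skateboarding", "skates"}
THRESHOLD = 0.2

pseudo_labels = {
    doc_id: int(any(tag in SPORTS_TAGS and score >= THRESHOLD for tag, score, _ in tags))
    for doc_id, tags in document_tags.items()
}
print(pseudo_labels)
```

The resulting {doc_id: 0/1} mapping is a pseudo-labelled dataset ready for a binary classifier.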

Exploring the corpus with the web app

We have created a simple web app to facilitate this iterative exploration. Running the following command will launch the app at the http://localhost:8501/ address.

$ workon ht-repo # Or alternative command to activate your virtual environment
$ streamlit run webapp/

If you saved the HierarchicalTagger instance in the previous step, you will find the amazon_products.json option in the Datasets drop-down in the left sidebar. Use the sidebar to make sure you are in the Tag Tree view, and you will see a sunburst visualization of the topics in the corpus, and how they have been grouped hierarchically. The chart is interactive, so you can click on a node to zoom into its descendants. Using the sliders in the sidebar, you can change the parameters for the fitting of the tree and see the effects in the visualization immediately. Increase the minimum document frequency to prune the smaller leaves in the chart; increase the minimum similarity threshold to push the tree to split branches more easily. 

Once you are satisfied with the structure of the tag tree, switch the view to Document Search to use the tags to search the corpus. After choosing a tag in the dropdown, the page will return the most relevant documents from the corpus and display their raw topic labels. Here too there are some parameters one can tweak to guide how documents are tagged. You can increase the minimum abstraction similarity if you notice that documents are being assigned too generously to tags, especially broad ones. Similarly, if you notice that documents are being assigned to tags that are peripheral to the document focus, try increasing the minimum tag score.

Let your creativity loose!

The initial iteration of this tool came out of our work with Vibrant Data Labs to create a searchable map of companies and organizations working on solutions to the challenges posed by climate change.

As we reach the end of this walk-through, we hope we’ve managed to spark your curiosity to try the pipeline on some other data that is important to you. Indeed, we are releasing the pipeline and example code as we are confident it will be useful across a variety of domains. 

Of course, the pipeline can be improved in many ways. One improvement could be functionality for the user to edit the tag tree after fitting (like moving a branch onto another) or to impose constraints on how the tree can grow (for example, imposing that ‘board games’ should be a sub node of ‘toys’). 

That said, we’ve played around with book blurbs, news documents, academic papers on COVID-19, and documents made public by the CIA and found that, although far from perfect, it can deliver a lot of insight just out-of-the-box. Can you think of another dataset where the tool could help? Sign-up to an Engines free trial here and try it out!

We create the tools behind the decisions that change the world. ©2022 Primer

What do Qassem Soleimani, Mohammed Bin Salman, and Abdel Fattah el-Sisi have in common?

If you have been following news in the Middle East and North African region, you probably guessed correctly. These are the names of highly influential figures in Middle Eastern geopolitics over the last several years.

Another thing they share is that there are numerous transliterations for each name. For example, Qassem Soleimani can also be written as Qasim Soleimany or Qasem Suleimani or Kasem Suleimany, along with several other similar-sounding alternatives. Different spellings that share the same pronunciation are also known as homophones.

Variations among transliteration guides and ad-hoc decisions among content publishers result in multiple spellings for the same individual across a large corpus of text. If you are an analyst, you know this problem all too well. To run a simple search on a person of interest, you will find yourself writing complex boolean expressions like the ones shown below. Sample Query #1 shows some spelling variations you would write if you were running a search on the airstrike that led to the death of Qassem Soleimani. Meanwhile, in Sample Query #2, the search is for Egyptian President Abdel Fattah el-Sisi’s policy in the Sinai Peninsula.

Sample Query #1

("Qasem Soleimani" OR "Qasem Suleimani" OR "Qassem Soleimani" OR "Qassem Suleimani" OR "Qassim Soleimani" OR "Qassim Suleimani") AND ("death" OR "airstrike")

Sample Query #2

("Abdel Fattah el-Sisi" OR "Abd el-Fattah el-Sisi" OR "Abdul Fattah el-Sisi" OR "Abd al-Fattah el-Sisi" OR "Abdel Fattah el-Sisy" OR "Abd el-Fattah el-Sisy" OR "Abdul Fattah el-Sisy" OR "Abd al-Fattah el-Sisy") AND ("Sinai Peninsula" OR "Sinai") AND "policy"

In the two queries above, we can see all the different spellings an analyst would have to manually generate and then write out in a boolean expression to effectively capture the range of spellings for a given name. These booleans can become incredibly complex very quickly if an analyst wants to maximize the recall of their results and capture the full universe of potential spellings for a transliterated name. For example, we found the following spellings for the name Muhammad: Mohammad, Muhammad, Muhammed, Mohamed, Mohamad, Muhamed, Muhamet, Mukhammad, Maxamed, Mamadou, among others.

At Primer, we have built tools into our platform to do this heavy lifting for you. Our industry-leading Named Entity Recognition models can extract mentions of people in news articles and in your custom data sources.

To unburden analysts from the need to generate the universe of potential transliterations when searching for a person, we built a custom feature that generates name spelling variations for names that are transliterated from a language with Arabic script. Using a rule-based algorithm that synthesizes variants of a given name, Primer automatically resolves the alternate spellings to a single person from the span of documents you are searching over.

We created this rule-based approach with the help of a linguistics expert, who helped outline the universe of ways that a name could be transliterated from the Arabic alphabet to Latin script. For example, one rule that we use is substituting the letter “Q” at the beginning of a name with a “K”. Thus, when we encounter the name Qassem, we know that Kassem is a valid alternative. Another rule is the substitution of “ss” in the middle of a name with a single “s”. Again, using the example of Qassem, we end up with Qasem as an additional spelling.

If we use only the two example rules above, and recursively apply them to the name Qassem, we generate Qassem, Qasem, Kassem, and Kasem as valid alternate spellings. Our algorithm employs more than 25 similar rules, which helped us establish a library of potential spellings that our platform searches against when it encounters a name transliterated from Arabic script.
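
Using just the two rules above, a minimal recursive variant generator might look like this. This is a sketch: the rule list, function names, and data structures are illustrative, and the production system applies 25+ rules plus additional safeguards.

```python
import re

# Two illustrative transliteration rules from the article; the real
# system uses more than 25 such rules developed with a linguist.
RULES = [
    (r"^Q", "K"),                # leading "Q" -> "K"  (Qassem -> Kassem)
    (r"(?<=\w)ss(?=\w)", "s"),   # medial "ss" -> "s"  (Qassem -> Qasem)
]

def variants(name):
    """Recursively apply each rule to grow the set of alternate spellings."""
    seen = {name}
    frontier = [name]
    while frontier:
        current = frontier.pop()
        for pattern, repl in RULES:
            alt = re.sub(pattern, repl, current, count=1)
            if alt not in seen:
                seen.add(alt)
                frontier.append(alt)
    return sorted(seen)

print(variants("Qassem"))  # Kasem, Kassem, Qasem, Qassem
```

Because each generated spelling is fed back through the rules, the four variants named in the text fall out of just these two substitutions.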

We also took additional measures to ensure that the generated variants do not contain false positives that could resolve a name that is shared across English and Arabic. For example, the name May (also spelled Mai) is common in Arab countries. We added additional checks to ensure that a substitution from May to Mai isn’t made when a name like Theresa May is detected in a search.

Transliteration Search

With our homophone detection feature, we programmatically generate spelling variants and use them to enhance the results of your queries. That way, you don’t have to worry about manually generating all possible spellings of a name (or dealing with complex wildcard operators), and instead focus on reviewing your analysis and gleaning key insights as quickly as possible.

So how does Primer’s transliteration technology perform?

We set up a test with three former intelligence analysts. Their task was to write Boolean queries to find all the mentions of Qassem Soleimani, Mohammed Bin Salman, and Abdel Fattah el-Sisi within an unclassified dataset. We timed how long it took them to create a Boolean query for each key person.

Here are the results:

Muhammad Bin Salman

                  Time to Complete  Total Documents
Analyst 1         4:30              12,452
Analyst 2         1:30              12,452
Analyst 3         13:30             10,149
Average           6:30              11,684
Primer Algorithm  0:10              12,696

Abdel Fattah el-Sisi

                  Time to Complete  Total Documents
Analyst 1         5:00              2,527
Analyst 2         1:53              75
Analyst 3         8:00              70
Average           4:57              890
Primer Algorithm  0:10              4,286

Qassem Soleimani

                  Time to Complete  Total Documents
Analyst 1         15:00             56,926
Analyst 2         1:34              21,679
Analyst 3         4:00              56,662
Average           6:51              45,089
Primer Algorithm  0:10              57,154

Primer was able to deliver 139% more results than the analysts on average across all queries and was able to reduce the average time to query from 6m 06s to < 10 seconds. If you’re searching for 15 people, that is over an hour of time savings to be gained. Imagine the time lost by a team of analysts running dozens of queries every day.

Arabic Transliteration in Action

Extract from Primer Platform query on Qassem Soleimani, highlighting the different spellings of his name.

Analysts using this feature in Primer can now ensure that they are returning all the relevant results and that they are not missing any critical documents. The automation here will also save the analyst hours on a typical workflow.

If your organization can benefit from extracting information from textual data, we’d love to chat. To learn more about the solutions and features we’ve built with our natural language processing technologies, you can reach out to our team at




Automatic text summarization is one of the most challenging and most valuable commercial applications of natural language processing. Saving the typical business or intelligence analyst even just half an hour per day of unnecessary reading is worth billions of dollars.

Progress has been stalled by a bottleneck: We still rely on human data labelers to evaluate the quality of machine-generated summaries because automatic algorithms aren’t good enough. Human readers are just too slow and costly. If you can’t measure performance on a task—automatically, accurately, and at scale—then you can’t teach machine learning models to do that task.

What’s needed is a machine reader to evaluate the work of the machine writers.

Here at Primer, we’ve created a machine learning system that goes a long way towards breaking the bottleneck. We call it BLANC. It uses a deep learning language model to directly measure how helpful a summary is for understanding the target document.

You can read our research paper about BLANC on arXiv and find our open source implementation on GitHub.

The problem

When a machine writes a summary of a document for you, how do you know if it did a good job? You could read the document yourself and check, but the whole point of a summary is to save you that time.

You can evaluate the machine’s performance using a benchmark data set. The industry standard is a collection of 300k news articles published by CNN and the Daily Mail that include human-written summaries. You can test the machine on this data by applying one of the standard scoring algorithms, BLEU or ROUGE. They score your machine-generated abstracts by measuring the text overlap with the human-written summaries: the more overlapping words and phrases between the generated summary and the reference summaries, the higher the score.
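
A toy unigram-overlap score in the spirit of ROUGE-1 makes the idea concrete. This is a simplified sketch, not the official ROUGE implementation, and the two sentences are invented for the example:

```python
def unigram_overlap_f1(candidate, reference):
    """F1 over shared unigrams, roughly in the spirit of ROUGE-1 (toy version)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    # Clipped counts: each reference occurrence can be matched at most once
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = unigram_overlap_f1(
    "horsemeat found in beef cans",
    "horsemeat was found in cans of beef",
)
print(round(score, 3))
```

Note that the score depends only on word overlap with the reference, which is exactly the limitation discussed next.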

This method is easy to implement, but it has two big problems. The first is that if a summary uses new words that are not in the original document, then it is penalized even if the summary is a masterpiece. Secondly, if the type of summary is different from the CNN/Daily Mail style of summary, then you’ll need to create a diverse set of human-written reference summaries of your own. This costs a lot of time and money to produce.

The more fundamental problem is that algorithms like BLEU and ROUGE judge summaries without even looking at the documents being summarized. This makes them so limited that you’re better off reading the texts and checking the summary quality yourself.

A solution

Ultimately, the goal of summarization is to help a reader understand a document by giving her the gist. So an ideal method for summary evaluation would simulate a human reader trying to understand the target document.

Our solution to this problem at Primer was to create BLANC. We named it as a Francophilic successor to BLEU and ROUGE.

BLANC simulates a human reader with BERT, a language model that was trained on a fill-in-the-blank game on the text of Wikipedia and digitized books. (Or as we call it, a fill-in-the-BLANC game.)

Out of the box, the BLANC method measures how well BERT guesses hidden words in the target document. Then it measures how much of a performance boost it gets from having access to the summary. In another version of BLANC, the performance boost comes from first fine-tuning BERT on the summary.
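
The measurement can be sketched as follows. The "reader" below is a deliberately dumb stand-in for BERT that can only recover a hidden word when it also appears in the provided context; all names, shapes, and the example sentences are illustrative, not Primer's implementation.

```python
def make_toy_reader(doc_tokens):
    """Stand-in for a masked language model: it 'recovers' a hidden token
    only if that token also appears in the auxiliary context."""
    def guess(position, context):
        true_token = doc_tokens[position]
        return true_token if true_token in context else "[UNK]"
    return guess

def masking_accuracy(reader, doc_tokens, context):
    """Fraction of hidden tokens the reader reconstructs correctly.
    (The real BLANC masks tokens in the document and scores BERT's guesses.)"""
    hits = sum(reader(i, context) == tok for i, tok in enumerate(doc_tokens))
    return hits / len(doc_tokens)

def blanc_help(doc_tokens, summary_tokens):
    reader = make_toy_reader(doc_tokens)
    base = masking_accuracy(reader, doc_tokens, context=[])             # no summary
    helped = masking_accuracy(reader, doc_tokens, context=summary_tokens)
    return helped - base  # positive => the summary helped the reader

doc = "horsemeat found in cans of beef".split()
summary = "horsemeat in beef cans".split()
print(blanc_help(doc, summary))
```

The score is the boost in reconstruction accuracy attributable to the summary, which is the core intuition behind the BLANC-help variant.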

Unlike BLEU and ROUGE, BLANC requires no human-written reference summary. The document and summary go in, and the summary quality score comes out. BLANC can be applied to any domain or style of text that the underlying language model encountered in its original training.

How good is BLANC?

Can BLANC really judge a summary by its semantic content? Or is it just a fancier version of ROUGE that gives high marks to summaries that contain keywords or phrases from the document?

We tested the performance of BLANC by gradually corrupting summaries: replacing more and more of the summary text with random words or sentences from the source document. The results, shown in figure 6 of our paper, are remarkable. The less meaning a summary has—even while it retains the vocabulary and turns of phrase of the document—the lower its BLANC score.


BLANC graph


Consider this news document, which you don’t have time to read.

Horsemeat is found in cans of beef: Thousands of tins on sale in discount shops removed from shelves after discovery

Here is a human-written reference summary:

Human-written summary
Horsemeat was found in a canned beef product sold at discount chains.
Food watchdog says the sliced beef in rich gravy was made in Romania.
It was found to contain between one and five per cent horsemeat.

ROUGE: 1.0
BLANC: 0.106

The reference summary has the maximum possible ROUGE score of 1.0, simply because ROUGE assumes that it is perfect. But is it? The summary gets a BLANC score of 0.106. This can be loosely interpreted as the summary “helping” BERT to understand the document 10% better than it would have without it.

Here are two different summaries for the same document:

Summary #1
A tinned beef product in the UK has been withdrawn from sale.
The Food Standards Agency (FSA) found Food Hall Sliced Beef in Rich Gravy to contain 1% to 5% horsemeat DNA.
The FSA findings relate to one batch produced in January 2013.
The tinned beef is sold in Home Bargains and Quality Save stores.

ROUGE: 0.283
BLANC: 0.207

Summary #2
Beef found in canned horsemeat product sold at discount chains.
Romania says sliced beef in rich gravy made in Watchdog.
It was found to contain between one and five horse per tin.

ROUGE: 0.754
BLANC: 0.053

Summary #1 is clearly better than #2 by any reasonable standard. It is more informative than the original summary. It also makes none of the grave factual errors of summary #2. The BLANC score reflects this: Summary #1 is twice as helpful for understanding the document as the original human-written summary, and four times more helpful than summary #2.

Yet the ROUGE scores tell exactly the opposite story. Simply because summary #2 shares the vocabulary and turns of phrase of the human-written reference summary, its ROUGE score is more than twice as high.

Super-human reading and writing

Ultimately, we care how useful a summary is to human readers. To test how well correlated BLANC is with human evaluations, we worked with the data-labeling team at Odetta. (We wholeheartedly recommend Odetta to the machine learning community; they are as much our data science partners as service providers.)

The Odetta team scored a diverse sample of news article summaries on multiple quality dimensions. The scores for the same summaries from BLANC put us right in the middle of that distribution: BLANC is a machine reader that judges the quality of summaries as well as a trained human annotator. (See figure 8 in our paper.)

This first version of BLANC takes a few seconds to judge the quality of a typical 800-word news document summary. It takes a human evaluator several minutes to do the same task. And writing the summary in the first place takes a human 5 to 10 minutes, versus about a second for a machine.

With further refinement, we expect BLANC to achieve superhuman skill at judging the quality of document summaries. That will help us train machine writers which, finally, will spare humans from having to read everything themselves.

You can read more about the implications of our research: Human-free Quality Estimation of Document Summaries.

If you’re doing anything that involves text summarization, we would love to hear from you. Drop us a note.

Here at Primer we are building machines that can read, write, and understand natural language text. We measure our progress by breaking that down into smaller cognitive tasks. Usually our progress is incremental. But sometimes we make a giant leap forward.

On the reading task of Named Entity Recognition (NER) we have now surpassed the best-performing models in the industry by a wide margin, with our model achieving a 95.6% F1 score on CoNLL. This puts us more than two points ahead of a recently published NER model from Facebook AI Research. More importantly, we are now on par with human-level performance. It takes consensus across a team of trained human labelers to reach higher accuracy.


NER performance


Primer’s NER model has surpassed the previous state of the art models of Google and Facebook on F1 accuracy score. Graph adapted from Sebastian Ruder, DeepMind.

NER: What’s in a Name?

Named Entity Recognition (NER) is a foundational task in Natural Language Processing because so many downstream tasks depend on it. The goal of NER is to find all of the named people, places, and things within a text document and correctly classify them.

The gold standard benchmark for NER was laid out in a 2003 academic challenge called CoNLL. The CoNLL data set consists of news articles with all of the named entities hand-labeled by humans. (There is also a German-language CoNLL data set.) This established the four standard NER classes: person (PER), organization (ORG), location (LOC), and miscellaneous (MISC).

The NER labeling task is not as easy as it sounds. Consider this sentence:




After a thoughtful pause, a human reader can deduce that “Paris Hilton” is a person, “the Hilton” is an organization, and “Paris” is a location. (Humans will disagree about 15% of the time whether “the Hilton” should instead be classified as a location.)
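
In CoNLL's BIO scheme, that reading is written down as token-level tags. The helper below is a toy sketch (the tokenized sentence is a paraphrase of the article's example, and the function is ours, not spaCy's) showing how tagged spans are collected back into entities:

```python
# A sentence along the lines of the article's Paris Hilton example,
# hand-tagged in the CoNLL BIO scheme (B- begins a span, I- continues it)
tokens = ["Paris", "Hilton", "stayed", "at", "the", "Hilton", "in", "Paris"]
tags   = ["B-PER", "I-PER", "O", "O", "O", "B-ORG", "O", "B-LOC"]

def extract_entities(tokens, tags):
    """Collect (text, label) spans from BIO tags."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

print(extract_entities(tokens, tags))
```

The hard part of NER is of course producing the tags, not collecting the spans; the same surface form ("Paris", "Hilton") must receive different labels depending on context.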

A popular industry solution for extracting named entities from text is spaCy. Here is the output of spaCy 2.1 NER:


I_heard_that_Paris 1


Not bad. The spaCy model does correctly identify all of the named entity spans. And it correctly identifies the second “Hilton” and second “Paris” as an organization and location, respectively. But Paris Hilton herself is misclassified as an ORG. So spaCy is only getting 66% accuracy on this text. And on our diverse gold-labeled NER data spaCy 2.1 falls well below 50% accuracy.

In order for models to be useful in a commercial setting, they need far better performance. So some new ideas are needed here.

New models are good, but data diversity is king

To create our own NER model we started with a BERT-based architecture and fine-tuned it for NER with the CoNLL training data. By switching to a universal language model like BERT, we immediately left spaCy in the dust, jumping an average 28 points of precision across all entity classes.

However, that higher precision came at a cost in recall. For example, Primer’s BERT-NER model was not confident enough to tag “Paris Hilton” in this sentence:




Pushing our NER model beyond state of the art required two more innovations. First, we switched to a more powerful universal language model: XLNet. But we discovered that even larger performance gains are possible through data engineering.

The CoNLL NER data set is limited to just one type of text document: Reuters news articles published in 1996 and 1997. This is very low data diversity compared to the internet-scale corpus of documents we process at Primer. We needed our NER model to be trained on a far broader range of writing styles, subject matter, and entities. So we curated a highly diverse group of gold-labeled documents, including entities from the financial, defense-related, and scientific worlds.

Injecting this diversity into our training data made all the difference. Even adversarial examples rarely stump Primer’s NER model:




Since the first universal language models like BERT came out one year ago, we’ve seen a revolution in the field of natural language processing. You can see this rapid progression in the graph above. But take note where Primer’s NER model lands. Our performance on CoNLL stands above the best results published by the enormous research teams at Google, Facebook, and the entire academic community. We have made more progress on NER over the past two months than the entire machine learning field has achieved in the past two years.


Named Entity Recognition (NER)


Primer’s NER model is approaching human-level performance. We find that individual humans disagree on consensus NER labels 15% of the time on average, even after training.


Of the four entity classes, person (PER) extraction has the highest performance, with 0.94 precision and 0.95 recall; location (LOC) is second, organization (ORG) third, and miscellaneous (MISC) fourth. These results mirror the performance of our human evaluators against gold standard data, with humans having the lowest inter-annotator agreement on the miscellaneous and organization categories.

Putting NER to work

So what can you do with the world’s best NER model?

Primer powers analyst workflows in some of the largest organizations in the world. Better NER translates to better downstream natural language processing. It powers coreference resolution to correctly attribute every quote by every politician and CEO in the world. You need it for relation extraction to convert unstructured text that describes entities into structured data—facts about people, places, and organizations. And for text classification, for example identifying corporate risks hidden deep inside a company’s financial documents.

To see how NER works on a text document, consider this transcript of Mark Zuckerberg’s congressional testimony. It takes about 5 seconds to process the 50,000+ words with Primer’s NER model and extract 271 people, places, organizations, and even named entities such as Facebook’s Libra cryptocurrency project. (See the output below.)

For a deeper stress test we’ve been running it on document types that it has never seen before. For an extreme test we turned to Harry Potter fan fiction novels. Because if our model deduces that the Noble House of Potter is an organization, Phobos Malfoy is a person, Libere Loqui is a miscellaneous entity, and Snog Row is a location, then extracting the named entities from business documents should be a walk in the park.

So how does it do? We’re glad you asked. We ran the experiment and here are the results. Below is the output from Zuckerberg’s congressional grilling.

NER output from Mark Zuckerberg’s congressional testimony, 23 October 2019


People (PER):
Cheryl Tipton Perlmutter
Vargus Congressman Casten
Martin Clay Mister
Mister Meeks Cleaver Barr
Hawley Phillips Inaudible
Warner Davidson Chan Zuckerberg
Vargas Gonzalez Mark Zuckerberg
Beatty Riggleman Bud
Chair Waters Pressley Axne
David Zuckerman
Luetkemeyer Huizenga
Mr Zuckerberg Kostoff
Emmer Speaker
Scott Foster
Hollingsworth Ms. Waters
Gonzales Presley
Himes Gottheimer
Green Stivers
Louis Brandeis Loudermilk


Organizations (ORG):
FinCEN EU Hamas
Liberty European NCMEC
HUD Facebook ACLU
PayPal congress BFSOC
SEC Newsweek Visa
Alipay Congress Dais
FTC Calibra Rand
FBI Twitter Uber
AML BSA Crypto
Google Lyft ICE
Subcommittee on Oversight and Investigations eBay Nazi
WhatsApp MDIs CFTC
Collibra First FHFA
UN Super LLCs
FINCEN Black Lives Matter Pew Research Center
Dias The New York Times Bookings Holdings
House Fintech Task Force The Washington Post
Ebay Federal Reserve Financial Stability Board
NAACP US Treasury The Department of Justice
US Department of Housing and Urban Development Facebook Libra Cambridge Analytica
G7 Black Mirror Financial Services Committee
MasterCard Congresswoman Hezbollah
Anchorage Trust AI Task Force Senate Intelligence Committee
Trump Hotel Capitol Hill Independent Libra Association
Muslim Advocates Mercado Pago National Fair Housing Alliance
The Capitol Social Network Office of Secretary of Defense
The Daily Caller Trump Hotels Microsoft
regulators The Guardian National Center on Missing and
LIBOR Association New York Times Securities Exchange Commission
Libra Association Wells Fargo Committee on Financial Services
Libra association Congressmen Independent Fact-Checking Network
Labor Association US Congress International Fact-Checking Network
Georgetown European Union Instagram
Congressional United Nations Supreme Court
Trump International Hotel Department of justice Messenger
Federal Reserve Board Federal Housing Agency The Times
WeChat Pay Rainbow Push coalition Independent Association
Georgetown University Department of Justice Federal Trade Commission
Housing Rights Initiative terrorists


Locations (LOC):
California Americas Venezuela
Iowa Michigan Asia
Pacific Arkansas North Korea
Washington U.S. North America
Myanmar Oklahoma Switzerland
Utah America Georgia
Indiana New Jersey Guam
Minnesota Alaska Germany
Washington DC Pennsylvania Florida
Illinois Africa Canada
U S Silicon Valley Cyprus
New York Washington, DC Russia
DC Texas Iran
US Christchurch France
Colorado Syria Ohio
Virginia Connecticut New Zealand
Tennessee South Dakota North Carolina
Missouri Massachusetts Turkey
United States of America China Europe
Kentucky United States Maryland
District Qatar


Miscellaneous (MISC):
The President Colibra XRP
American Nazi Zuck Buck African Americans
AI Americans Indian Muslims
Stump Russian Anti
Sarbanes-Oxley American Dune
Libra Chinese Libra Project
Green New Deal Patriot Act Iranian
Libra White Paper Republicans Future
Russians Venezuelan Democrats
Democratic Hispanics

Is there really one NLP language model to rule them all?

It has become standard practice in the Natural Language Processing (NLP) community. Release a well-optimized English corpus model, and then procedurally apply it to dozens (or even hundreds) of additional foreign languages. These secondary language models are usually trained in a fully unsupervised manner. They’re published a few months after the initial English version on ArXiv, and it all makes a big splash in the tech press.

In August of 2016, for example, Facebook released fastText (1), a speedy tool for word-vector embedding calculations. Over the next nine months Facebook then released nearly 300 auto-generated fastText models for all the languages available on Wikipedia (2). Similarly, Google debuted its syntactic parser, Parsey McParseface (3), in May of 2016, only to release an updated version of the parser trained on 40 different languages later that August (4).

You might wonder whether multilingual NLP is thus a solved problem. But can English-trained models be naively extended to other, non-English languages, or is some native-level understanding of a language required prior to a model update? The answer is particularly pertinent to us here at Primer, given our customers’ focus on understanding and generating text across a range of multilingual corpora.

Let’s begin exploring the problem by considering one simple NLP model type: word2vec-style geometric embeddings (5). Word vectors are useful for training a variety of text classifiers, especially when properly labeled training data is scarce (6). We could, for instance, use Google’s canonical news-trained vector model (7) to train a classifier for English sentiment analysis. To extend that sentiment classifier to other languages, such as Chinese, Russian, or Arabic, we would need an additional set of foreign-language vectors, compiled either in an automated manner or with manual tweaking by a native speaker of that language. For Russian word embeddings, for example, we could choose between Facebook’s automated fastText computations and the Russian-specific RusVectores results (8), which have been calculated, tested, and maintained by two Russian-speaking graduate students. How would those two sets of vectors compare? Let’s find out.

RusVectores offers a multitude of vector models to choose from. For our analysis, let’s select their Russian news model (9), which has been trained on a corpus of nearly 5 billion words drawn from over three years’ worth of Russian news articles. Despite the massive corpus size, the vector file itself is only 130 MB, roughly one-tenth the size of Google’s canonical news-trained word2vec model (7). Part of the discrepancy in size is due to the reduction of all Russian words to their lemmatized equivalents, with a part-of-speech tag appended after an underscore. This strategy is similar to the recently published Sense2Vec (10) technique, in which the varied usages of a word, such as “duck”, “ducks”, “ducked” and “ducking”, are replaced by a single lemma/part-of-speech combination, such as “duck_NOUN” or “duck_VERB”.
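As a toy illustration of this lemma/part-of-speech bookkeeping (the lookup table below is hand-written for the example words, not the output of a real lemmatizer):

```python
# Collapse inflected surface forms into lemma_POS keys, in the style of
# the RusVectores vocabulary. LEMMA_POS is a toy, hand-written table.
LEMMA_POS = {
    'duck': 'duck_NOUN',
    'ducks': 'duck_NOUN',
    'ducked': 'duck_VERB',
    'ducking': 'duck_VERB',
}

def normalize(token):
    """Map a surface form to its lemma_POS key; unknown tokens pass through."""
    return LEMMA_POS.get(token.lower(), token.lower())

print([normalize(t) for t in ['Ducks', 'ducked', 'ducking']])
# ['duck_NOUN', 'duck_VERB', 'duck_VERB']
```

With a vocabulary keyed this way, the four surface forms of “duck” share two vectors (noun and verb) instead of needing four.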


[Image: Russian-Natural-Language-Processing 1]


Simplifying the vocabulary through lemmatization is more than just a trick to reduce dataset size. The lemmatization step is actually critical to the performance of embedded vectors in Russian (11). In order to understand why lemmatization is required, one need only look at the unusual role that suffixes play in Russian grammar.

Russian, like most languages, disambiguates the usage of certain words by changing their endings based on grammatical context. This process is known as inflection, and in English we use it to signal the proper tense of verbs.


[Image: Russian-Natural-Language-Processing 2]


The inflection is how we know that Natasha’s purchase of vodka occurred in the past rather than the present or future. English nouns can also undergo inflection, but only to mark plurality (“one vodka” vs. “many vodkas”). In Russian, however, noun inflection is significantly more prevalent. Russian word endings convey critical noun-related information, such as which nouns are the subjects and which are the objects of a sentence. In English, such grammatical context is expressed through word order alone.


[Image: Russian-Natural-Language-Processing 3]


The meanings of these two sentences are quite different, even though the words in the English sentences are identical. The Russian sentences, on the other hand, rely on inflection rather than word-order to communicate the noun relationships. Russian sentence A is the direct translation of English sentence A, and a simple suffix swap generates the nonsensical Sentence B. The relation between Natasha (Наташа) and the vodka (водка) is signaled by her suffix, not her position in the sentence.

The Russian dependence on suffixes leads to a higher total count of possible Russian words relative to English. For instance, consider the following set of phrases: “I like vodka”, “Give me vodka”, “Drown your sorrows with vodka”, “No more vodka left!”, and “National vodka company”. In English, the word vodka remains unchanged throughout. The Russian equivalents, however, use multiple forms of the word vodka:


[Image: Russian-Natural-Language-Processing 4]


Our use of vodka changes based on context, and instead of a single English vodka to absorb we now have four Russian vodkas that we must deal with! This suffix-based redundancy adds noise to our vector calculations. Vector quality will suffer unless we lemmatize.
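A minimal sketch of the consolidation effect, using made-up counts rather than real corpus statistics:

```python
from collections import Counter

# Toy stream of tokens containing four inflected forms of 'vodka'
# (водка, водку, водки, водкой); the counts here are invented.
tokens = ['водка', 'водку', 'водки', 'водкой', 'водка', 'водку']

surface_counts = Counter(tokens)  # statistics split across four forms

LEMMA = {'водку': 'водка', 'водки': 'водка', 'водкой': 'водка'}
lemma_counts = Counter(LEMMA.get(t, t) for t in tokens)

print(surface_counts)
print(lemma_counts)  # all six occurrences now back a single vocabulary entry
```

Before lemmatization the co-occurrence statistics are diluted across four sparse entries; after it, they pool into one.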

With this in mind, let’s carry out the following experiment: we’ll load the RusVectores model using the Python Gensim library (12) (13) and call the most_similar function on “водка_NOUN” (vodka_NOUN) to get the ten words that are closest, in Russian vector space, to vodka.

We execute the experiment using the following simple Python commands:

from gensim.models import KeyedVectors

# Load the pretrained RusVectores binary in word2vec format.
word_vectors = KeyedVectors.load_word2vec_format('rus_vector_model.bin', binary=True)
for word, similarity in word_vectors.most_similar(u'водка_NOUN', topn=10):
    print(word)

The results (and their translations) read as follows:


[Image: Russian-Natural-Language-Processing 5]


This output is sensible; most of the nearest words pertain to forms of alcohol, at least. Still, as a sanity check, let’s compare the Russian output to the ten nearest vodka-words within Google’s news-trained vector model (7). That ordered output is as follows:


[Image: Russian-Natural-Language-Processing 6]


The five highlighted alcoholic beverages also appear within the Russian results. Thus, we have achieved consistency across the two vector models. A toast to that!

But now let’s repeat the experiment using Facebook’s Russian fastText vectors. The first thing we observe, even prior to loading the model, is that Facebook’s vector file (15) is 3.5 GB in size, more than twice as big as Google’s. The difference in file size makes itself known when we load the model into Gensim: the Facebook Russian vectors take over two minutes to load on a state-of-the-art laptop, while the RusVectores model takes less than 20 seconds.

Why is the Facebook model so incredibly large? The answer becomes obvious as we query the ten nearest words to vodka. In the Russian fastText model they are as follows:


[Image: Russian-Natural-Language-Processing 7]


Eight of them are mere morphological variations of the vodka we poured into fastText. Additionally, certain common Russian-specific Unicode characters (such as the » arrow that closes a Russian quotation) have erroneously been attached to crawled words. The redundancy of Facebook’s Russian vocabulary thus inflates the size of the vector set.

In order to carry out a fair comparison with RusVectores, we’ll need to lemmatize in order to filter out the redundant results. The top ten non-repeating nearest words in the Russian fastText model are:


[Image: Russian-Natural-Language-Processing 8]


Shockingly, half these results correspond to non-alcoholic drinks. How disappointing!
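The lemma-based filtering we applied above can be sketched as follows; the lemma table is a toy stand-in for a real Russian lemmatizer (e.g. MyStem or pymorphy2), and the neighbor list is abbreviated:

```python
# Keep only the highest-ranked neighbor per lemma, discarding the
# morphological duplicates that inflate fastText's result list.
LEMMA = {'водка': 'водка', 'водку': 'водка',
         'пиво': 'пиво', 'пива': 'пиво',
         'водочка': 'водочка'}

def dedupe_by_lemma(ranked_neighbors):
    seen, kept = set(), []
    for word in ranked_neighbors:
        lemma = LEMMA.get(word, word)
        if lemma not in seen:
            seen.add(lemma)
            kept.append(word)
    return kept

print(dedupe_by_lemma(['водку', 'водка', 'пиво', 'пива', 'водочка']))
# ['водку', 'пиво', 'водочка']
```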

It seems that Facebook’s one-size-fits-all approach to model training delivers rather mediocre results on Russian text. But hey, it provides a decent starting place. Other auto-trained word vector models generate outputs that border on the absurd. Take for example Polyglot (16), which offers models trained on Wikipedia dumps in 40 different languages and produces these outputs for vodka’s nearest neighbors:


[Image: Russian-Natural-Language-Processing 9]


Some language models should not be blindly trained on input data without first taking all the nuances of the language into account. So use and train these models in moderation, rather than downing 40 languages all in one go. Such model abuse will lead to a headache for you and all your international customers. Instead, please take things slowly in order to appreciate the beautiful distinctions of each language, like one appreciates a fine and savory wine.



In an earlier post, we looked at the problem and challenges of text summarization. We dove into an intuitive state-of-the-art deep learning solution and generated some sample results. The solution we presented was a sequence-to-sequence algorithm that read text inputs and learned to generate text outputs using recurrent neural networks. This class of algorithms is powerful and broadly applicable to other natural language problems such as machine translation or chatbots.

In this post, we’ll look at the practical steps needed to train a seq-to-seq model to summarize news articles. Beyond the training setup and code implementation, we’ll also look at some ideas to train the model much more quickly (decreasing time from three days to under half a day). When we’re through, we’ll be able to generate summaries like the following:

Waymo files for an injunction against Uber’s use of its self-driving tech

Waymo has taken the next step in its suit against Uber. It alleges that Otto founder Anthony Levandowski took the information while employed at Waymo when it was still Google’s self-driving car project. The suit is based on a very serious charge of intellectual property theft.

North Korea vows ‘thousands-fold’ revenge on US over sanctions

North Korea has vowed to exact “thousands-fold” revenge against the US. UN security council backed new sanctions on Saturday that could slash the regime’s $3bn in annual export revenue by a third. The measures target key revenue earners such as coal, iron, lead and seafood but not oil. The US secretary of state, Rex Tillerson, said on Monday that North Korea should halt missile launches if it wanted to negotiate.

As these seq-to-seq models and associated libraries are relatively new, we hope that sharing our learnings will make it easier to get started with research in this area.

Getting started

Deep learning framework

We’ll first need to choose a library for building our model. With the recent growth of deep learning, we have a number of good ones to choose from, most notably TensorFlow, Caffe, Theano, Keras, and Torch. Each of these supports seq-to-seq models, but we will choose to use TensorFlow here due to its

  • Size of community and adoption
  • Flexibility in expressing custom mathematical relationships
  • Ease-of-use with good feature support (GPU, training visualization, checkpointing)
  • Seq-to-seq code examples

Unfortunately, TensorFlow is also actively changing, not always fully documented, and can have a steeper learning curve, so other options could be appealing if these are important factors.

If you’re new to TensorFlow, check out the main guides. The rest of this post will use code examples, but the high-level concepts and insights can be understood without prior knowledge.


GPU computing is critical for speeding up the training of deep learning models (sometimes on the order of 10x faster than CPUs). We can use GPUs either through a cloud-based solution like AWS or FloydHub, or on our own GPU machine; while cloud options are not the cheapest (or even the fastest), they’re quick to set up. Spinning up a p2.xlarge instance on AWS with a deep learning AMI takes minutes, and TensorFlow comes installed and able to use the GPU.


The next step is to get a data set to train on. The CNN / Daily Mail Q&A dataset is the most commonly used in recent summarization papers; it contains 300,000 news articles from CNN and Daily Mail published between 2007 and 2015, along with human-generated summaries of each article. The articles cover a broad range of common news topics and are on average 800 tokens (words and punctuation characters) long. The summaries are on average 60 tokens (or 3.8 sentences) long. Another data set of similar scale is the New York Times Annotated Corpus, although that has stricter licensing requirements.

We can download the data and use a library such as spaCy to prepare the dataset, in particular to tokenize the articles and summaries. Here’s a sample summary we found:

Media reports say the NSA tapped the phones of about 35 world leaders. Key questions have emerged about what Obama knew, and his response. Leaders in Europe and Latin America demand answers, say they’re outraged.
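spaCy does the real tokenization work here; purely as a rough illustration of what “tokens” means in the statistics above, a naive regex version might look like:

```python
import re

def naive_tokenize(text):
    """Split text into word and punctuation tokens.

    A crude stand-in for spaCy's tokenizer, for illustration only.
    """
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("Leaders in Europe demand answers, say they're outraged."))
```

A real tokenizer also handles contractions, abbreviations, and quotes correctly, which is why we use spaCy in practice.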

Implementing a seq-to-seq model

We’ll demonstrate how to write some key parts of a seq-to-seq model with attention, augmented with the pointer-copier mechanism of See et al., 2017. (Check out our earlier post for an overview of the model.) The model is approximately state-of-the-art in terms of the standard ROUGE metric for summarization. (Note: this metric is fairly naive; it gives no credit to summaries that use words or phrases absent from the gold-standard summary, and it does not necessarily penalize summaries that are grammatically incorrect.)

Batched inputs

The input data that we feed to our model during training are going to be the sequence of tokens in an article and its corresponding summary. Although we start off with token strings, we will want to pass in token IDs to the model. To do this, we define a mapping from the 50K most common token strings in the training data to a unique token ID (and a special “unknown” token ID for the remaining token strings).
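A minimal sketch of that token-to-ID mapping (toy corpus, vocabulary capped at 3 instead of 50K; the names here are our own, not from the paper’s code):

```python
from collections import Counter

def build_vocab(token_lists, max_size):
    """Assign IDs 1..max_size to the most common tokens; 0 is the UNK ID."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    vocab = {'[UNK]': 0}
    for token, _ in counts.most_common(max_size):
        vocab[token] = len(vocab)
    return vocab

corpus = [['the', 'cat', 'sat'], ['the', 'dog', 'sat', 'the']]
vocab = build_vocab(corpus, max_size=3)

def to_ids(tokens):
    return [vocab.get(t, vocab['[UNK]']) for t in tokens]

print(to_ids(['the', 'sat', 'zebra']))  # [1, 2, 0] -- 'zebra' maps to UNK
```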

During training, we want to feed in a batch of examples at once, so that the model can make parameter updates using stochastic gradient descent. We thus define the placeholders to be 2D tensors of token IDs, where rows represent individual examples and columns represent time steps.

# The token IDs of the article, to be fed to the encoder.
article_tokens = tf.placeholder(dtype=tf.int32, shape=[batch_size, max_encoder_steps])
# The number of valid tokens (up to max_encoder_steps) in each article.
article_token_lengths = tf.placeholder(dtype=tf.int32, shape=[batch_size])

# The token IDs of the summary, to be fed to the decoder and used as target outputs.
summary_tokens = tf.placeholder(dtype=tf.int32, shape=[batch_size, max_decoder_steps])
# The number of valid tokens (up to max_decoder_steps) in each summary.
summary_token_lengths = tf.placeholder(dtype=tf.int32, shape=[batch_size])

Note that our sequence-valued placeholders accept fixed-size tensors, where the lengths of the tensors in the sequence dimension are chosen hyperparameters (max_encoder_steps, max_decoder_steps). Sequences shorter than these dimensions are padded with a special padding token.
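The padding and truncation step, along with the companion “valid length” value that feeds article_token_lengths, can be sketched as follows (PAD_ID is an assumed reserved ID, not from the original code):

```python
PAD_ID = 0  # assumed ID of the special padding token

def pad_or_truncate(token_ids, max_steps):
    """Fix a sequence to exactly max_steps IDs and report its valid length."""
    valid_len = min(len(token_ids), max_steps)
    padded = token_ids[:max_steps] + [PAD_ID] * (max_steps - valid_len)
    return padded, valid_len

print(pad_or_truncate([7, 8, 9], max_steps=5))      # ([7, 8, 9, 0, 0], 3)
print(pad_or_truncate([7, 8, 9, 10], max_steps=2))  # ([7, 8], 2)
```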

Defining the model graph

Here’s a start to defining the TensorFlow graph, which transforms the input batch of article tokens to the output word distributions at each time step.

# Embedding
embedding_layer = tf.get_variable(
    name='embedding', shape=[vocab_size, embedding_dim], dtype=tf.float32)
encoder_inputs = tf.nn.embedding_lookup(embedding_layer, article_tokens)
decoder_inputs = tf.nn.embedding_lookup(embedding_layer, summary_tokens)

# Encoder
forward_cell = tf.nn.rnn_cell.LSTMCell(encoder_hidden_dim)
backward_cell = tf.nn.rnn_cell.LSTMCell(encoder_hidden_dim)
encoder_outputs, (final_forward_state, final_backward_state) = (
    tf.nn.bidirectional_dynamic_rnn(
        forward_cell, backward_cell, encoder_inputs, dtype=tf.float32,
        # Tells the method how many valid tokens each input has, so that it
        # ignores the padding tokens. article_token_lengths has shape [batch_size].
        sequence_length=article_token_lengths))

# Decoder with attention
# ...

# Compute output word distribution
# ...

The embedding layer allows us to map each token ID to a vector representation (or word embedding). The token embeddings are fed to our bi-directional LSTM encoder, using the high-level method tf.nn.bidirectional_dynamic_rnn().

There’s some more work needed to complete the model definition; the code from See’s paper is on GitHub, as is this more concise example.

Training operation

One common loss function for seq-to-seq models is the negative average log probability of producing each target summary token. To get our training operation, we need to compute the gradients of the loss. When dealing with sequential models, it’s possible that the gradients can “explode” in size, since they get multiplied at each time step through backpropagation. To account for this, we “clip” or reduce the gradient magnitude to some maximum norm. These clipped gradients can be used to update the variables using an optimization algorithm (we choose Adam here).

    # Define loss: the negative mean log probability of the target tokens
    loss = tf.reduce_mean(-target_log_probs)
    tf.summary.scalar('loss', loss) # save loss for logging
    # Compute gradients
    training_vars = tf.trainable_variables()
    gradients = tf.gradients(loss, training_vars)
    # Clip gradients to a maximum norm
    gradients, global_norm = tf.clip_by_global_norm(gradients, max_grad_norm)
    # Apply gradient update using Adam
    optimizer = tf.train.AdamOptimizer()
    train_op = optimizer.apply_gradients(zip(gradients, training_vars))

Training loop

After defining the model, we run training steps to learn good parameter values, by passing in batched inputs and asking the model to run the training operation. There are challenges with running long training jobs; using the Supervisor class here helps us in two ways:

  • Saves checkpoints of the model during training, so that we can stop and restart jobs
  • Saves summary values like loss to help visualize progress of training

    sv = tf.train.Supervisor(saver=tf.train.Saver(), **kwargs)

    with sv.prepare_or_wait_for_session() as sess:
        while True:
            batch_dict = batcher.next_batch()

            # Run training step and fetch some training info.
            # 'summaries' is a string from the tf.summary module that saves
            # specified fields during training. Useful for visualization.
            _, summaries_value = sess.run([train_op, summaries],
                                          feed_dict=batch_dict)

            # Writes summary info that can be viewed through the TensorBoard UI.
            sv.summary_computed(sess, summaries_value)


Here are some sensible values for key hyperparameters, based on See’s paper and other recent literature.

vocab_size = 50000
embedding_dim = 128
encoder_hidden_dim = 250
decoder_hidden_dim = 400
batch_size = 16
max_encoder_steps = 400
max_decoder_steps = 100

Training speed

With our model now defined, we’re ready to train it. Unfortunately, deep learning methods typically take a long time to train, due to the size of the data and the large number of model parameters.

As See notes in her paper, we can begin training on shorter inputs and labels. While the algorithm will ultimately use the first 400 tokens of each article and the first 100 tokens of each summary for training, the initial training steps do not need to utilize all of that information. By starting off with smaller examples (e.g. 150 token articles and 75 token summaries), we can run iterations that take less time, leading to overall faster convergence.

Here’s a graph from TensorBoard of the training loss as a function of training iterations. Note that we increased the article token size from 150 to 400 at around iteration 55K, when the loss started to flatten out. The change noticeably decreases the loss, but also slows training from 0.85 to 0.6 steps per second. In total, training took nearly two days over roughly 6 epochs (See reports that fully training the model took 3 days).




Progress of training loss




Speed of each training iteration

The long training cycle is not only an inconvenience; training time is a huge bottleneck for experimenting with new ideas as well as uncovering bugs, which in turn is a bottleneck for making research progress.

We’ve discovered some additional tricks that drastically reduce training time. With them, training a new model from scratch takes under half a day, with no practical decrease in summary quality. In some cases, we can even reuse trained models to initialize a new model, after which only a few further hours of training are needed.

Use pretrained word embeddings

Let’s take a look at where a large fraction of our model parameters are, to see if we can optimize how we use them. The word embedding layer, which maps token IDs to vectors, supplies inputs to the encoder and decoder. The size of this layer is vocab_size * embedding_dim, in our case 50K * 128 ≈ 6 million parameters. Wow!

While the embedding layer can be randomly initialized, we can also use pretrained word embeddings (e.g. GloVe) as the initial embedding values. This way we begin training with a sensible representation of individual words and need fewer iterations to learn the final values.

pretrained_embedding = np.load(pretrained_embeddings_filepath)
assert pretrained_embedding.shape == (vocab_size, embedding_dim)
embedding_layer = tf.get_variable(
    name='embedding', shape=[vocab_size, embedding_dim], dtype=tf.float32,
    initializer=tf.constant_initializer(pretrained_embedding))

Reduce vocab size

To simplify the embedding layer even further, we can reduce the vocab size. Looking at the tail end of our original 50K-word vocabulary, we find words like “landscaper” and “lovefilm”, which appear in the training set fewer than a hundred times. It’s probably difficult to learn much about these words when the label for each article is a summary that may or may not use them, so we should be able to reduce the vocab size without much loss of performance. Indeed, reducing the vocab size to 20K seems fine, especially if we replace each out-of-vocab word with a token corresponding to its part-of-speech.
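That replacement step might look like the following sketch; the part-of-speech lookup here is a hypothetical stand-in for a real tagger:

```python
# Map out-of-vocab words to a placeholder token derived from their
# part-of-speech. POS_OF is a toy table; a real pipeline would use a
# POS tagger (e.g. spaCy) instead.
VOCAB = {'the', 'hired', 'a'}
POS_OF = {'landscaper': 'NOUN', 'lovefilm': 'PROPN'}

def replace_rare(tokens):
    return [t if t in VOCAB else '[UNK_%s]' % POS_OF.get(t, 'X')
            for t in tokens]

print(replace_rare(['the', 'landscaper', 'hired', 'lovefilm']))
# ['the', '[UNK_NOUN]', 'hired', '[UNK_PROPN]']
```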

Use tied output projection weights

The other large component of the total trainable parameters is the output projection matrix W_proj, which maps the decoder’s hidden state vector (dimension h) to a distribution over the output words (dimension v).




Using the projection matrix to map hidden state to output distribution

The number of parameters in W_proj is v * h = 50K * 400 ≈ 20 million (!!) in our case. To reduce this number, we can instead express the matrix as




Factorization of the projection matrix

where W_emb is the word embedding matrix with embedding size e. By restricting the projection matrix to this form, we reuse syntactic and semantic information about each token from the embedding matrix and reduce the number of new parameters in W_small_proj to e * h = 128 * 400 ≈ 50K.

w_small_proj = tf.get_variable(
    name='w_small_proj', shape=[embedding_dim, decoder_hidden_dim],
    dtype=tf.float32)
# shape [decoder_hidden_dim, vocab_size]
w_proj = tf.transpose(tf.matmul(embedding_layer, w_small_proj))
b_proj = tf.get_variable(name='b_proj', shape=[vocab_size], dtype=tf.float32)

# shape [batch_size, vocab_size]
output_prob_dist = tf.nn.softmax(tf.matmul(decoder_output, w_proj) + b_proj)

Cumulative impact on training time

Using these tricks (and a few more), we can compare the training of the original and new versions side by side. In the new version, during training we again increase the token size of the training articles from 150 to 400, this time at iteration 8000. Now we reach similar loss scores in under 20K iterations (just over one epoch!), and in total time of around 6 hours.




Comparison of training progress between original and improved versions

Reusing previously trained models

At times, we can also reuse a previously trained model instead of retraining one from scratch. If we are experimenting with certain model modifications, we can initialize the values of the variables to those from earlier iterations. Examples of such modifications include adding a loss to encourage the model to be more abstractive, adding additional variables to utilize previous attention values, or changing the training process to use scheduled sampling. This can save us a ton of time by jumping straight to a trained model instead of starting from randomly initialized parameters.

We can reload previous variables as follows, assuming we have kept the variable names consistent from model to model.

# Define the placeholders, variables, model graph, and training op

# Initialize all parameters of the new model
sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Initialize reusable parameter values from the old model
saver = tf.train.Saver([v for v in tf.global_variables() if is_reusable_variable(v)])
saver.restore(sess, old_model_checkpoint_path)

Evaluation speed

How long does our model take to summarize one article? For deployment, this becomes the key question instead of training time. When generating summaries, we encode 400 article tokens, and decode up to 100 output tokens with a beam search of size 4. Running it on a few different hardware configurations, we find:

Hardware              Time per summary
CPU* (single core)    17.2s
CPU* (4 cores)        8.1s
GPU                   1.5s

* TensorFlow not compiled from source (which may increase runtimes by 50%)

Unsurprisingly, the GPU really helps! On the other hand, needing to run the algorithm at scale using CPUs looks like a very challenging task.

Looking more closely, we see that with one CPU, running the encoder takes about 0.32 seconds and each decoding step (which, in beam search, extends each of beam_size partial summaries by one token) takes about 0.18 seconds. (The remaining time not spent inside a TensorFlow session totals less than a second.) Looking even further at the trace of one decoding step, we find:




Timing breakdown of a decoder step. Note: in this analysis, we have already precomputed the product for the W_proj matrix, so that it is not recomputed at every step.

Interestingly, nearly all of the steps before the final “MatMul” involve computations for the attention distribution over the article words. Since the vast majority of the original 17 seconds is spent doing numerical computations, we would likely need to resort to hardware upgrades (a GPU, more CPUs, or compiling TensorFlow from source) to significantly improve performance.


Having iterated over and analyzed these models, here are some of our thoughts on the state of developing seq-to-seq models today.

First, despite our efforts (or perhaps as justification for our efforts), the time needed to evaluate new ideas is quite long. We tried to speed up training by reducing the size and dimensions of our models and datasets, but that only goes so far. Scaling up the number of computers for a training task can be difficult for both mathematical and engineering reasons. Additionally, to evaluate a new version of the model, we had to manually inspect the results against a common set of evaluation articles, since we did not have a good metric for summarization. One consequence of slow iteration cycles is that there isn’t much flexibility to do extensive hyperparameter searches. It’s great when we can use the literature to guide our choices here.

Testing out new ideas is made even harder by how challenging it is to correctly implement them. Many deep learning concepts are newly added, updated, or deprecated in TensorFlow, resulting in incomplete documentation. It’s very easy to make implementation mistakes with subtle side effects; some errors don’t trigger exceptions, but instead compute values incorrectly and are only discovered when training doesn’t converge properly. We found it useful to manually check inputs, intermediate values, and outputs for most changes we made. In managing our code, we found it important to create flags for every new model variant and keep each commit compatible with older running experiments.

We were impressed that the deep learning model could train so easily. Given just one possible summary for each article, the model could learn millions of parameter weights, randomly initialized, from the embeddings to the encoder to the decoder to attention to learning when to copy. We’re excited to continue to work with seq-to-seq models as state-of-the-art solutions to NLP problems, especially as we improve our understanding of them.


Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. “Get to the Point: Summarization with Pointer-Generator Networks.” CoRR abs/1704.04368.

Too long, didn’t read.

When confronted with pages of long, hard-to-read text, we zone out and skip reading. Think back to the last book-worthy email your coworker sent out or that dense opinion article that your friend shared. These long texts are easy to ignore; in today’s internet-driven world, we want information without exertion.

But sometimes we want to know what a long document is saying. And sometimes, if you’re (un)lucky, you have to. Analysts across many industries need to read stacks of documents (news articles, government cables, law records), organize them mentally, and be able to re-communicate their findings.

Consider the following CNN article from 2011. It would be amazing if we could just read a perfect summary.

(CNN) — How did Amazon manage to build a tablet computer for less than the $199 price tag? According to analysts’ estimates, the company couldn’t. In order to meet the low asking price, Amazon will sell the Kindle Fire at a loss when it debuts on November 15, analysts said. The Internet’s largest retailer is apparently aiming to make up the costs by selling other goods, as it and other retail giants such as Wal-Mart Stores often do. Each Fire costs $209.63 to build, which includes paying for parts and manufacturing, according to technology research firm IHS iSuppli. That amount does not factor in research-and-development investments or marketing, executives for the firm have said. The preliminary estimate …

By algorithmically generating summaries of text documents, we could always enable readers to quickly learn the main ideas.

Building a human-level summarizer would require a foundational improvement in artificial intelligence. Our algorithms need to learn how to both understand a piece of text and produce its own natural language output. Both tasks are hard to achieve today: understanding text requires identifying entities, concepts, and relationships, as well as deciding which ones are important, and generating language requires converting the learned information into grammatical sentences.

Although these are lofty goals in AI, we have seen recent scientific developments start to close the gap between human and machine. Let’s take a look at the approaches to algorithmic summarization today.

Overview of summarization methods

For the past 10+ years, the most effective method for summarizing text has been an algorithm called LexRank. The algorithm pulls out individual sentences from a document, runs them through a mathematical scoring function that estimates how central each sentence is to the document, and selects a few of the highest-scoring sentences as the summary for the document.

LexRank belongs to the class of extractive methods, which use existing sentences from text to form the output summary. Extractive methods can utilize per-sentence features like word counts, sentence position, sentence length, or presence of proper nouns to score sentences. They can also use algorithms like hidden Markov models, integer linear programming, or neural networks to select optimal sets of sentences, sometimes in a supervised setting. While these methods produce safe and reasonable results, they are greatly restricted from expressing high-level ideas that span the whole text.
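As a crude illustration of extractive scoring, here is a toy scorer that rates each sentence by the document-wide frequency of its words; this is far simpler than LexRank’s graph-based centrality, and the example document is invented:

```python
from collections import Counter

def extractive_summary(sentences, k=1):
    """Pick the k sentences whose words are most frequent document-wide."""
    words = [s.lower().split() for s in sentences]
    freq = Counter(w for ws in words for w in ws)
    scores = [sum(freq[w] for w in ws) for ws in words]
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]  # keep original order

doc = ["Amazon sells the Kindle Fire at a loss .",
       "Each Fire costs more to build than its price .",
       "Analysts shared a preliminary estimate ."]
print(extractive_summary(doc, k=1))
```

Real extractive systems layer positional features, length normalization, and redundancy penalties on top of such centrality scores.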

The most promising direction for summarization in the last two years has been deep learning-based sequence-to-sequence systems. These methods are abstractive, meaning they compose new phrases and sentences word-by-word, and they draw inspiration from the human process of summarization:

  1. Read a document and remember key content.
  2. Generate a shorter account of the content word-by-word.

The seq-to-seq systems have one component that reads in an input text and produces its own “representation” or “understanding” of the input, and one component that uses this representation to generate an abstractive summary. Notable papers developing these methods for summarization include those by Rush (2015), Nallapati (2016), See (2017), and Paulus (2017). We’ll now see how to intuitively derive and construct a seq-to-seq model for summarization.

Designing a seq-to-seq model

Deep learning uses neural networks to learn complex relationships. In our case, we need to elegantly handle arbitrary length inputs and outputs, which traditional feed-forward networks can’t easily do. So, we’ll want to use a particular deep learning construct called recurrent neural networks, or RNNs.

An RNN can learn patterns from sequential data by processing inputs one by one. It maintains a hidden state, a vector of real numbers that represents what it has learned so far. At each step i, it uses its previous hidden state s_{i - 1} and the current input x_i to produce the new hidden state s_i. In that way, the state is a function of all previous inputs, and accumulates knowledge about all the inputs it has seen. Additionally, an RNN is able to generate an output o_i using its state s_i:

s_i = F(s_{i - 1}, x_i)

o_i = G(s_i)

The variables x_i, s_i, and o_i are typically vectors. In natural language processing, an RNN reads in words one-by-one in the form of word embeddings x_i. Each word in an English vocabulary has its own embedding, or vector of numbers, that characterizes the word’s syntactic and semantic properties. By receiving a vector representation of each word instead of just the word itself, the RNN can better understand each input word. See this post for more detail about word embeddings.

Additionally, the functions that produce the next state and the output are usually neural network layers, hence the name recurrent neural network. In the following sections, we’ll denote (one-layer) neural networks as

y = f(Wx + b)

where x and y are the input and output, f is an activation function, and W and b are the linear parameters of the layer.
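To make the recurrence concrete, here is a minimal numpy sketch of one RNN step. The dimensions, random parameters, and the specific tanh-of-concatenation form are illustrative choices, not a particular published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, embed_dim = 4, 3

# parameters of one recurrent layer: s_i = tanh(W [s_{i-1}; x_i] + b)
W = rng.normal(size=(hidden_dim, hidden_dim + embed_dim))
b = np.zeros(hidden_dim)

def rnn_step(prev_state, x):
    # combine the previous state with the current word embedding
    return np.tanh(W @ np.concatenate([prev_state, x]) + b)

# run the RNN over a sequence of five (random) word embeddings
state = np.zeros(hidden_dim)
for x in rng.normal(size=(5, embed_dim)):
    state = rnn_step(state, x)
```

After the loop, `state` summarizes everything the RNN has seen, which is exactly the role the encoder's final states play below.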

Building the model

So how can we use these building blocks to solve our problem? We can use one RNN to read in a document (the encoder), and one RNN to generate a summary (the decoder).


The encoder reads the input document one word embedding at a time, producing a sequence of its own hidden state vectors. The RNN learns which “features” from the inputs are useful to keep track of. For instance, it could use one value in its state vector to track that the input is talking about tech gadgets instead of politics, and another to track whether the current input word is important. This corresponds to how a human might read and internalize a piece of text before beginning to formulate an output.




Note that the encoder could come across words that do not belong to its fixed vocabulary (e.g. a person’s name), in which case we feed it the embedding for the special unknown token “[UNK]”.
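A vocabulary lookup with this fallback can be sketched as follows; the helper name and the toy two-dimensional embedding values are made up for illustration.

```python
def lookup_embedding(word, embeddings, unk_token="[UNK]"):
    # out-of-vocabulary words fall back to the shared unknown-token embedding
    return embeddings.get(word, embeddings[unk_token])

# toy two-dimensional embeddings
embeddings = {"[UNK]": [0.0, 0.0], "kindle": [0.3, -0.1]}
```

Any word outside the fixed vocabulary (a person’s name, say) maps to the same “[UNK]” vector, which is why such words can’t be reproduced at generation time.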


We can then hand off the state vectors from the encoder to the decoder, whose job is to produce a coherent sequential output. When coming up with each word, a human might think back to which parts of the input were important enough to refer to again. So, we’ll also allow the decoder to decide which input words to focus on – that is, decide on an attention probability distribution over input words. The attention paid to each word i is a function of the encoder’s state at that word, h_i, and the decoder’s state s:

e_i = v_{attn}^T \tanh (W_h h_i + W_s s + b_{attn})

a = \text{softmax} (e)

In the example, the decoder is ready to say what product Amazon is releasing, and so concentrates on the word “Kindle”.
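The two attention equations translate almost line for line into numpy; the parameter shapes in the test are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def attention(encoder_states, s, v_attn, W_h, W_s, b_attn):
    # e_i = v_attn^T tanh(W_h h_i + W_s s + b_attn);  a = softmax(e)
    e = np.array([v_attn @ np.tanh(W_h @ h + W_s @ s + b_attn)
                  for h in encoder_states])
    return softmax(e)
```

The softmax guarantees the attention levels form a valid probability distribution over input words.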

Now we are ready to generate the next word. The decoder can use its own state s and an attention-weighted average of the encoder’s states, h^*, to identify top candidates for the next output word. In particular, the decoder outputs a probability distribution \text{Pr}_{gen} over its fixed vocabulary of words. From the distribution, we select the next word in the summary, and that word is fed back to the decoder as the input for the next time step.

h^* = \sum_i a_i h_i

\text{Pr}_{gen} = \text{softmax} (V_2 (V_1 [h^*, s] + b_1)^+ + b_2)

In our example, the decoder knows it is ready to output a noun (based on its hidden state) that is similar to “Kindle” (based on the weighted-average encoder’s state). Perhaps “tablet” is a good candidate word in its vocabulary.

This seems like a pretty intuitive yet potentially powerful algorithm! The model can use its own comprehension of the text to flexibly generate words and phrases in any order, including those that don’t appear in the original text. The steps of the algorithm are outlined in the code below.

encoder = RNN(encoder_hidden_dim)
encoder_states = []

# compute encoder states
for word in document:
    encoder_state = encoder.step(embedding(word))
    encoder_states.append(encoder_state)

decoder = RNN(decoder_hidden_dim)
summary = []

# generate one word per loop below
while True:
    # compute attention distribution
    attention_levels = [
        v_attn.T @ tanh(W_h @ encoder_state + W_s @ decoder.state + b_attn)
        for encoder_state in encoder_states
    ]
    # normalize the attention to make it a probability distribution
    attention_levels = softmax(attention_levels)
    # compute weighted-average encoder state
    weighted_encoder_state = sum(a * h for a, h in zip(attention_levels, encoder_states))
    # generate output word
    output_word_probs = softmax(
        V_2 @ relu(V_1 @ concat(decoder.state, weighted_encoder_state) + b_1) + b_2
    )
    output_word = argmax(output_word_probs)

    if output_word == STOP_TOKEN:
        # finished generating the summary
        break
    summary.append(output_word)

    # update decoder with generated word
    decoder.step(embedding(output_word))

Having defined how the model should work, we’ll need to learn useful parameters for the model based on training examples. These examples should contain (likely human-generated) gold-standard summaries of representative documents. In the domain of news articles, there is luckily a publicly available data set from CNN and Daily Mail of about 300,000 pairs of articles and human-generated summaries.

We train the model by showing it examples of articles, and teach it to generate the example summaries. In particular, we can sequentially feed to the decoder a summary word w_t^* and ask it to increase the probability of observing the next summary word w_{t + 1}^*. That would give us the loss function

\text{loss} = - \sum_t \log \text{Pr}_{gen} (w_t^*)
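Assuming the decoder’s per-step distributions are available as plain arrays (an illustrative simplification; the function name is made up), the loss is a short sum of negative log-probabilities:

```python
import math

def summary_loss(step_probs, target_ids):
    # step_probs[t][w] is Pr_gen(w) at step t; target_ids[t] is the
    # vocabulary index of the gold summary word w_t*
    return -sum(math.log(probs[w]) for probs, w in zip(step_probs, target_ids))
```

A perfect prediction (probability 1.0 on every gold word) gives zero loss; hedged probability mass on the gold words gives a positive loss the optimizer can reduce.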

The promise of deep learning is that if we show the model enough examples, it will automatically figure out values for its (millions of) parameters that will produce intelligent results.

for article, summary in dataset:
    model.train(article, summary)

Early results

A few thousand lines of actual TensorFlow code, an AWS p2 instance with a GPU, and several days of training later, we can finally run the model on our previous example article. The output summary is:

Amazon will sell a tablet computer for less than $100 million. The company could make up for the losses with digital sales or physical products. [UNK] [UNK]: the company is likely to lose $50 on each sale of the Fire.

That’s something! It’s impressive that the model learned to speak English and stay on topic about Amazon, but it makes a few mistakes and produces some “unknown” tokens (“[UNK]”).

Improving reliability

While our above work is a great starting point, we need to make sure our results are more reliable.

Looking back at our generated synopsis, we notice that the model had trouble remembering the price of the tablet. That shouldn’t be surprising – after passing information through all the layers of networks, including the encoder, the attention mechanism, and the generation layer, it is really hard to pinpoint which number in the output vocabulary is closest to the correct one. In fact, we can’t actually report the real price of the tablet – the number ($199) is not even in our vocabulary! Similarly, the model replaces “Kindle” with “tablet computer” and the analyst “Gene Munster” with “[UNK]” because it isn’t able to produce those words.

Maybe we should allow our model to copy words from the input into our output, just as a human may revisit an article to recall specific names or numbers. Such a pointer-copier mechanism (See et al., 2017) would allow the model to shamelessly borrow useful phrases from the input, making it easier to preserve correct information (as well as to produce more coherent phrases). The model can stitch together phrases from different sentences, substitute names for pronouns, replace a complicated phrase with a simpler one, or remove unnecessary interjections.

One simple way to allow for copying is to determine a probability p_{copy} of copying from the input at each step. If we decide to copy, we copy an input word w with probability \text{Pr}_{copy}(w) proportional to its attention.

p_{copy} = \sigma (w_{h^*}^T h^* + w_s^T s + b_{copy})

\text{Pr}_{copy} (w) = \sum_i a_i I[w_i = w]

In the example, we might have a high probability of copying and be likely to copy “Kindle” since our attention is highly focused on that word.
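In the pointer-generator network of See et al. (2017), the two distributions are blended with weight p_copy into one final distribution over words. A minimal sketch of that blend (the function name and plain-array inputs are illustrative):

```python
import numpy as np

def final_distribution(p_copy, pr_gen, attention, input_ids):
    # Pr(w) = (1 - p_copy) Pr_gen(w) + p_copy Pr_copy(w), where Pr_copy
    # routes each input word's attention mass to its vocabulary index
    pr = (1.0 - p_copy) * np.asarray(pr_gen, dtype=float)
    for a_i, w_i in zip(attention, input_ids):
        pr[w_i] += p_copy * a_i
    return pr
```

Because both components are probability distributions, the blend still sums to one, and heavily-attended input words gain probability even if the generator alone would rarely produce them.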

Testing out this new model, we get the following summary:

Amazon will sell the Kindle Fire at a loss when it debuts on November 15. The internet ‘s largest retailer is aiming to make up the costs by selling other goods. Each Fire costs $209.63 to build, which includes paying for parts and manufacturing.

Brilliant – we can now reproduce specific facts from the article! The copy mechanism also helps with content selection, since we can more confidently express difficult ideas that we may have avoided earlier.

By reasoning through how humans think, we’ve essentially deduced a framework for the state-of-the-art methods in summarization.

Sample results

Our trained model can summarize news articles,

(CNN) — Inaccurate, inconvenient, ill-conceived … now add “potentially life-threatening” to the list of words being used to describe flaws in Apple’s much maligned maps app. Police in Mildura, Australia are warning drivers to be careful about using Apple Maps to find the city, which the app has placed more than 40 miles (70 kilometers) away in the Outback. Calling it a “potentially life-threatening issue,” police say the mapping system lists Mildura, a city of 30,000 people, as being in the middle of Murray-Sunset National Park. Several motorists have had to be rescued by police from the park, which police say has no water supply and where temperatures can reach a blistering 46 degrees Celsius (114 Fahrenheit). “Some of the motorists located by police have been stranded for up to 24 hours without food or water and have walked long distances through dangerous terrain to get phone reception,” Mildura police said in a statement. “Police have contacted Apple in relation to the issue and hope the matter is rectified promptly to ensure the safety of motorists travelling to Mildura…

Police in Mildura, Australia are warning drivers to be careful about using Apple Maps to find the city. The app has placed it more than 40 miles away in the Outback. Police have contacted Apple in relation to the issue and hope the matter is rectified promptly.

scientific paper abstracts,

The CRISPR (clustered regularly interspaced short palindromic repeat)-Cas9 (CRISPR-associated nuclease 9) system is a versatile tool for genome engineering that uses a guide RNA (gRNA) to target Cas9 to a specific sequence. This simple RNA-guided genome-editing technology has become a revolutionary tool in biology and has many innovative applications in different fields. In this review, we briefly introduce the Cas9-mediated genome-editing method, summarize the recent advances in CRISPR/Cas9 technology, and discuss their implications for plant research. To date, targeted gene knockout using the Cas9/gRNA system has been established in many plant species, and the targeting efficiency and capacity of Cas9 has been improved by optimizing its expression and that of its gRNA. The CRISPR/Cas9 system can also be used for sequence-specific mutagenesis/integration and transcriptional control of target genes. We also discuss…

We briefly introduce the Cas9-mediated genome-editing method, summarize the recent advances in CRISPR/Cas9 technology, and discuss their implications for plant research. The CRISPR/Cas9 system is a versatile tool for genome engineering that uses a guide RNA to target Cas9 to a specific sequence.

and even declassified government cables,

When political and economic storm clouds gather in Latin America, it is ineviakble that there will be talk of a military coup d’etat. El Salvador is no exception to this rule. Since president Romero assumed office on July 1, 1977, we have heard reports and rumors of unrest within the military establishment. some of this has obviously been the chronic grumbling of off-duty soldiers. Some has been attributed to hard-line senior officers supposedly dissatisfeid with president Romero’s moderation and failure to maintain “law and order”. And some has supposedly represented the thinking and dissatisfaction of younger officers dismayed by evidence of cooruption in high places and persuaded of the need for some political and economic structural reform. Sectors of the civilian population becoming aware of this real or reported coup d’etat sentiment react differently depending upon the inherent bias of each and its susceptibility to wish ful thinking…

We have heard reports and rumors of unrest within the military establishment in Latin America. Some of this has been attributed to hard-line senior officers supposedly dissatisfeid with president Romero’s moderation and failure to maintain “law and order”. Although the embassy has been cautious in attaching effective importance to this continuing volume of “coup talk”, we are now coming to the conclusion that more serious attention should be given it.

Interestingly, in the last example, the model copied the incorrectly-spelled word “dissatisfeid”. It must have decided that it was safer to reuse this unknown word rather than generate its own!

It’s also often useful to look at less successful examples:

A top European Union official said Saturday that new sanctions will be proposed within a week against Russia over its actions in Ukraine, but Ukrainian President Petro Poroshenko sounded like he can’t wait that long. “I think we are very close to the point of no return,” Poroshenko said Saturday following an EU summit in Brussels. “The point of no return is full-scale war, which already happened in the territory controlled by separatists and where — instead of separatists — there are regular Russian troops.” Poroshenko said the situation had worsened in the last few days and that thousands of foreign troops and hundreds of foreign tanks are now on Ukrainian territory. Russia has repeatedly denied either supporting the rebels or sending its own troops over the border. But those assertions have been roundly rejected by the West…

New sanctions will be proposed within a week against Russia over its actions in Ukraine. Russia has repeatedly denied either supporting the rebels or sending its own troops over the border. “I want to remind you that Russia is one of the most powerful nuclear Nations,” the President says.

In this summary, we get a quote from an unnamed president. While the model is able to substitute names when it identifies who is being referenced, it did not do so in this case. Another weakness of the model is that it often copies long phrases or even entire sentences from the source, having trouble coming up with its own wording. It turns out that true original language generation is really hard!

In this post, we’ve brainstormed how a good summarization algorithm should work. Stay tuned for our next post, which will discuss the practical steps and challenges behind training a deep seq-to-seq summarizer.


Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. “Get to the Point: Summarization with Pointer-Generator Networks.” CoRR abs/1704.04368.

皇帝 + 女人 – 男人 → 武则天
Emperor (or king) + Woman – Man → Zetian Wu

In natural language processing, words can be thought of as vectors in a low-dimensional space. This lets you manipulate language mathematically and uncover relations that are otherwise difficult to find. A famous example is to add the vector for ‘woman’ to the vector for ‘king’ and then subtract ‘man’, yielding the vector for ‘queen’. Although they have different vocabularies and grammar rules, all human languages are fundamentally similar, so it should be possible to use this vector-based strategy in any of them. In this blog, you will see how to get the Chinese version of king + woman – man → queen.

In the next section, we give an overview of word vectors. Feel free to skip it if you are familiar with the concept. Next, we show how to train Chinese word vectors using Gensim. We then show examples of Chinese word vectors including the Chinese version of king + woman – man → queen. We end with a brief discussion of how to choose Chinese word vectors.

Word Vectors

An easy way to represent words as vectors is to use one-hot encoding. This method uses a one-to-one mapping from each word in the vocabulary to a position in a sparse vector: a word is represented as all 0s, with a 1 at its corresponding location. For example, let’s say you have a corpus with only six words: king, queen, man, woman, smart and intelligent. (Example borrowed from the blog The amazing power of word vectors, with slight modification.) The one-hot representations are:


king        = [1, 0, 0, 0, 0, 0]
queen       = [0, 1, 0, 0, 0, 0]
man         = [0, 0, 1, 0, 0, 0]
woman       = [0, 0, 0, 1, 0, 0]
smart       = [0, 0, 0, 0, 1, 0]
intelligent = [0, 0, 0, 0, 0, 1]


Due to its simplicity, one-hot word representation has been widely adopted in natural language processing (NLP). The main drawback is that it doesn’t take into account any semantic relations between words. For example, you couldn’t tell if ‘smart’ is similar to ‘intelligent’, or how ‘king’ and ‘queen’ might be related.
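You can see the problem directly: distinct one-hot vectors are orthogonal, so any dot-product-based similarity between different words is zero.

```python
vocab = ['king', 'queen', 'man', 'woman', 'smart', 'intelligent']

def one_hot(word):
    # 1 at the word's index in the vocabulary, 0 everywhere else
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# 'smart' looks no more similar to 'intelligent' than to 'king':
# every pair of distinct words has dot product 0
```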

To solve the problem, word embedding (or distributed word representation) encodes the semantic meaning of a word as a low-dimensional vector. Intuitively, each entry in the vector contributes to a certain aspect of the word’s definition. For example, you could represent king, queen, woman, man, smart, and intelligent with 3-dimensional vectors:


Chinese-Word-Vectors 2


(Note: the numbers are chosen for illustration purpose only)

By comparing the vectors (i.e., using cosine similarity), you can determine that ‘smart’ is very similar to ‘intelligent’. You can also determine a relationship between ‘king’ and ‘queen’:

king + woman – man
= [0.9, −0.9, 0.5] + [0.1, 0.9, 0.5] − [0.1, −0.9, 0.5]
= [0.9, 0.9, 0.5]
= queen
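You can check the arithmetic above directly with the illustrative vectors:

```python
king  = [0.9, -0.9, 0.5]
woman = [0.1,  0.9, 0.5]
man   = [0.1, -0.9, 0.5]
queen = [0.9,  0.9, 0.5]

# add 'woman' to 'king', subtract 'man', entry by entry
result = [k + w - m for k, w, m in zip(king, woman, man)]
# result matches queen's vector
```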

By capturing these semantic relationships, word embeddings are highly useful for NLP applications such as translation systems, text generation, and semantic analysis [1].

Word Vector Training Tools

Word vectors sound amazing, but how do you create them? They must be trained with a large-scale text corpus. There are two popular training algorithms: Google’s word2vec [1] and Stanford’s GloVe [2]. Both are unsupervised learning methods that capture words’ semantic properties by considering their co-occurrence with other words. You can download pre-trained English word vectors from the algorithms’ websites: word2vec or GloVe. For other languages, Facebook [5] published word vectors for 294 languages (including Chinese and Russian), trained on Wikipedia.

Although pre-trained word vectors are available, sometimes you may want to train your own. For example, you may have a text corpus in a different language, or you may want to perform some special pre-processing on a corpus in advance.

So, let’s look at some code for word vector training. You can use the free Python library Gensim to train, save, and load word vectors. Let’s use Google’s word2vec as the training algorithm.

from gensim.models import Word2Vec

# Train your model! Set the word vector dimensionality to 300 (size)
# and ignore words that appear fewer than 5 times (min_count).
model = Word2Vec(sentences, size=300, min_count=5)

# Save and load your trained model.'wv')
model = Word2Vec.load('wv')

(See Gensim’s API and this blog for available parameters)

In the above code, the input is a sequence of sentences. Every sentence is represented as a list of words. You can pre-process the sentences any way you want. For example, given the corpus “A queen is a woman. A king is a man.”, you might get:

sentences = [['a', 'queen', 'is', 'a', 'woman', '.'],
             ['a', 'king', 'is', 'a', 'man', '.']]

If you choose to remove stopwords and punctuations, then you get:

sentences = [['queen', 'woman'], ['king', 'man']]

Chinese Word Vectors

How do you train Chinese word vectors? Luckily, you can use the very same algorithm. The challenge lies in dividing a Chinese sentence into smaller, coherent units. A natural approach in English is to work at the level of individual words. How about in Chinese? To illustrate the problem, let’s take a look at a quote from Harry Potter and the Prisoner of Azkaban in Chinese:

即使在最黑暗的日子,幸福也是有迹可循的,只要你记得为自己点亮一盏灯。
(Happiness can be found, even in the darkest of times, if one only remembers to turn on the light.)
(sentence s1)

On the English side, you generate the word list simply by splitting the sentence on spaces. But in written Chinese, there are no spaces between words! And unlike English letters, individual Chinese characters carry meaning of their own, which is crucial for encoding the meaning of the sentence.

Chinese Segmentation

To segment a Chinese sentence into even smaller meaningful units, you can split the text on characters, treating each one as an element in the vocabulary. For example, “即使在最黑暗的日子,” in sentence s1 can be divided into:


即 | 使 | 在 | 最 | 黑 | 暗 | 的 | 日 | 子 | ,


This method is straightforward to implement, but it has a fatal flaw. Chinese characters can take on completely different meanings in different contexts. For example, the character 子 in the word 儿子 means son, while 子 in the word 日子 is a noun suffix with no specific meaning. The character 点 in the word 点亮 means to turn on (the light). But 点 in 一点儿 means a little or a few. Therefore, many of the resulting word vectors would lose their meaning.

A far better strategy for Chinese segmentation is to do what Chinese speakers do: Break the sentence down into the words that have specific semantic meanings. For example, “即使在最黑暗的日子,” in the Harry Potter sentence s1
can be segmented as follows:


即使 | 在 | 最 | 黑暗 | 的 | 日子 | ,


To segment Chinese text into meaningful words, you can use Jieba (结巴分词), an MIT-licensed, easy-to-use Python tool. To split the sentence, do this:

import jieba
sentence = u'即使在最黑暗的日子,幸福也是有迹可循的,只要你记得为自己点亮一盏灯。'
words = jieba.lcut(sentence)

Jieba makes use of a large Chinese word dictionary to segment Chinese text. It also uses HMM (Hidden Markov Model) and the Viterbi algorithm to capture new words. It gets 0.92 and 0.93 f-scores on datasets from Peking University and City University of Hong Kong, respectively. There are other methods for achieving even better performance. To learn more, check out more recent algorithms for Chinese segmentation [6,7].

Chinese Word Embeddings

With different segmentation methods, you get different inputs to the word2vec algorithm, and consequently different Chinese embeddings. Splitting Chinese text into a list of characters yields character embeddings, while semantic segmentation yields word embeddings.

With character embeddings, every Chinese character is encoded as a vector. So if a character has similar meanings in different words, this encoding makes sense. For example, the character 智(wisdom) has similar meanings in the words 智能 (intelligent), 智慧 (wisdom), and 智商 (intelligence quotient). But this is a dangerous assumption. As we discussed earlier, the character 点(point) takes on very different meanings in 点亮 (turn on/ignite) and 一点儿 (a little). Mixing multiple definitions in one vector will cause you pain.

Word embeddings solve this mixed-meaning problem by encoding every word as one vector – unlike a character, a word usually doesn’t have multiple meanings. The downside of word embeddings is that they don’t take advantage of characters that keep the same meaning across words. So is it possible to take advantage of both facets of Chinese script?

It is. To capture characters that are part of many words, you can use char+position embedding – a compromise between word and character embeddings [3,4]. With char+position embeddings, every element in the vocabulary is a character with a position tag. The tag represents the character’s position within a word. For example, 点0 refers to 点 when it appears at the first position of a word, whereas 点1 refers to 点 occurring at the second position of a word. You can get character positions by word segmentation. “即使在最黑暗的日子,” in sentence s1 would then be segmented as:


Chinese-Word-Vectors 5


Char+position embedding assumes that a character has similar meanings when it appears at the same position in words. For example, 点 in both 点亮 and 点燃 means turn on/ignite. It’s not always true, but it helps reduce the word encoding to a much smaller set of character-position combinations.


Recall that to train word vectors, you need sentences as input. Character, word, and char+position embeddings require different methods to generate sentences. Here is a code snippet to segment a sentence for the three embeddings:

import jieba

def get_chinese_characters(sentence):
    """Return list of Chinese characters in sentence. """
    return list(sentence)

def get_chinese_words(sentence):
    """Return list of Chinese words in sentence. """
    return jieba.lcut(sentence)

def get_chinese_character_positions(sentence):
    """Return a list of characters with their positions in the words. """
    return [u'{}{}'.format(char, i)
            for word in get_chinese_words(sentence)
            for i, char in enumerate(word)]

Note that the above code is a very basic implementation for obtaining lists of characters and words. You can do better by performing pre-processing such as grouping alphabetic characters into English words, normalizing text, and removing stopwords, punctuation, and spaces.

Finally, the following code snippet shows how to train different Chinese word embeddings using Gensim.

from argparse import ArgumentParser
from gensim.models import Word2Vec

class SentenceGenerator(object):
    def __init__(self, f, encoding='utf-8', mode='word'):
        self.f = f  # file-like object
        self.encoding = encoding
        self.mode = mode

    def __iter__(self):
        for line in self.f:
            sentence = line.decode(self.encoding)
            if self.mode == 'word':
                # Return a list of words.
                yield get_chinese_words(sentence)
            elif self.mode == 'char':
                # Return a list of characters.
                yield get_chinese_characters(sentence)
            elif self.mode == 'char_position':
                # Return a list of char+positions.
                yield get_chinese_character_positions(sentence)

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('filename')
    parser.add_argument('--encoding', default='utf-8')
    parser.add_argument('--mode', default='word')
    args = parser.parse_args()

    with open(args.filename, 'rb') as f:
        sentences = SentenceGenerator(f, args.encoding, args.mode)
        model = Word2Vec(sentences, size=300)  # Train model.'test_' + args.mode)  # Save model.

Chinese Word Vectors in Action

In early tests of Chinese natural language processing at Primer, we trained these three types of word embeddings on more than 3 million simplified Chinese news articles published in June 2017 (10 GB). Training the three embeddings took about two days (on a Mac with 16GB of memory and a 3.3 GHz Intel Core i7 processor). We set the vector dimension to 300 (the same dimension as Facebook’s Chinese word vectors).

For comparison purposes, download Chinese word vectors published by Facebook — these word vectors use word embeddings and they are trained on Chinese Wikipedia, including both simplified and traditional Chinese.

Load the pre-trained word vectors.

from gensim.models import Word2Vec
from gensim.models import KeyedVectors

# Word vectors
word_model = Word2Vec.load('wv_word')
char_model = Word2Vec.load('wv_char')
pos_model = Word2Vec.load('wv_char_position')

# Facebook's Chinese word vectors.
fb_model = KeyedVectors.load_word2vec_format('wiki.zh.vec')

Let’s look at some examples.

狗 + 小猫 – 猫 → ?
dog + kitten – cat → ?

# The most similar word with vector (狗 + 小猫 - 猫)
word_model.most_similar(positive=[u'狗', u'小猫'], negative=[u'猫'])
fb_model.most_similar(positive=[u'狗', u'小猫'], negative=[u'猫'])

In the function most_similar, the similarity between vector A and vector B is measured by cosine similarity:


\cos(A, B) = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \sqrt{\sum_{i=1}^n B_i^2}}
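The formula corresponds directly to a few lines of Python (sklearn’s cosine_similarity, used below, computes the same quantity):

```python
import math

def cos(a, b):
    # numerator: dot product; denominator: product of the two vector norms
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den
```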

Using the trained word embeddings, you get 小狗 (puppy) as the most similar word. Not bad! Facebook’s fb_model returns 小脸 (little face). 小狗 (puppy) is ranked at the sixth position.

Could you get 小狗 (puppy) with character embeddings? In fact, with character embeddings, 狗 + 小猫 – 猫 (dog + kitten – cat) always equals 小狗 (puppy), because 狗 + 小 + 猫 – 猫 = 小 + 狗 is always true!

As 小狗 (puppy) includes two characters, it’s not in the char+position vocabulary. Instead of finding the most similar words, you could compare the vectors of 狗 + 小猫 – 猫 (dog + kitten – cat) and 小狗 (puppy) using cosine similarity:

from sklearn.metrics.pairwise import cosine_similarity

# cosine similarity between 狗 + 小猫 - 猫 and 小狗
# (sklearn expects 2-D arrays, so reshape each vector to one row)
cosine_similarity((pos_model.wv[u'狗0'] + pos_model.wv[u'小0'] +
                   pos_model.wv[u'猫1'] - pos_model.wv[u'猫0']).reshape(1, -1),
                  (pos_model.wv[u'小0'] + pos_model.wv[u'狗1']).reshape(1, -1))

The cosine similarity is 0.64 — this means 狗 + 小猫 – 猫 (dog + kitten – cat) and 小狗 (puppy) have quite similar vectors.

Ok let’s look at another example:

皇帝 + 女人 – 男人 → ?
emperor (or king) + woman – man → ?

# the most similar word with (皇帝 + 女人 - 男人)
word_model.most_similar(positive=[u'皇帝', u'女人'], negative=[u'男人'])
fb_model.most_similar(positive=[u'皇帝', u'女人'], negative=[u'男人'])

Using Facebook’s word embeddings, the most similar word is 帝 (emperor), which doesn’t look like a good result. Why don’t Facebook’s Chinese word vectors perform as expected? A possible reason is that these word vectors are trained on Chinese Wikipedia, which included only 961,000 articles as of Sep 10, 2017 (compared to 5,475,693 English Wikipedia pages).

Can you guess which word you get using the word_model? Not 皇后(queen) nor 女皇(empress). You get 武则天(Zetian Wu). This is a little surprising but does make sense, considering that when one says ‘empress’ in China, one usually means Zetian Wu, the only empress in Chinese history. Fair enough.

However, the 皇帝 + 女人 – 男人 (emperor + woman – man) vector and the 武则天 (Zetian Wu) vector are not very similar under character embeddings and char+position embeddings. The reason is that the characters 武 (military), 则 (then), and 天 (day or sky) have very different meanings from the whole word 武则天. This often happens when the word is a named entity. Therefore, one way to improve character and char+position embeddings is to not separate named entities into characters [4].

# cosine similarity between 皇帝 + 女人 - 男人 and 武则天
# (again reshaping each 1-D vector to one row for sklearn)
cosine_similarity((char_model.wv[u'皇'] + char_model.wv[u'帝'] +
                   char_model.wv[u'女'] - char_model.wv[u'男']).reshape(1, -1),
                  (char_model.wv[u'武'] + char_model.wv[u'则'] +
                   char_model.wv[u'天']).reshape(1, -1))
# output score is 0.28

cosine_similarity((pos_model.wv[u'皇0'] + pos_model.wv[u'帝1'] +
                   pos_model.wv[u'女0'] - pos_model.wv[u'男0']).reshape(1, -1),
                  (pos_model.wv[u'武0'] + pos_model.wv[u'则1'] +
                   pos_model.wv[u'天2']).reshape(1, -1))
# output score is 0.26

The Bottom Line

What is the best strategy when choosing among one-hot encoding, character embedding, word embedding, and char+position embedding for Chinese NLP applications? When you have a large amount of data for training, one-hot encoding should perform well. However, it is rare that you will be swimming in training data. In the more typical data-sparse scenario, embeddings show their power. For example, if you encounter ‘smart’ but not ‘intelligent’ in training data, you still know how to handle ‘intelligent’.

Word embeddings usually perform best in downstream NLP tasks. But in some cases, char+position embeddings can do better: Peng and Dredze [3] showed that char+position embedding performs best in a Chinese social media named entity recognition task. Character embedding has also proved useful in a Chinese segmentation task [7]. Whichever you choose, pay close attention to the latest developments in multilingual natural language processing. This is a fast-changing field!


[1] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In arXiv preprint arXiv:1301.3781.
[2] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP).
[3] Nanyun Peng and Mark Dredze. 2015. Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings. In Empirical Methods in Natural Language Processing (EMNLP).
[4] Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. 2015. Joint learning of character and word embeddings. In International Joint Conference on Artificial Intelligence (IJCAI).
[5] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. In arXiv preprint arXiv:1607.04606.
[6] Wenzhe Pei, Tao Ge, and Baobao Chang. 2014. Max-Margin Tensor Neural Network for Chinese Word Segmentation. In Annual Meeting of the Association for Computational Linguistics (ACL).
[7] Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015. Long Short-Term Memory Neural Networks for Chinese Word Segmentation. In Empirical Methods in Natural Language Processing (EMNLP).

Further Readings