Their architecture is modular by design, allowing continuous development of our analytic pipeline. These engines let our customers process a diverse set of document types across multiple languages. They extract information, identify key insights, perform analysis at scale, and generate output as human-readable text and graphics.
The first step in knowledge extraction is to identify all of the entities and structural data within a set of documents: the people, places, concepts, numbers, sentiment and quotes. A series of custom classifiers extract and resolve those entities and store them in a knowledge base. We then identify relationships between pairs of entities using unsupervised methodologies. Every piece of data that we capture retains its provenance, giving us full transparency on the decisions made by downstream algorithms.
EXAMPLE
Almost [modifier] 57,000 [number] Model S Vehicles [units]
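The example above can be sketched in Python. This is only an illustration: a regular expression stands in for the custom classifiers, and the `MODIFIERS` list, the `Quantity` record, and the `doc_001` identifier are assumptions made for the sketch, not part of the actual pipeline. The record keeps the source document and character span so that the provenance of the extracted value is retained.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical modifier vocabulary; a real system would rely on trained classifiers.
MODIFIERS = ("almost", "nearly", "about", "at least", "more than", "over")


@dataclass(frozen=True)
class Quantity:
    modifier: Optional[str]
    number: int
    units: str
    doc_id: str            # provenance: which document the value came from
    span: Tuple[int, int]  # provenance: character offsets in that document


_PATTERN = re.compile(
    r"(?:(?P<modifier>" + "|".join(MODIFIERS) + r")\s+)?"
    r"(?P<number>\d{1,3}(?:,\d{3})+|\d+)\s+"
    r"(?P<units>[A-Z][\w ]*?)(?=[.,;]|$)",
    re.IGNORECASE,
)


def extract_quantity(text: str, doc_id: str) -> Optional[Quantity]:
    """Return the first modifier/number/units triple found in text, if any."""
    m = _PATTERN.search(text)
    if m is None:
        return None
    modifier = m.group("modifier")
    return Quantity(
        modifier=modifier.lower() if modifier else None,
        number=int(m.group("number").replace(",", "")),
        units=m.group("units").strip(),
        doc_id=doc_id,
        span=m.span(),
    )


quantity = extract_quantity("Almost 57,000 Model S Vehicles", "doc_001")
```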
We construct models of reality based on streams of millions of documents. By de-duplicating and reconciling statements made by multiple observers, we create an ensemble version of the corpus. For any given event, there can be thousands of varying descriptions, from the people involved to the tiniest details. Taking a multi-document approach allows us to capture this variation as signal rather than noise, and it improves the performance of the structuring engine compared with single-document approaches.
EXAMPLE
doc_367: The tally of Rohingya who fled Myanmar into Bangladesh soared to over 300,000 refugees
doc_612: At least 313,000 Rohingya have flooded into Bangladesh since August 25
doc_149: Al-Hussein said that more than 270,000 Rohingya refugees had fled to Bangladesh
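One simple way to picture reconciliation across documents is to pull the figure out of each statement and summarize the spread. The sketch below uses the three statements from the example; taking the median as the consensus value is an assumption chosen for illustration, not a description of the production method, and the `reconcile` function is hypothetical.

```python
import re
from statistics import median

# Matches comma-grouped figures such as 313,000 (and ignores bare numbers like "25").
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+")

CLAIMS = {
    "doc_367": "The tally of Rohingya who fled Myanmar into Bangladesh soared to over 300,000 refugees",
    "doc_612": "At least 313,000 Rohingya have flooded into Bangladesh since August 25",
    "doc_149": "Al-Hussein said that more than 270,000 Rohingya refugees had fled to Bangladesh",
}


def reconcile(claims):
    """Summarize the figures reported by several documents, keeping each source."""
    values = {}
    for doc_id, text in claims.items():
        match = NUMBER.search(text)
        if match:
            values[doc_id] = int(match.group().replace(",", ""))
    if not values:
        return None
    figures = sorted(values.values())
    return {
        "consensus": median(figures),  # assumption: median as the ensemble value
        "low": figures[0],
        "high": figures[-1],
        "sources": values,             # provenance for every contributing figure
    }


summary = reconcile(CLAIMS)
```

The disagreement between sources is kept (`low`, `high`, `sources`) rather than discarded, which is the sense in which variation is treated as signal.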
This engine looks for evidence of real-world events based on a set of documents. It analyzes a set of structured data extracted from the documents. It is then able to cluster together entity relationships as a function of time. The result is a time-directed graph of inferred real-world events from any given corpus.
EXAMPLE
Apple teams up with China's WeChat to accept payments
Date: August 29, 2017
Geo: Beijing, China
Volume: 64 documents
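Clustering entity relationships as a function of time can be sketched as follows. Everything here is an assumption made for illustration: the `Mention` record, the two-day window, and the rule that mentions of the same entity set within that window belong to one event. The volume of an event is the number of distinct documents supporting it.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta
from typing import FrozenSet


@dataclass(frozen=True)
class Mention:
    doc_id: str
    entities: FrozenSet[str]  # entities linked by one extracted relationship
    day: date


def cluster_events(mentions, window=timedelta(days=2)):
    """Group mentions of the same entity set that occur close together in time."""
    by_entities = defaultdict(list)
    for mention in mentions:
        by_entities[mention.entities].append(mention)

    clusters = []
    for group in by_entities.values():
        group.sort(key=lambda m: m.day)
        current = [group[0]]
        for mention in group[1:]:
            if mention.day - current[-1].day <= window:
                current.append(mention)
            else:
                clusters.append(current)
                current = [mention]
        clusters.append(current)

    # Order events in time so the result reads as a time-directed sequence.
    clusters.sort(key=lambda c: c[0].day)
    return [
        {
            "entities": set(c[0].entities),
            "start": c[0].day,
            "volume": len({m.doc_id for m in c}),
        }
        for c in clusters
    ]
```

For instance, three hypothetical documents mentioning Apple and WeChat on August 29 and 30, 2017 would form one event with a volume of 3, while a fourth document three weeks later would start a new event.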
Information is best understood in context with all the other information around it. The context engine can analyze any claim, fact, or assertion, identify supporting evidence or contradictions, and return these to the user to better contextualize the information. On a larger scale, the context engine allows us to connect events based on an inferred chain of probable causality. This lets us see how a set of events is connected and evolves through time, and identify the origin and spread of information.
EXAMPLE
9 Sep 2017: Google Play removes Iranian apps from its store
25 Aug 2017: Authorities say Apple shuts down Iranian apps
25 Jul 2017: House passes sanctions bill against Russia, Iran, and North Korea
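The timeline above can be reproduced with a deliberately crude rule: link an earlier event to a later one when they share an entity and fall within a maximum gap. This is a stand-in for the inferred chain of probable causality, not the actual inference. The entity sets are hand-assigned for the sketch, and the 40-day gap is an arbitrary assumption.

```python
from datetime import date
from itertools import combinations

# (headline, date, entities). The entity sets are hand-labeled assumptions.
EVENTS = [
    ("Google Play removes Iranian apps from its store",
     date(2017, 9, 9), {"Iran", "Google Play"}),
    ("Authorities say Apple shuts down Iranian apps",
     date(2017, 8, 25), {"Iran", "Apple"}),
    ("House passes sanctions bill against Russia, Iran, and North Korea",
     date(2017, 7, 25), {"Iran", "Russia", "North Korea"}),
]


def link_events(events, max_gap_days=40):
    """Return (earlier, later) headline pairs that share an entity and are close in time."""
    ordered = sorted(events, key=lambda e: e[1])
    edges = []
    for earlier, later in combinations(ordered, 2):
        gap = (later[1] - earlier[1]).days
        if 0 < gap <= max_gap_days and earlier[2] & later[2]:
            edges.append((earlier[0], later[0]))
    return edges


chain = link_events(EVENTS)
```

With these inputs the sanctions bill links to the Apple event (31 days apart) and the Apple event links to the Google Play event (15 days apart), while the 46-day gap between the first and last events exceeds the threshold.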
Differences between sets of information can be meaningful. These differences can be detected at multiple levels of resolution: sentence, document, and corpus. At the sentence level, a change in a single key word in a regulatory filing can be surfaced. At the document level, by diffing on structural data such as entities and factual claims, the engine can detect consensus and contradiction. Applying the diff engine across languages allows us to see events that are being covered by one country and not another.
EXAMPLE
Russian only
Russian & English
Unmanned vehicles will drive on roads without intersections
Medvedev promised to increase subsidies to developers of unmanned vehicles
Yandex Unveils Self-Driving Car Project
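Two of the resolutions described above can be sketched with the Python standard library. `changed_words` surfaces word-level changes between two versions of a sentence using `difflib.SequenceMatcher`, and `coverage_diff` compares two sets of event identifiers, for example events found in Russian-language and English-language coverage. Both function names and the sample inputs are hypothetical, and the sketch assumes events have already been normalized to comparable identifiers.

```python
from difflib import SequenceMatcher


def changed_words(old: str, new: str):
    """List word-level edits between two versions of a sentence."""
    a, b = old.split(), new.split()
    changes = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


def coverage_diff(corpus_a: set, corpus_b: set) -> dict:
    """Split event identifiers into those covered by one corpus, the other, or both."""
    return {
        "only_a": corpus_a - corpus_b,
        "both": corpus_a & corpus_b,
        "only_b": corpus_b - corpus_a,
    }


# Hypothetical filing sentences that differ in a single key word.
edits = changed_words(
    "The company expects revenue to increase",
    "The company expects revenue to decrease",
)
```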
The most efficient means of communicating a complex analysis is through the combination of natural language narrative and graphics: a story. Our engines generate millions of statistical observations about entities and their relationships. We use a Bayesian model of surprise to rank these observations. The story emerges from a massive reduction in the dimensionality of these data and text generation via extractive and abstractive summarization. We are able to handle English, Chinese, and Russian as both input and output, with more languages on the way.
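One common formalization of Bayesian surprise is the Kullback-Leibler divergence between the posterior and the prior: an observation is surprising to the extent that it changes what we believe. The sketch below assumes that formalization and a toy model in which an entity's mention count is Poisson-distributed under one of three candidate rates. The rates, the prior, and the counts are invented for illustration and do not describe the production model.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of observing k events under a Poisson distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def bayesian_surprise(prior, likelihoods) -> float:
    """KL divergence (in nats) from the prior to the posterior over hypotheses."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    evidence = sum(joint)
    posterior = [j / evidence for j in joint]
    return sum(q * math.log(q / p) for q, p in zip(posterior, prior) if q > 0)


# Hypothetical setup: three candidate mention rates and a prior favoring the lowest.
RATES = [2, 10, 50]
PRIOR = [0.8, 0.15, 0.05]


def rank_by_surprise(counts: dict):
    """Order observations from most to least surprising."""
    scored = [
        (name, bayesian_surprise(PRIOR, [poisson_pmf(k, r) for r in RATES]))
        for name, k in counts.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank_by_surprise({"entity_a": 2, "entity_b": 45})
```

A count of 45 moves nearly all belief onto the rate that the prior considered unlikely, so `entity_b` ranks above `entity_a`, whose count matches prior expectations.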