Machine Learning

Finding Everything New in the Harry Potter Universe

Posted by John Bohannon

Applying machine learning at Primer to extract all the people, places, and things named in Harry Potter fan fiction books

We recently surpassed the AI research teams at Google and Facebook for the machine learning task of Named Entity Recognition (NER): Given a document, identify all of the people, places, organizations, and other things with proper names.

To stress-test our new NER model, we ran it on document types that it never saw during training. For an extreme test, we turned to Harry Potter fan fiction novels. If our model can deduce that the Noble House of Potter is an organization, Phobos Malfoy is a person, and Snog Row is a location, then extracting the named entities from Facebook’s 10-K filing or the Mueller Report should be a walk in the park.

With over 700,000 fan-written stories and counting, a large proportion of them book-length, Harry Potter fan fiction is one of the most richly described fantasy worlds of all time. How well does an NER model trained on news articles, financial reports, Wikipedia, and other diverse non-fiction text perform in a world of wizards?

Can we automatically extract all the characters, spells, and beasts dreamt up by Harry Potter fans that are new to that universe?

To build a corpus I turned to Namera Tanjeem's excellent top-50 ranking of Harry Potter fan fiction. It was easy to scrape content from the 42 books on her list that are hosted at fanfiction.net. This yielded 9 million words of text. Then it took 19 minutes on a single V100 GPU to extract and classify every mention of a named entity: all 467,168 of them.
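As a rough sanity check on those numbers, the throughput can be worked out with a few lines of Python. This is only back-of-the-envelope arithmetic: the 9 million word count is a rounded figure, so the results are approximate.

```python
# Figures quoted above; the word count is rounded, so results are approximate.
words = 9_000_000
mentions = 467_168
seconds = 19 * 60  # 19 minutes on a single V100 GPU

words_per_second = words / seconds
mentions_per_second = mentions / seconds
words_per_mention = words / mentions

print(f"{words_per_second:,.0f} words/s")
print(f"{mentions_per_second:,.0f} entity mentions/s")
print(f"one mention every {words_per_mention:.1f} words")
```

That comes to roughly 7,900 words per second, with a named entity turning up about once every 19 words.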

To determine how many of these entities are unique and newly created by fans, I needed a knowledge base of canonical Harry Potter entities from the original novels. Luckily, one already exists. The Harry Potter entity classes differ somewhat from those of traditional NER (they include "characters" and "locations" but also "beasts", "spells", and "potions"), but we can remap them. For example, the 1351 canonical "characters" can be labeled as PER entities. The beasts, spells, potions, and other strange named entities fall into MISC.
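The remapping can be pictured as a simple lookup table. The sketch below is illustrative only: the dictionary name, the function name, and the "locations" to LOC entry are my assumptions, since the text above only spells out PER for characters and MISC for the stranger classes.

```python
# Hypothetical mapping from knowledge-base classes to standard NER labels.
# Only "characters" -> PER and the MISC entries are stated in the post;
# "locations" -> LOC is an assumption made for illustration.
KB_CLASS_TO_NER = {
    "characters": "PER",
    "locations": "LOC",
    "beasts": "MISC",
    "spells": "MISC",
    "potions": "MISC",
}


def remap_label(kb_class: str) -> str:
    """Return the NER label for a knowledge-base class.

    Any class not listed above falls back to MISC.
    """
    return KB_CLASS_TO_NER.get(kb_class.lower(), "MISC")


print(remap_label("characters"))
print(remap_label("spells"))
```

Defaulting unknown classes to MISC keeps the lookup total, so an unexpected class in the knowledge base does not raise an error.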

Once I had the corpus and knowledge base ready, I used Primer's entity resolution and linking algorithms to identify all of the non-canonical entities. This yielded 2923 people, places, organizations, and things that are new to the Harry Potter universe.
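Primer's entity resolution and linking algorithms are not described here, so the following is not that method. It is a deliberately naive stand-in, assuming that a mention counts as canonical whenever its normalized name appears in the knowledge base. The function names and the sample mentions are hypothetical.

```python
import re
from collections import Counter


def normalize(name: str) -> str:
    """Case-fold, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.casefold())
    return " ".join(cleaned.split())


def non_canonical(mentions, canonical_names):
    """Count mentions whose normalized name is absent from the knowledge base.

    mentions: iterable of (surface_text, ner_label) pairs.
    Returns (first_seen_surface_text, ner_label, count) tuples, most frequent first.
    """
    canon = {normalize(name) for name in canonical_names}
    counts = Counter()
    surface = {}
    for text, label in mentions:
        key = (normalize(text), label)
        if key[0] in canon:
            continue  # already known from the original novels
        counts[key] += 1
        surface.setdefault(key, text)
    return [(surface[key], key[1], n) for key, n in counts.most_common()]


# Hypothetical sample input for illustration.
sample_mentions = [
    ("Phobos Malfoy", "PER"),
    ("Harry Potter", "PER"),
    ("phobos malfoy", "PER"),
    ("Snog Row", "LOC"),
]
new_entities = non_canonical(sample_mentions, ["Harry Potter", "Hogwarts"])
print(new_entities)
```

Exact matching on normalized strings misses aliases and partial names, such as a surname used alone, which is the kind of case a real entity resolution system has to handle.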

One book in particular jumped out: Harry Potter and the Methods of Rationality. This single work produced 179 candidate non-canonical PERSON entities. That is how I discovered the prodigious creativity of its author, Eliezer Yudkowsky.

I knew Yudkowsky from his work on the philosophy of artificial intelligence, particularly his AI-Box experiment. I had no idea that he is also one of the most celebrated authors of Harry Potter fan fiction.

Yudkowsky imagines Harry being adopted by a fictional Oxford University biochemist rather than raised by an ignorant Muggle family. While he is at Hogwarts studying wizardry, Harry approaches magical phenomena with the mental tools of rationality. The entities extracted by Primer's models include many real-world scientists such as Roger Bacon, Stanley Milgram, and Richard Feynman.

How fine that the tools of artificial intelligence should be used to discover and explore the imagined world of one of its own practitioners.

Here are Primer's top-10 lists of non-canonical named entities from the most popular Harry Potter fan fiction books:

PEOPLE (2088 total)
Centurion Harry Crow (Harry Crow)
Adam Mckinnon (The Life and Times)
Phobos Malfoy (Exitus Acta Probat)
Belinda Harper (Exitus Acta Probat)
Donna Shacklebolt (The Life and Times)
Estelle Black (Exitus Acta Probat)
Darian Mulciber (Amortentia)
Shelley Mumps (The Life and Times)
Sheep Shit (Mudbloods of the Death Eaters)
Marion Hinsley (Exitus Acta Probat)

LOCATIONS (235 total)
Harvest Lane (The Life and Times)
Crozer Street (Living with Danger)
Milky Way (Harry Potter and the Methods of Rationality)
Crow's Nest (Harry Crow)
Sector Twenty Nine (Midnight)
Solar System (Harry Potter and the Methods of Rationality)
Snog Row (Commentarius)
Blue Room (The Life and Times)
Gaer Penrhôs (Roundabout Destiny)
Greater Whinging (A Marauder's Plan)

ORGANIZATIONS (391 total)
Sunshine Regiment (Harry Potter and the Methods of Rationality)
Ladies Aide Society (Roundabout Destiny)
Noble House of Potter (Harry Potter and the Methods of Rationality)
Justice League (Odd Ideas)
Walpurgis Incorporated (Nightingale)
Cow Party (The Pureblood Pretense)
Treasure Team (A Marauder's Plan)
7th-year Girls (Commentarius)
Crow's Marauders (Harry Crow)
Pythagoras School (Roundabout Destiny)

MISCELLANEOUS (209 total)
Aeternus Lapideus (Exitus Acta Probat)
Olivie Advent (Amortentia)
Pioneer 10 (Harry Potter and the Methods of Rationality)
Rebel North (Amortentia)
Fantasy Suites (Amortentia)
Libere Loqui (Exitus Acta Probat)
Progress in Potions (The Life and Times)
Constans Futuere (Odd Ideas)
Slaggy Boots (Commentarius)
Capillus Adversus (Exitus Acta Probat)