Information Theory

The information_theory component calculates information-theoretic measures derived from the text. These include:

  • {doc/span}._.entropy: the Shannon entropy of the text, using token.prob as the probability of each token. Entropy is defined as \(H(X) = -\sum_{i=1}^n p(x_i) \log_e p(x_i)\), where \(p(x_i)\) is the probability of the token \(x_i\).

  • {doc/span}._.perplexity: the perplexity of the text. Perplexity measures how well a probability distribution or probability model predicts a sample and is defined as \(PPL(X) = e^{H(X)}\), where \(H(X)\) is the entropy of the text.

  • {doc/span}._.per_word_perplexity: the perplexity of the text divided by the number of words. It can be considered a length-normalized perplexity. The sketch below shows how the three quantities relate.
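The following is a minimal sketch of how these quantities relate, computed with NumPy from made-up token probabilities (illustrative values, not spaCy's token.prob output):

import numpy as np

# hypothetical probabilities for a five-token text (illustrative values only)
probs = np.array([0.2, 0.1, 0.4, 0.05, 0.25])

entropy = -np.sum(probs * np.log(probs))       # H(X) = -sum p(x_i) ln p(x_i)
perplexity = np.exp(entropy)                   # PPL(X) = e^{H(X)}
per_word_perplexity = perplexity / len(probs)  # length-normalized perplexity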

These information-theoretic measures are often used to describe the complexity of a text: the higher the entropy, the more complex the text. Similarly, one could filter texts based on their per-word perplexity, under the assumption that highly surprising texts are in fact incoherent.
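As a sketch of such a filter, assuming a pipeline nlp with the component added (see Usage below) and an arbitrary, made-up threshold:

texts = ["This is a very likely sentence.", "zxqv wjfk plmt"]
threshold = 1.0  # assumption: illustrative cutoff, tune for your corpus

# keep only texts that are not overly surprising per word
kept = [
    doc.text
    for doc in nlp.pipe(texts)
    if doc._.per_word_perplexity < threshold
]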

Note

The information theory components require a lexeme probability table from spaCy, which is not available for all languages. If the table cannot be found for the language, a warning will be raised and the values set to np.nan.
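Because missing tables yield np.nan, downstream code may want to guard against it; a minimal sketch:

import math

# entropy is NaN when the lexeme probability table was unavailable
if math.isnan(doc._.entropy):
    ...  # skip, impute, or fall back to another metric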

Usage

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textdescriptives/information_theory")

doc = nlp("This is a simple text")

# extract perplexity
doc._.perplexity

# extract entropy
doc._.entropy

# extract all metrics to a dataframe
td.extract_df(doc)

                              text   entropy  perplexity  per_word_perplexity
0  This is a very likely sentence.  0.288195    1.334017             0.190574
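The metrics are set on spans as well as on the full document, so a slice of the doc can be inspected directly:

# metrics are also available on spans
span = doc[0:3]
span._.entropy
span._.per_word_perplexity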


Component

textdescriptives.components.information_theory.create_information_theory_component(nlp: Language, name: str) → InformationTheory

Allows the InformationTheory component to be added to the spaCy pipeline using the command: nlp.add_pipe('textdescriptives/information_theory')

It also sets the following attributes on the document and span:

  • {Doc/Span}._.entropy: The Shannon entropy of the document.

  • {Doc/Span}._.perplexity: The perplexity of the document.

  • {Doc/Span}._.per_word_perplexity: The per word perplexity of the document.

  • {Doc/Span}._.information_theory: A dictionary with the keys: entropy, perplexity, and per_word_perplexity.

Parameters:
  • nlp (Language) – The spaCy Language object.

  • name (str) – The name of the component.

Example

>>> import spacy
>>> import textdescriptives as td
>>> nlp = spacy.blank('en')
>>> nlp.add_pipe('textdescriptives/information_theory')
>>> doc = nlp('This is a sentence.')
>>> doc._.information_theory
{'entropy': ...
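Since ._.information_theory bundles all three values, they can be read with a single lookup; a minimal sketch using the documented keys:

metrics = doc._.information_theory
entropy = metrics["entropy"]
perplexity = metrics["perplexity"]
per_word_perplexity = metrics["per_word_perplexity"]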