Information Theory

The information_theory component calculates information-theoretic measures derived from the text. These include:

  • {doc/span}._.entropy: the Shannon entropy of the text, using token.prob as the probability of each token. Entropy is defined as \(H(X) = -\sum_{i=1}^n p(x_i) \log_e p(x_i)\), where \(p(x_i)\) is the probability of the token \(x_i\).

  • {doc/span}._.perplexity: the perplexity of the text. Perplexity measures how well a probability distribution or probability model predicts a sample and is defined as \(PPL(X) = e^{H(X)}\), where \(H(X)\) is the entropy of the text.

  • {doc/span}._.per_word_perplexity: the perplexity of the text divided by the number of words. It can be considered a length-normalized perplexity. The sketch below shows how the three quantities relate.
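The following is a minimal sketch of how these quantities relate, computed with NumPy from made-up token probabilities (illustrative values, not spaCy's token.prob output):

import numpy as np

# hypothetical probabilities for a five-token text (illustrative values only)
probs = np.array([0.2, 0.1, 0.4, 0.05, 0.25])

entropy = -np.sum(probs * np.log(probs))       # H(X) = -sum p(x_i) ln p(x_i)
perplexity = np.exp(entropy)                   # PPL(X) = e^{H(X)}
per_word_perplexity = perplexity / len(probs)  # length-normalized perplexity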

These information-theoretic measures are often used to describe the complexity of a text: the higher the entropy, the more complex the text. Similarly, one could filter texts based on their per-word perplexity, under the assumption that highly surprising texts are in fact incoherent.
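As a sketch of such a filter, assuming a pipeline nlp with the component added (see Usage below) and an arbitrary, made-up threshold:

texts = ["This is a very likely sentence.", "zxqv wjfk plmt"]
threshold = 1.0  # assumption: illustrative cutoff, tune for your corpus

# keep only texts that are not overly surprising per word
kept = [
    doc.text
    for doc in nlp.pipe(texts)
    if doc._.per_word_perplexity < threshold
]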

Note

The information theory components require a lexeme probability table from spaCy, which is not available for all languages. If the table cannot be found for the language, a warning will be raised and the values set to np.nan.
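Because missing tables yield np.nan, downstream code may want to guard against it; a minimal sketch:

import math

# entropy is NaN when the lexeme probability table was unavailable
if math.isnan(doc._.entropy):
    ...  # skip, impute, or fall back to another metric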

Usage

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textdescriptives/information_theory")

doc = nlp("This is a simple text")

# extract perplexity
doc._.perplexity

# extract entropy
doc._.entropy

# extract all metrics to a dataframe
td.extract_df(doc)

                              text   entropy  perplexity  per_word_perplexity
0  This is a very likely sentence.  0.288195    1.334017             0.190574
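The metrics are set on spans as well as on the full document, so a slice of the doc can be inspected directly:

# metrics are also available on spans
span = doc[0:3]
span._.entropy
span._.per_word_perplexity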


Component

textdescriptives.components.information_theory.create_information_theory_component(nlp: Language, name: str) → InformationTheory

Allows the InformationTheory component to be added to the spaCy pipeline using the command: nlp.add_pipe('textdescriptives/information_theory')

It also sets the following attributes on the document and span:

  • {Doc/Span}._.entropy: The Shannon entropy of the document.

  • {Doc/Span}._.perplexity: The perplexity of the document.

  • {Doc/Span}._.per_word_perplexity: The per word perplexity of the document.

  • {Doc/Span}._.information_theory: A dictionary with the keys: entropy, perplexity, and per_word_perplexity.

Parameters:
  • nlp (Language) – The spaCy Language object.

  • name (str) – The name of the component.

Example

>>> import spacy
>>> import textdescriptives as td
>>> nlp = spacy.blank('en')
>>> nlp.add_pipe('textdescriptives/information_theory')
>>> doc = nlp('This is a sentence.')
>>> doc._.information_theory
{'entropy': ...
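Since ._.information_theory bundles all three values, they can be read with a single lookup; a minimal sketch using the documented keys:

metrics = doc._.information_theory
entropy = metrics["entropy"]
perplexity = metrics["perplexity"]
per_word_perplexity = metrics["per_word_perplexity"]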