Information Theory#

The information_theory component calculates information-theoretic measures of the text. These include:

  • {doc/span}._.entropy: the Shannon entropy of the text, using spaCy's token.prob to estimate the probability of each token. Entropy is defined as \(H(X) = -\sum_{i=1}^n p(x_i) \log_e p(x_i)\), where \(p(x_i)\) is the probability of the token \(x_i\).

  • {doc/span}._.perplexity: the perplexity of the text. Perplexity measures how well a probability distribution or probability model predicts a sample and is defined as \(PPL(X) = e^{H(X)}\), where \(H(X)\) is the entropy of the text.

  • {doc/span}._.per_word_perplexity: the perplexity of the text divided by the number of words. Can be considered the length-normalized perplexity.

These information-theoretic measures are often used to describe the complexity of a text: the higher the entropy, the more complex the text. Similarly, one could imagine filtering texts based on their per-word perplexity, under the assumption that highly surprising text tends to be incoherent. The sketch below illustrates the arithmetic behind the three measures.
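This is a minimal sketch of the arithmetic, not the library's internal implementation; the log probabilities are invented for illustration and stand in for the values spaCy exposes via token.prob:

import numpy as np

# hypothetical per-token log probabilities (stand-ins for token.prob)
log_probs = np.array([-1.5, -2.0, -0.5, -3.0])
probs = np.exp(log_probs)

# Shannon entropy: H(X) = -sum(p(x_i) * ln p(x_i))
entropy = -np.sum(probs * log_probs)

# perplexity: PPL(X) = e^{H(X)}
perplexity = np.exp(entropy)

# per-word perplexity: perplexity normalized by the token count
per_word_perplexity = perplexity / len(log_probs)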

Note

The information theory component requires an available lexeme probability table from spaCy, which is not available for all languages. If the table cannot be found for the language, a warning is raised and the values are set to np.nan.
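Because the values fall back to np.nan, downstream code may want to check for this before filtering on the measures. A minimal sketch; whether the fallback triggers depends on the language and the installed lookups data:

import numpy as np
import spacy
import textdescriptives as td  # registers the pipeline components with spaCy

nlp = spacy.blank("en")
nlp.add_pipe("textdescriptives/information_theory")
doc = nlp("Some example text.")

# entropy is np.nan if no lexeme probability table was found
if np.isnan(doc._.entropy):
    print("No probability table available; skipping document.")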

Usage#

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textdescriptives/information_theory")

doc = nlp("This is a simple text")

# extract perplexity
doc._.perplexity

# extract entropy
doc._.entropy

# extract all metrics to a dataframe
td.extract_df(doc)

   text                             entropy   perplexity   per_word_perplexity
0  This is a very likely sentence.  0.288195  1.334017     0.190574
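Since the attributes are registered on both Doc and Span (as noted above), the same measures can be read off a slice of the document; continuing the example:

# extract metrics for a span
span = doc[0:3]
span._.entropy
span._.perplexity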


Component#

textdescriptives.components.information_theory.create_information_theory_component(nlp: Language, name: str) → InformationTheory#

Allows the InformationTheory component to be added to the spaCy pipeline using the command: nlp.add_pipe('textdescriptives/information_theory')

It also sets the following attributes on the document and span:

  • {Doc/Span}._.entropy: The Shannon entropy of the document.

  • {Doc/Span}._.perplexity: The perplexity of the document.

  • {Doc/Span}._.per_word_perplexity: The per word perplexity of the document.

  • {Doc/Span}._.information_theory: A dictionary with the keys: entropy, perplexity, and per_word_perplexity.

Parameters:
  • nlp (Language) – The spaCy Language object.

  • name (str) – The name of the component.

Example

>>> import spacy
>>> import textdescriptives as td
>>> nlp = spacy.blank('en')
>>> nlp.add_pipe('textdescriptives/information_theory')
>>> doc = nlp('This is a sentence.')
>>> doc._.information_theory
{'entropy': ...
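
The dictionary mirrors the individual extensions, so all three measures can be fetched in one lookup; continuing the doctest above, its keys are:

>>> sorted(doc._.information_theory.keys())
['entropy', 'per_word_perplexity', 'perplexity']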