Information Theory#
The information_theory component calculates information theoretic measures derived from the text. These include:
{doc/span}._.entropy: the Shannon entropy of the text, using token.prob as the probability of each token. Entropy is defined as \(H(X) = -\sum_{i=1}^n p(x_i) \log_e p(x_i)\), where \(p(x_i)\) is the probability of the token \(x_i\).
{doc/span}._.perplexity: the perplexity of the text. Perplexity measures how well a probability distribution or probability model predicts a sample. It is defined as \(PPL(X) = e^{H(X)}\), where \(H(X)\) is the entropy of the text.
{doc/span}._.per_word_perplexity: the perplexity of the text divided by the number of words. It can be considered a length-normalized perplexity.
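The three definitions above can be sketched in plain Python. This is a minimal illustration of the stated formulas, assuming natural-log probabilities as input (spaCy's token.prob is a log probability); it is not the library's actual implementation.

```python
import math

def entropy(logprobs):
    # H(X) = -sum_i p(x_i) * ln p(x_i), recovering p(x_i) from the
    # natural-log probabilities via exp().
    return -sum(math.exp(lp) * lp for lp in logprobs)

def perplexity(logprobs):
    # PPL(X) = e^{H(X)}
    return math.exp(entropy(logprobs))

def per_word_perplexity(logprobs):
    # Perplexity divided by the number of tokens.
    return perplexity(logprobs) / len(logprobs)

# Sanity check: a uniform distribution over 4 tokens has
# H = ln(4) ≈ 1.386 and perplexity 4.
lps = [math.log(0.25)] * 4
print(entropy(lps), perplexity(lps), per_word_perplexity(lps))
```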
These information-theoretic measures are often used to describe the complexity of a text: the higher the entropy, the more complex the text. Similarly, one could filter texts based on their per-word perplexity, under the assumption that highly surprising text is in fact non-coherent.
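As a sketch of that filtering idea: given texts with precomputed per-word perplexity scores (e.g. read off doc._.per_word_perplexity), keep only those below a threshold. The scores and cutoff below are made up purely for illustration.

```python
# Hypothetical cutoff; in practice this would be tuned on your corpus.
THRESHOLD = 0.5

# (text, per_word_perplexity) pairs with illustrative, made-up scores.
scored_texts = [
    ("This is a very likely sentence.", 0.19),  # plausible English
    ("zxqv blorp gnarf wibble", 0.93),          # gibberish, highly surprising
]

# Keep only texts whose per-word perplexity is below the threshold.
coherent = [text for text, ppl in scored_texts if ppl < THRESHOLD]
print(coherent)
```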
Note
The information theory components require an available lexeme prop table from spaCy which is not available for all languages. A warning will be raised and values set to np.nan if the table cannot be found for the language.
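Because the values fall back to np.nan for unsupported languages, downstream code may want to guard against NaN explicitly. A small hedged sketch, using a hypothetical helper name:

```python
import math

def safe_metric(value, default=None):
    # NaN compares unequal to itself, so use math.isnan to detect the
    # missing-prob-table fallback and return a default instead.
    return default if isinstance(value, float) and math.isnan(value) else value

print(safe_metric(float("nan")))  # missing table -> default (None)
print(safe_metric(1.334017))      # real metric passes through
```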
Usage#
import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textdescriptives/information_theory")
doc = nlp("This is a very likely sentence.")
# extract perplexity
doc._.perplexity
# extract entropy
doc._.entropy
# extract all metrics to a dataframe
td.extract_df(doc)
|   | text | entropy | perplexity | per_word_perplexity |
|---|---|---|---|---|
| 0 | This is a very likely sentence. | 0.288195 | 1.334017 | 0.190574 |
Component#
- textdescriptives.components.information_theory.create_information_theory_component(nlp: Language, name: str) → InformationTheory [source]#
Allows the InformationTheory component to be added to the spaCy pipeline using: nlp.add_pipe("textdescriptives/information_theory")
It also sets the following attributes on the document and span:
- {Doc/Span}._.entropy: The Shannon entropy of the document.
- {Doc/Span}._.perplexity: The perplexity of the document.
- {Doc/Span}._.per_word_perplexity: The per-word perplexity of the document.
- {Doc/Span}._.information_theory: A dictionary with the keys: entropy, perplexity, and per_word_perplexity.
- Parameters:
nlp (Language) – The spaCy Language object.
name (str) – The name of the component.
Example
>>> import spacy
>>> import textdescriptives as td
>>> nlp = spacy.blank('en')
>>> nlp.add_pipe('textdescriptives/information_theory')
>>> doc = nlp('This is a sentence.')
>>> doc._.information_theory
{'entropy': ...