Information Theory#
The information_theory component calculates information-theoretic measures derived from the text. These include:
{doc/span}._.entropy: the Shannon entropy of the text, using token.prob as the probability of each token. Entropy is defined as \(H(X) = -\sum_{i=1}^n p(x_i) \log_e p(x_i)\), where \(p(x_i)\) is the probability of the token \(x_i\).
{doc/span}._.perplexity: the perplexity of the text. Perplexity is a measure of how well a probability distribution or probability model predicts a sample. It is defined as \(PPL(X) = e^{H(X)}\), where \(H(X)\) is the entropy of the text.
{doc/span}._.per_word_perplexity: the perplexity of the text divided by the number of words. It can be considered the length-normalized perplexity.
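The three formulas above can be sketched directly in plain Python. This is a minimal illustration with made-up token probabilities; in the component itself the probabilities are derived from spaCy's token.prob attribute:

```python
import math

# Made-up token probabilities for illustration only; the pipeline
# derives real values from spaCy's lexeme probability table.
probs = [0.4, 0.3, 0.2, 0.1]

# H(X) = -sum_i p(x_i) * ln p(x_i)
entropy = -sum(p * math.log(p) for p in probs)

# PPL(X) = e^{H(X)}
perplexity = math.exp(entropy)

# Length-normalized: perplexity divided by the number of tokens
per_word_perplexity = perplexity / len(probs)
```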
These information-theoretic measures are often used, for example, to describe the complexity of a text: the higher the entropy, the more complex the text. Similarly, one could imagine filtering text based on per-word perplexity, under the assumption that highly surprising text is in fact non-coherent.
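The filtering idea can be sketched as follows. The texts, scores, and threshold here are all hypothetical; real scores would be collected from doc._.per_word_perplexity:

```python
# Hypothetical per-word perplexity scores; in practice these would be
# gathered as [doc._.per_word_perplexity for doc in nlp.pipe(texts)].
scores = {
    "This is a coherent sentence.": 0.19,
    "fj qpz xvw lk rrr": 3.7,
}

THRESHOLD = 1.0  # assumed cutoff; tune on your own corpus

# Keep only texts whose per-word perplexity falls below the threshold.
kept = [text for text, ppl in scores.items() if ppl < THRESHOLD]
```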
Note
The information theory component requires a lexeme probability table from spaCy, which is not available for all languages. If the table cannot be found for the language, a warning is raised and the values are set to np.nan.
Usage#
import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textdescriptives/information_theory")
doc = nlp("This is a simple text")
# extract perplexity
doc._.perplexity
# extract entropy
doc._.entropy
# extract all metrics to a dataframe
td.extract_df(doc)
| | text | entropy | perplexity | per_word_perplexity |
|---|---|---|---|---|
| 0 | This is a very likely sentence. | 0.288195 | 1.334017 | 0.190574 |
Component#
- textdescriptives.components.information_theory.create_information_theory_component(nlp: Language, name: str) → InformationTheory [source]#
Allows the InformationTheory component to be added to the spaCy pipeline using the command: nlp.add_pipe("textdescriptives/information_theory")
It also sets the following attributes on the document and span:
{Doc/Span}._.entropy: The Shannon entropy of the document.
{Doc/Span}._.perplexity: The perplexity of the document.
{Doc/Span}._.per_word_perplexity: The per word perplexity of the document.
- {Doc/Span}._.information_theory: A dictionary with the keys: entropy, perplexity, and per_word_perplexity.
- Parameters:
nlp (Language) – The spaCy Language object.
name (str) – The name of the component.
Example
>>> import spacy
>>> import textdescriptives as td
>>> nlp = spacy.blank('en')
>>> nlp.add_pipe('textdescriptives/information_theory')
>>> doc = nlp('This is a sentence.')
>>> doc._.information_theory
{'entropy': ...