Quick Start#

Use extract_metrics to quickly extract your desired metrics. Available metrics are ["descriptive_stats", "readability", "dependency_distance", "pos_proportions", "coherence", "quality]

Set the spacy_model parameter to specify which spaCy model to use, otherwise, TextDescriptives will auto-download an appropriate one based on lang. If lang is set, spacy_model is not necessary and vice versa.

Specify which metrics to extract in the metrics argument. None extracts all metrics.

import textdescriptives as td

text = "The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it."
# will automatically download the relevant model (´en_core_web_lg´) and extract all metrics
df = td.extract_metrics(text=text, lang="en", metrics=None)

# specify spaCy model and which metrics to extract
df = td.extract_metrics(text=text, spacy_model="en_core_web_lg", metrics=["readability", "coherence"])

Usage with spaCy#

To integrate with other spaCy pipelines, import the library and add the component(s) to your pipeline using the standard spaCy syntax. Available components are descriptive_stats, readability, dependency_distance, pos_proportions, coherence, and quality prefixed with textdescriptives/. If you want to add all the components you can use the shorthand textdescriptives/all.

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/all")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# access some of the values
doc._.readability
doc._.token_length

The calculated metrics can be conveniently extracted to a Pandas DataFrame using the extract_df function or to a dictionary using the extract_dict function.

td.extract_df(doc)
td.extract_dict(doc)

You can control which measures to extract with the metrics argument.

td.extract_df(doc, metrics = ["descriptive_stats", "readability", "dependency_distance", "pos_proportions", "coherence", "quality", "information_theory"])

Note

By default, the extract_X functions adds a column containing the text. You can change this behaviour by setting include_text = False.

extract_df and extract_dict also work on objects created by nlp.pipe. The output will be formatted with 1 row for each document and a column for each metric.

docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])
td.extract_df(docs, metrics = "dependency_distance")

Using Specific Components#

TextDescriptives includes 6 components that can be used individually: descriptive_stats, readability, dependency_distance, pos_proportions, coherence, and quality. This can be helpful if you’re only interested in e.g. readabiltiy metrics or descriptive statistics and don’t to run a dependency parser. If you have imported the TextDesriptives package you can add them to a pipe using the standard spaCy syntax.

nlp = spacy.blank("da")
nlp.add_pipe("textdescriptives/descriptive_stats")
docs = nlp.pipe(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning',
         "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])
# extract_df is clever enough to only extract metrics that are in the Doc
td.extract_df(docs, include_text = False)

Available Attributes#

The table below shows the metrics included in TextDecriptives and the attributes they set on spaCy’s Doc, Span, and Token objects. For more details on each metrics, see the following sections in the documentation.

Attribute	Component	Description
`Doc._.token_length`	descriptive_stats	Dict containing mean, median, and std of token length.
`Doc._.sentence_length`	descriptive_stats	Dict containing mean, median, and std of sentence length.
`Doc._.syllables`	descriptive_stats	Dict containing mean, median, and std of number of syllables per token.
`Doc._.counts`	descriptive_stats	Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the Doc.
`Doc._.readability`	readability	Dict containing Flesch Reading Ease, Flesch-Kincaid Grade, SMOG, Gunning-Fog, Automated Readability Index, Coleman-Liau Index, LIX, and RIX readability metrics for the Doc.
`Doc._.dependency_distance`	dependency_distance	Dict containing the mean and standard deviation of the dependency distance and proportion adjacent dependency relations in the Doc.
`Doc._.pos_proportions`	pos_proportions	Dict containing the proportion of each part-of-speech tag in the Doc.
`Doc._.coherence`	coherence	Dict containing the first and second order coherence scores for the Doc.
`Doc._.quality`	quality	Dict containing the quality scores for the Doc.
`Doc._.passed_quality_check`	quality	Boolean indicator of whether the doc passed the quality check.
`Doc._.information_theory`	information_theory	Dict containing the information theory scores for the Doc.
`Doc._.entropy`	information_theory	The entropy score for the Doc as a float.
`Doc._.perplexity`	information_theory	The perplexity score for the Doc as a float.
`Doc._.per_word_perplexity`	information_theory	The per-word perplexity score for the Doc as a float.
`Span._.token_length`	descriptive_stats	Dict containing mean, median, and std of token length in the span.
`Span._.counts`	descriptive_stats	Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the span.
`Span._.dependency_distance`	dependency_distance	Dict containing the mean dependency distance and proportion adjacent dependency relations in the Doc.
`Span._.quality`	quality	Dict containing the quality scores for the Span.
`Span._.entropy`	information_theory	The entropy score for the Span as a float.
`Span._.perplexity`	information_theory	The perplexity score for the Span as a float.
`Span._.per_word_perplexity`	information_theory	The per-word perplexity score for the Span as a float.
`Token._.dependency_distance`	dependency_distance	Dict containing the dependency distance and whether the head word is adjacent for a Token.