Quick Start
Use extract_metrics to quickly extract your desired metrics. Available metrics are ["descriptive_stats", "readability", "dependency_distance", "pos_proportions", "coherence", "quality"].
Set the spacy_model parameter to specify which spaCy model to use; otherwise, TextDescriptives will auto-download an appropriate one based on lang. If lang is set, spacy_model is not necessary, and vice versa.
Specify which metrics to extract in the metrics argument. Passing None extracts all metrics.
import textdescriptives as td
text = "The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it."
# will automatically download the relevant model (en_core_web_lg) and extract all metrics
df = td.extract_metrics(text=text, lang="en", metrics=None)
# specify spaCy model and which metrics to extract
df = td.extract_metrics(text=text, spacy_model="en_core_web_lg", metrics=["readability", "coherence"])
Usage with spaCy
To integrate with other spaCy pipelines, import the library and add the component(s) to your pipeline using the standard spaCy syntax. Available components are descriptive_stats, readability, dependency_distance, pos_proportions, coherence, and quality, each prefixed with textdescriptives/.
If you want to add all the components, you can use the shorthand textdescriptives/all.
import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/all")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
# access some of the values
doc._.readability
doc._.token_length
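When only a subset of components is needed, the pipe names can be built from the component names listed above. The following is a minimal sketch in plain Python; the helper to_factory_names and the constant AVAILABLE_COMPONENTS are hypothetical names introduced for illustration and are not part of TextDescriptives.

```python
# Component names as listed in this section (not read from the library itself).
AVAILABLE_COMPONENTS = (
    "descriptive_stats",
    "readability",
    "dependency_distance",
    "pos_proportions",
    "coherence",
    "quality",
)


def to_factory_names(components):
    """Prefix each component name with 'textdescriptives/'.

    Hypothetical helper: it only builds strings and rejects names
    that are not in the list above.
    """
    unknown = [name for name in components if name not in AVAILABLE_COMPONENTS]
    if unknown:
        raise ValueError(f"Unknown component(s): {unknown}")
    return [f"textdescriptives/{name}" for name in components]


names = to_factory_names(["descriptive_stats", "readability"])
for name in names:
    print(name)
```

Each resulting name could then be passed to nlp.add_pipe in turn, in the same way textdescriptives/all is added above.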
The calculated metrics can be conveniently extracted to a Pandas DataFrame using the extract_df function or to a dictionary using the extract_dict function.
td.extract_df(doc)
td.extract_dict(doc)
You can control which measures to extract with the metrics argument.
td.extract_df(doc, metrics = ["descriptive_stats", "readability", "dependency_distance", "pos_proportions", "coherence", "quality", "information_theory"])
Note
By default, extract_df and extract_dict add a column containing the text. You can change this behaviour by setting include_text=False.
Both functions also work on the objects returned by nlp.pipe. The output will be formatted with 1 row for each document and a column for each metric.
docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])
td.extract_df(docs, metrics = "dependency_distance")
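The layout described in the note (one row per document, one column per metric, plus an optional text column) can be pictured with plain Python dictionaries. This is only a sketch of the shape of the output, not the library's implementation: build_rows is a hypothetical helper, whitespace splitting stands in for spaCy tokenization, and the column names n_tokens and n_unique_tokens are assumptions chosen for illustration.

```python
def build_rows(texts, include_text=True):
    """Return one dict (row) per text, with one key (column) per metric."""
    rows = []
    for text in texts:
        tokens = text.split()  # stand-in for a spaCy tokenizer
        row = {
            "n_tokens": len(tokens),
            "n_unique_tokens": len(set(tokens)),
        }
        if include_text:
            # Put the text first, followed by the metric columns.
            row = {"text": text, **row}
        rows.append(row)
    return rows


texts = ["The world is changed.", "I feel it in the water."]
rows = build_rows(texts)
rows_without_text = build_rows(texts, include_text=False)
```

Here rows holds two entries, one per input text, and rows_without_text holds the same metrics without the text key, mirroring the effect of include_text=False.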
Using Specific Components
TextDescriptives includes 6 components that can be used individually: descriptive_stats, readability, dependency_distance, pos_proportions, coherence, and quality.
This can be helpful if you’re only interested in, e.g., readability metrics or descriptive statistics and don’t want to run a dependency parser.
If you have imported the TextDescriptives package, you can add them to a pipeline using the standard spaCy syntax.
nlp = spacy.blank("da")
nlp.add_pipe("textdescriptives/descriptive_stats")
docs = nlp.pipe(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning',
"Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])
# extract_df is clever enough to only extract metrics that are in the Doc
td.extract_df(docs, include_text = False)
Available Attributes
The table below shows the metrics included in TextDescriptives and the attributes they set on spaCy’s Doc, Span, and Token objects.
For more details on each metric, see the following sections in the documentation.
| Attribute | Component | Description |
|---|---|---|
| | descriptive_stats | Dict containing mean, median, and std of token length. |
| | descriptive_stats | Dict containing mean, median, and std of sentence length. |
| | descriptive_stats | Dict containing mean, median, and std of number of syllables per token. |
| | descriptive_stats | Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the Doc. |
| | readability | Dict containing Flesch Reading Ease, Flesch-Kincaid Grade, SMOG, Gunning-Fog, Automated Readability Index, Coleman-Liau Index, LIX, and RIX readability metrics for the Doc. |
| | dependency_distance | Dict containing the mean and standard deviation of the dependency distance and proportion adjacent dependency relations in the Doc. |
| | pos_proportions | Dict containing the proportion of each part-of-speech tag in the Doc. |
| | coherence | Dict containing the first and second order coherence scores for the Doc. |
| | quality | Dict containing the quality scores for the Doc. |
| | quality | Boolean indicator of whether the Doc passed the quality check. |
| | information_theory | Dict containing the information theory scores for the Doc. |
| | information_theory | The entropy score for the Doc as a float. |
| | information_theory | The perplexity score for the Doc as a float. |
| | information_theory | The per-word perplexity score for the Doc as a float. |
| | descriptive_stats | Dict containing mean, median, and std of token length in the Span. |
| | descriptive_stats | Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the Span. |
| | dependency_distance | Dict containing the mean dependency distance and proportion adjacent dependency relations in the Span. |
| | quality | Dict containing the quality scores for the Span. |
| | information_theory | The entropy score for the Span as a float. |
| | information_theory | The perplexity score for the Span as a float. |
| | information_theory | The per-word perplexity score for the Span as a float. |
| | dependency_distance | Dict containing the dependency distance and whether the head word is adjacent for a Token. |
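As a rough illustration of what the descriptive_stats rows in the table describe, the sketch below computes the mean, median, and standard deviation of token length, together with token counts, using only the Python standard library. It is not the library's implementation: whitespace splitting replaces spaCy tokenization, the use of the population standard deviation is an assumption, and the function names and dictionary keys are hypothetical.

```python
import statistics


def token_length_stats(tokens):
    """Mean, median, and std of token length, measured in characters."""
    lengths = [len(tok) for tok in tokens]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        # Population standard deviation; an assumption for this sketch.
        "std": statistics.pstdev(lengths),
    }


def token_counts(tokens):
    """Number of tokens, unique tokens, their proportion, and characters."""
    unique = {tok.lower() for tok in tokens}
    return {
        "n_tokens": len(tokens),
        "n_unique_tokens": len(unique),
        "proportion_unique_tokens": len(unique) / len(tokens),
        "n_characters": sum(len(tok) for tok in tokens),
    }


# Whitespace tokenization stands in for a spaCy pipeline here.
tokens = "I feel it in the water I feel it in the earth".split()
length_stats = token_length_stats(tokens)
counts = token_counts(tokens)
print(length_stats)
print(counts)
```

With a real pipeline, the corresponding values would come from the attributes that the descriptive_stats component sets on the Doc or Span rather than from helpers like these.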