Descriptive Statistics
The descriptive_stats component computes descriptive statistics over the tokens, sentences, syllables, and characters of a text. It adds the following attributes:
{Doc/Span}._.counts
- Number of tokens.
- Number of unique tokens.
- Proportion of unique tokens.
- Number of characters.
{Doc/Span}._.sentence_length
- Mean sentence length.
- Median sentence length.
- Standard deviation of sentence length.
{Doc/Span}._.syllables
- Mean number of syllables per token.
- Median number of syllables per token.
- Standard deviation of the number of syllables per token.
{Doc/Span}._.token_length
- Mean token length.
- Median token length.
- Standard deviation of token length.
Usage
import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/descriptive_stats")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
# all attributes are stored as a single dict in the ._.descriptive_stats attribute
doc._.descriptive_stats
# or individually
doc._.counts
doc._.sentence_length
doc._.syllables
doc._.token_length
# extract to dataframe
td.extract_df(doc)
| | text | token_length_mean | token_length_median | token_length_std | sentence_length_mean | sentence_length_median | sentence_length_std | syllables_per_token_mean | syllables_per_token_median | syllables_per_token_std | n_tokens | n_unique_tokens | proportion_unique_tokens | n_characters | n_sentences |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The world is changed(…) | 3.28571 | 3 | 1.54127 | 7 | 6 | 3.09839 | 1.08571 | 1 | 0.368117 | 35 | 23 | 0.657143 | 121 | 5 |
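The count, sentence-length, and token-length figures reported for the example text can be cross-checked with a few lines of plain Python. This is only a rough sketch and not the component's implementation: a regular expression stands in for spaCy's tokenizer and sentence splitter, and the helper names (`summarize`, `rough_descriptive_stats`) are hypothetical. It assumes that punctuation is excluded from token counts, that unique tokens are compared case-insensitively, that characters are counted as non-whitespace characters, and that the standard deviation is the population form. These assumptions are inferred from the reported values rather than taken from the library. Syllable statistics are left out because they need a syllable counter.

```python
import re
import statistics

TEXT = (
    "The world is changed. I feel it in the water. I feel it in the earth. "
    "I smell it in the air. Much that once was is lost, for none now live "
    "who remember it."
)


def summarize(values):
    """Mean, median and population standard deviation of a list of numbers."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),  # assumption: population std (divide by N)
    }


def rough_descriptive_stats(text):
    """Approximate the descriptive statistics without spaCy (illustrative only)."""
    # Assumption: sentences end at ., ! or ?; tokens are runs of word characters,
    # so punctuation is not counted as a token.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens_per_sentence = [re.findall(r"\w+", s) for s in sentences]
    tokens = [tok for sent in tokens_per_sentence for tok in sent]

    # Assumption: uniqueness is case-insensitive ("The" and "the" count once).
    n_unique = len({tok.lower() for tok in tokens})

    return {
        "counts": {
            "n_tokens": len(tokens),
            "n_unique_tokens": n_unique,
            "proportion_unique_tokens": n_unique / len(tokens),
            # Assumption: every non-whitespace character counts, punctuation included.
            "n_characters": sum(not ch.isspace() for ch in text),
            "n_sentences": len(sentences),
        },
        "sentence_length": summarize([len(sent) for sent in tokens_per_sentence]),
        "token_length": summarize([len(tok) for tok in tokens]),
    }


stats = rough_descriptive_stats(TEXT)
for name, values in stats.items():
    print(name, values)
```

For other texts the regular-expression tokenizer will diverge from spaCy (contractions, abbreviations, numbers), so the component's own attributes remain the reference values.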
Component
- textdescriptives.components.descriptive_stats.create_descriptive_stats_component(nlp: Language, name: str, verbose: bool) -> Callable[[Doc], Doc]
Allows DescriptiveStatistics to be added to a spaCy pipeline using nlp.add_pipe("textdescriptives/descriptive_stats").
Adding the component to the pipeline adds the following attributes to Doc and Span objects:
doc._.n_sentences
doc._.n_tokens
doc._.token_length
doc._.sentence_length
doc._.syllables
doc._.counts
doc._.descriptive_stats
span._.token_length
span._.counts
span._.descriptive_stats
- Parameters:
nlp (Language) – spaCy language object. It does not need to be specified in the nlp.add_pipe call.
name (str) – name of the component. Can be optionally specified in the nlp.add_pipe call, using the name argument.
- Returns:
DescriptiveStatistics component
- Return type:
Callable[[Doc], Doc]
Example
>>> import spacy
>>> import textdescriptives as td
>>> nlp = spacy.blank("en")
>>> # add sentencizer
>>> nlp.add_pipe("sentencizer")
>>> # add descriptive stats
>>> nlp.add_pipe("textdescriptives/descriptive_stats")
>>> # apply to a document
>>> doc = nlp("This is a sentence. This is another sentence.")
>>> doc._.descriptive_stats