Descriptive Statistics

The descriptive_stats component computes descriptive statistics for a text and adds them as the following extension attributes:

  • {Doc/Span}._.counts

    • Number of tokens.

    • Number of unique tokens.

    • Proportion of unique tokens.

    • Number of characters.

  • {Doc/Span}._.sentence_length

    • Mean sentence length.

    • Median sentence length.

    • Standard deviation of sentence length.

  • {Doc/Span}._.syllables

    • Mean number of syllables per token.

    • Median number of syllables per token.

    • Standard deviation of number of syllables per token.

  • {Doc/Span}._.token_length

    • Mean token length.

    • Median token length.

    • Standard deviation of token length.

Usage

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/descriptive_stats")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# all attributes are stored as a single dict in the ._.descriptive_stats attribute
doc._.descriptive_stats

# or individually
doc._.counts
doc._.sentence_length
doc._.syllables
doc._.token_length

# extract to dataframe
td.extract_df(doc)

Result of td.extract_df(doc): a DataFrame with one row (index 0), shown here with one column per line.

column                        value
text                          The world is changed(…)
token_length_mean             3.28571
token_length_median           3
token_length_std              1.54127
sentence_length_mean          7
sentence_length_median        6
sentence_length_std           3.09839
syllables_per_token_mean      1.08571
syllables_per_token_median    1
syllables_per_token_std       0.368117
n_tokens                      35
n_unique_tokens               23
proportion_unique_tokens      0.657143
n_characters                  121
n_sentences                   5
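
To make the reported numbers easier to interpret, the following standard-library sketch recomputes the token-length, sentence-length, and count statistics for the example text without spaCy. It is not the library's implementation. The helpers words and summarize are hypothetical, the regex word split and the naive split on "." are assumptions that merely stand in for spaCy's tokenizer and sentence segmenter, and the choices of population standard deviation and case-insensitive uniqueness are inferred only from the fact that they reproduce the values shown above. Syllable statistics are left out because they cannot be derived with a rule this simple.

```python
import re
from statistics import mean, median, pstdev

TEXT = (
    "The world is changed. I feel it in the water. I feel it in the earth. "
    "I smell it in the air. Much that once was is lost, for none now live "
    "who remember it."
)


def words(text):
    """Return alphabetic words only; an assumed stand-in for non-punctuation tokens."""
    return re.findall(r"[A-Za-z]+", text)


def summarize(values):
    """Mean, median, and population standard deviation of a list of numbers."""
    return {"mean": mean(values), "median": median(values), "std": pstdev(values)}


# Naive sentence split on "."; sufficient for this text, not a general segmenter.
sentences = [s for s in TEXT.split(".") if s.strip()]
tokens = words(TEXT)

token_length = summarize([len(t) for t in tokens])
sentence_length = summarize([len(words(s)) for s in sentences])

# Uniqueness is checked on lowercased words (assumption; see the note above).
unique_tokens = {t.lower() for t in tokens}
counts = {
    "n_tokens": len(tokens),
    "n_unique_tokens": len(unique_tokens),
    "proportion_unique_tokens": len(unique_tokens) / len(tokens),
    "n_sentences": len(sentences),
}

print({k: round(v, 5) for k, v in token_length.items()})
print({k: round(v, 5) for k, v in sentence_length.items()})
print(counts)
```

On this text the sketch gives a mean sentence length of 7, a median of 6, and a standard deviation of about 3.09839, in line with the sentence_length_* values above. Texts with abbreviations, numbers, or contractions would likely diverge from spaCy's tokenization.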


Component

textdescriptives.components.descriptive_stats.create_descriptive_stats_component(nlp: Language, name: str, verbose: bool) → Callable[[Doc], Doc]

Allows DescriptiveStatistics to be added to a spaCy pipeline using nlp.add_pipe("textdescriptives/descriptive_stats").

Adding the component to the pipe will add the following attributes to Doc and Span objects:

  • doc._.n_sentences

  • doc._.n_tokens

  • doc._.token_length

  • doc._.sentence_length

  • doc._.syllables

  • doc._.counts

  • doc._.descriptive_stats

  • span._.token_length

  • span._.counts

  • span._.descriptive_stats

Parameters:
  • nlp (Language) – spaCy language object. It does not need to be specified in the nlp.add_pipe call.

  • name (str) – name of the component. It can optionally be specified in the nlp.add_pipe call using the name argument.

Returns:

DescriptiveStatistics component

Return type:

Callable[[Doc], Doc]

Example

>>> import spacy
>>> import textdescriptives as td
>>> nlp = spacy.blank("en")
>>> # add sentencizer
>>> nlp.add_pipe("sentencizer")
>>> # add descriptive stats
>>> nlp.add_pipe("textdescriptives/descriptive_stats")
>>> # apply to a document
>>> doc = nlp("This is a sentence. This is another sentence.")
>>> doc._.descriptive_stats