Readability#

The readability component adds the following readability metrics to Doc objects under the ._.readability attribute.

Note

Note that the hyphenation module (Pyphen) does not support all languages. If a language is not supported, a warning is raised and metrics that require hyphenation are set to np.nan.
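
Because np.nan is an ordinary float that never compares equal to itself, downstream code should test for it with math.isnan (or numpy.isnan) rather than with ==. A minimal sketch, assuming the metrics arrive as a plain dict; the `scores` values below are made up for illustration and are not output from the library:

```python
import math

# Hypothetical readability dict for a language without hyphenation support.
# Values are illustrative only; NaN marks metrics that need syllable counts.
scores = {
    "flesch_reading_ease": float("nan"),
    "smog": float("nan"),
    "lix": 12.7,
    "rix": 0.4,
}

# Keep only the metrics that were actually computed.
available = {name: value for name, value in scores.items() if not math.isnan(value)}
print(sorted(available))
```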

`Gunning-Fog <https://en.wikipedia.org/wiki/Gunning_fog_index>`__ is a readability index originally developed for English writing, although the formula can be applied to any language. The index estimates the years of formal education needed to understand the text on a first reading. A Gunning-Fog index of 12 requires the reading level of a U.S. high school senior (around 18 years old). The formula for calculating the index is:

Grade level = 0.4 × (ASL + PHW)

Where ASL is the average sentence length (total words / total sentences), and PHW is the percentage of hard words (words with three or more syllables).

Note: requires hyphenation.
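
The Gunning-Fog formula can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only, not the library's implementation; the function name and the counts passed to it are assumptions (35 words and 5 sentences match the example sentence in the Usage section, and a single hard word is assumed).

```python
def gunning_fog(n_words: int, n_sentences: int, n_hard_words: int) -> float:
    """Gunning-Fog grade level: 0.4 * (ASL + PHW)."""
    asl = n_words / n_sentences          # average sentence length
    phw = 100 * n_hard_words / n_words   # percentage of hard words
    return 0.4 * (asl + phw)


# Assumed counts: 35 words, 5 sentences, 1 word with three or more syllables.
fog = gunning_fog(n_words=35, n_sentences=5, n_hard_words=1)
print(round(fog, 5))
```

With these counts the result rounds to 3.94286, which agrees with the gunning_fog value in the Usage output.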

`SMOG <https://en.wikipedia.org/wiki/SMOG>`__, or Simple Measure of Gobbledygook, is a readability formula that estimates the years of education required to understand a piece of writing. It focuses on word complexity, using the number of polysyllabic ("hard") words in the text, i.e. words with three or more syllables. The formula is:

SMOG Index = 1.043 × √(30 × (hard words / n_sentences)) + 3.1291

Note: requires hyphenation.
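
As a rough check of the SMOG formula, the sketch below evaluates it for assumed counts. The helper name and the inputs (one hard word across five sentences) are illustrative assumptions, not values produced by the library.

```python
import math


def smog_index(n_hard_words: int, n_sentences: int) -> float:
    """SMOG index from the number of hard words and sentences."""
    return 1.043 * math.sqrt(30 * (n_hard_words / n_sentences)) + 3.1291


# Assumed counts: 1 polysyllabic word in 5 sentences.
smog = smog_index(n_hard_words=1, n_sentences=5)
print(round(smog, 5))
```

A text with no hard words at all yields the constant term, 3.1291, which is the floor of the index.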

`Flesch reading ease <https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease>`__ is a readability score that indicates how easy a text is to read. Higher scores indicate easier reading, while lower scores indicate more difficult reading. The score is calculated using the following formula:

Flesch Reading Ease = 206.835 - (1.015 × ASL) - (84.6 × ASW)

Where ASL is the average sentence length and ASW is the average number of syllables per word.

Note: requires hyphenation.
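
A small sketch of the Flesch reading ease calculation, working from raw counts. The function name is hypothetical, and the syllable total of 38 is an assumption inferred from the example output in the Usage section (syllables_per_token_mean × n_tokens); it is not obtained from Pyphen here.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch reading ease: 206.835 - 1.015 * ASL - 84.6 * ASW."""
    asl = n_words / n_sentences   # average sentence length
    asw = n_syllables / n_words   # average syllables per word
    return 206.835 - (1.015 * asl) - (84.6 * asw)


# Assumed counts: 35 words, 5 sentences, 38 syllables.
fre = flesch_reading_ease(n_words=35, n_sentences=5, n_syllables=38)
print(round(fre, 3))
```

Short sentences built from mostly one-syllable words can push the score above 100, as happens with these counts.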

`Flesch-Kincaid grade <https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level>`__ is a readability metric that estimates the grade level needed to comprehend a text. It is based on the average sentence length (ASL) and the average number of syllables per word (ASW). The formula is:

Flesch-Kincaid Grade = 0.39 × (ASL) + 11.8 × (ASW) - 15.59

Note: requires hyphenation.
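
The grade formula takes the two averages directly, so a sketch only needs ASL and ASW as inputs. The function name is an assumption, and the inputs (7 words per sentence, 38/35 syllables per word) are illustrative values inferred from the example output in the Usage section.

```python
def flesch_kincaid_grade(asl: float, asw: float) -> float:
    """Flesch-Kincaid grade: 0.39 * ASL + 11.8 * ASW - 15.59."""
    return 0.39 * asl + 11.8 * asw - 15.59


# Assumed averages: 35 words / 5 sentences, 38 syllables / 35 words.
fk_grade = flesch_kincaid_grade(asl=35 / 5, asw=38 / 35)
print(round(fk_grade, 4))
```

Very simple text can produce a grade slightly below zero, since the formula is a linear fit and is not clamped.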

`Automated readability index <https://en.wikipedia.org/wiki/Automated_readability_index>`__ is a readability test that calculates an approximate U.S. grade level needed to understand a text. It is based on the average number of characters per word and the average sentence length. The formula is:

ARI = 4.71 × (n_chars / n_words) + 0.5 × (n_words / n_sentences) - 21.43
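
The ARI needs no syllable counts, so it can be sketched from three integers. The function name is hypothetical. The value n_chars=115 counts only the characters inside the 35 word tokens of the Usage example (punctuation excluded); that choice is an inference from the token_length_mean column in the example output, not documented behaviour.

```python
def automated_readability_index(n_chars: int, n_words: int, n_sentences: int) -> float:
    """ARI: 4.71 * chars per word + 0.5 * words per sentence - 21.43."""
    return 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sentences) - 21.43


# Assumed counts: 115 characters in word tokens, 35 words, 5 sentences.
ari = automated_readability_index(n_chars=115, n_words=35, n_sentences=5)
print(round(ari, 5))
```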

`Coleman-Liau index <https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index>`__ is a readability test that estimates the U.S. grade level needed to understand a text. It is based on the average number of letters per 100 words and the average number of sentences per 100 words. The original formula is:

CLI = 0.0588 × L - 0.296 × S - 15.8

Where L is the average number of characters per 100 words and S is the average number of sentences per 100 words. In our implementation we average over the entire text instead of just 100 words.
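
Averaging over the entire text amounts to scaling the whole-text ratios to a per-100-words basis. A minimal sketch of that reading of the formula follows; the function name and the counts are assumptions (115 characters in word tokens, 35 words, 5 sentences), and the library's exact character counting may differ.

```python
def coleman_liau_index(n_chars: int, n_words: int, n_sentences: int) -> float:
    """CLI with L and S computed from whole-text ratios, scaled to 100 words."""
    chars_per_100_words = n_chars / n_words * 100          # L
    sentences_per_100_words = n_sentences / n_words * 100  # S
    return 0.0588 * chars_per_100_words - 0.296 * sentences_per_100_words - 15.8


# Assumed counts for illustration.
cli = coleman_liau_index(n_chars=115, n_words=35, n_sentences=5)
print(round(cli, 6))
```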

`Lix <https://en.wikipedia.org/wiki/Lix_(readability_test)>`__, or Lesbarhetsindex, is a readability score based on the average sentence length and the percentage of long words (more than six characters) in the text. The formula is:

Lix = (n_words / n_sentences) + (n_long_words × 100) / n_words
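
Since Lix only needs word lengths, it can be approximated without any NLP pipeline. The sketch below uses a naive regex tokenizer and counts full stops as sentence boundaries; both are stand-ins assumed for illustration and are not how the component tokenizes (it relies on spaCy).

```python
import re

text = (
    "The world is changed. I feel it in the water. I feel it in the earth. "
    "I smell it in the air. Much that once was is lost, for none now live "
    "who remember it."
)

# Naive stand-ins for spaCy's tokenizer and sentence splitter (assumption).
words = re.findall(r"[A-Za-z]+", text)
n_sentences = text.count(".")
n_long_words = sum(len(word) > 6 for word in words)  # more than six characters

lix = len(words) / n_sentences + (n_long_words * 100) / len(words)
print(round(lix, 4))
```

For this sentence the naive counts are 35 words, 5 sentences, and 2 long words ("changed" and "remember"), giving 12.7143.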

`Rix <https://www.jstor.org/stable/40031755>`__ is a readability measure that estimates the difficulty of a text based on the average number of long words (more than six characters) per sentence. The formula is:

Rix = (n_long_words / n_sentences)
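
Rix reduces to a single division once long words have been counted. A minimal sketch, assuming the text is already split into word tokens; the function name and the whitespace tokenization are illustrative, not the library's implementation.

```python
def rix(tokens: list[str], n_sentences: int) -> float:
    """Rix: number of long words (more than six characters) per sentence."""
    n_long_words = sum(len(token) > 6 for token in tokens)
    return n_long_words / n_sentences


# One sentence, punctuation already removed; only "remember" exceeds six characters.
tokens = "Much that once was is lost for none now live who remember it".split()
rix_score = rix(tokens, n_sentences=1)
print(rix_score)
```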

Usage#

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/readability")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# all attributes are stored as a dict in the ._.readability attribute
doc._.readability

# extract to dataframe
td.extract_df(doc)

	text	flesch_reading_ease	flesch_kincaid_grade	smog	gunning_fog	automated_readability_index	coleman_liau_index	lix	rix	token_length_mean	token_length_median	token_length_std	sentence_length_mean	sentence_length_median	sentence_length_std	syllables_per_token_mean	syllables_per_token_median	syllables_per_token_std	n_tokens	n_unique_tokens	proportion_unique_tokens	n_characters	n_sentences
0	The world is changed(…)	107.879	-0.0485714	5.68392	3.94286	-2.45429	-0.708571	12.7143	0.4	3.28571	3	1.54127	7	6	3.09839	1.08571	1	0.368117	35	23	0.657143	121	5

Component#

textdescriptives.components.readability.create_readability_component(nlp: Language, name: str, verbose: bool) → Callable[[Doc], Doc]

Allows Readability to be added to a spaCy pipeline using nlp.add_pipe("textdescriptives/readability").

Readability requires attributes from DescriptiveStatistics and adds that component to the pipeline if it is not already loaded.

Adding this component to a pipeline sets the following attributes:

doc._.readability

Parameters:

nlp (Language) – spaCy language object. Does not need to be specified in the nlp.add_pipe call.
name (str) – name of the component. Can optionally be specified in the nlp.add_pipe call using the name argument.
verbose (bool) – Toggle to show a message if the "textdescriptives/descriptive_stats" component is added to the pipeline. Defaults to True.

Returns:

The Readability component

Return type:

Callable[[Doc], Doc]

Example

>>> import spacy
>>> import textdescriptives as td
>>> nlp = spacy.blank("en")
>>> nlp.add_pipe("textdescriptives/readability")
>>> # apply the pipeline to a document
>>> doc = nlp("This is a test document.")
>>> doc._.readability