Quality#

The quality component adds the following quality metrics under the `._.quality` attribute to Doc and Span objects.

Heuristic quality metrics:

  • Number of stop words (n_stop_words): The number of stop words in the document.

  • Alpha ratio (alpha_ratio): Ratio of words containing at least one alphabetic character.

  • Mean word length (mean_word_length): Mean/average word length.

  • Proportion of ellipsis (proportion_ellipsis): Proportion of lines in a document which end with an ellipsis.

  • Proportion of bullet points (proportion_bullet_points): Proportion of lines in a document which start with a bullet point.

  • Symbol-to-word ratio (symbol_{symbol}_2_word_ratio): Ratio of a specified symbol to words, e.g. the ratio of hashtags or curly brackets to words.

  • Contains string (contains_{string}): Whether the document contains a specified string, for instance "lorem ipsum".

  • Out-of-vocabulary ratio (oov_ratio): Ratio of out-of-vocabulary words to total words.
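Two of the heuristic metrics above can be sketched in a few lines. This is a simplified illustration using whitespace tokenization, not the library's implementation (which operates on spaCy tokens):

```python
def alpha_ratio(words):
    """Fraction of words containing at least one alphabetic character."""
    if not words:
        return 0.0
    return sum(any(c.isalpha() for c in w) for w in words) / len(words)

def mean_word_length(words):
    """Mean word length in characters."""
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

words = "The world is changed .".split()
print(alpha_ratio(words))       # 4 of 5 tokens contain a letter -> 0.8
print(mean_word_length(words))  # (3+5+2+7+1)/5 = 3.6
```

Note how tokenization choices affect both metrics: counting the final "." as a token lowers the alpha ratio, which is why the default alpha_ratio threshold below is lower than in the original papers.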

Repetitious text metrics:

  • Duplicate lines character fraction (duplicate_lines_chr_fraction): Fraction of characters in a document which are contained within duplicate lines.

  • Duplicate paragraphs character fraction (duplicate_paragraphs_chr_fraction): Fraction of characters in a document which are contained within duplicate paragraphs.

  • Duplicate n-gram character fraction (duplicate_{n}_gram_chr_fraction): Fraction of characters in a document which are contained within duplicate n-grams, for a specified n-gram range.

  • Top n-gram character fraction (top_{n}_gram_chr_fraction): Fraction of characters in a document which are contained within the top n-grams, for a specified n-gram range.
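The line-based repetition metric can be sketched as follows. This is an illustration of the metric's definition, not the library's implementation; the exact handling of empty lines and newline variants may differ:

```python
from collections import Counter

def duplicate_lines_chr_fraction(text):
    """Fraction of characters belonging to lines that occur more than once."""
    lines = [line for line in text.split("\n") if line]
    if not lines:
        return 0.0
    counts = Counter(lines)
    total_chars = sum(len(line) for line in lines)
    # count all characters of every line whose content appears more than once
    duplicate_chars = sum(len(line) * n for line, n in counts.items() if n > 1)
    return duplicate_chars / total_chars

# "abc" occurs twice: 6 of 11 non-newline characters are in duplicate lines
print(duplicate_lines_chr_fraction("abc\nabc\ndefgh"))
```

The paragraph- and n-gram-based fractions follow the same pattern, with paragraphs or character spans of n-grams in place of lines.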

These quality metrics were used by, for example, Rae et al. (2021) and Raffel et al. (2020) to filter large text corpora for pre-training language models.

Note: this implementation is not optimized for speed, but rather for usability, simplicity, and spaCy integration. If you need to run quality filters on a large corpus, you should consider using the implementation from Danish Foundation Models, which also includes a number of other quality filters and deduplication strategies.

Usage#

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/quality")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# all attributes are stored as a dict in the ._.quality attribute
doc._.quality
# check if the document passed all quality checks
doc._.passed_quality_check

# extract to dataframe
td.extract_df(doc)

| Column | Value (row 0) |
|---|---|
| text | The world is changed(…) |
| n_stop_words | 24 |
| alpha_ratio | 0.853659 |
| mean_word_length | 2.95122 |
| doc_length | 41 |
| proportion_ellipsis | 0 |
| proportion_bullet_points | 0 |
| duplicate_line_chr_fraction | 0 |
| duplicate_paragraph_chr_fraction | 0 |
| duplicate_5-gram_chr_fraction | 0.232258 |
| duplicate_6-gram_chr_fraction | 0.232258 |
| duplicate_7-gram_chr_fraction | 0 |
| duplicate_8-gram_chr_fraction | 0 |
| duplicate_9-gram_chr_fraction | 0 |
| duplicate_10-gram_chr_fraction | 0 |
| top_2-gram_chr_fraction | 0.0580645 |
| top_3-gram_chr_fraction | 0.174194 |
| top_4-gram_chr_fraction | 0 |
| symbol_#_2_word_ratio | 0 |
| contains_lorem ipsum | False |
| passed_quality_check | False |

If you want to specify the thresholds for the quality metrics, you can do so by passing a QualityThresholds object to the component.

import spacy
import textdescriptives as td
from textdescriptives.components.quality_data_classes import QualityThresholds

nlp = spacy.load("en_core_web_sm")

# set thresholds for quality metrics (these are just the defaults)
thresholds = QualityThresholds(
    n_stop_words=(2, None),   # at least 2 stop words, no upper bound
    alpha_ratio=(0.7, None),
    mean_word_length=(3, 10),  # mean word length between 3 and 10 characters
    doc_length=(10, 100000),
    symbol_to_word_ratio={"#": (None, 0.1)},
    proportion_ellipsis=(None, 0.3),
    proportion_bullet_points=(None, 0.8),
    contains={"lorem ipsum": False},
    duplicate_line_chr_fraction=(None, 0.2),
    duplicate_paragraph_chr_fraction=(None, 0.2),
    duplicate_ngram_chr_fraction={
        "5": (None, 0.15),
        "6": (None, 0.14),
        "7": (None, 0.13),
        "8": (None, 0.12),
        "9": (None, 0.11),
        "10": (None, 0.1),
    },
    top_ngram_chr_fraction={"2": (None, 0.2), "3": (None, 0.18), "4": (None, 0.16)},
    oov_ratio=(None, 0.2)
)


quality_pipe = nlp.add_pipe("textdescriptives/quality")
quality_pipe.set_quality_thresholds(thresholds)  # update the quality thresholds
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# all attributes are stored as a dict in the ._.quality attribute
doc._.quality
# check if the document passed all quality checks
doc._.passed_quality_check

Component#

textdescriptives.components.quality.create_quality_component(nlp: Language, name: str, top_ngram_range: Tuple[int, int], top_ngram_min_count: int, duplicate_n_gram_fraction_range: Tuple[int, int], vocab: Optional[Mapping], force: bool = True) → Callable[[Doc], Doc][source]#

Allows Quality to be added to a spaCy pipe using nlp.add_pipe("textdescriptives/quality").

Adding this component to a pipeline sets the following attributes:

  • {Span/Doc}._.quality

  • {Span/Doc}._.passed_quality_check

It also sets:

  • {Span/Doc}._.lines

  • {Span/Doc}._.paragraphs

These are used to calculate some of the quality metrics. They can be overwritten if you e.g. wish lines to be split on "\r\n" instead of "\n".

A large part of the quality metrics were proposed by [1] and [2] for filtering out low quality text from large text corpora.

References:

  • [1] Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., … & Irving, G. (2021). Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446.

  • [2] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140), 1-67.

Parameters:
  • nlp (Language) – spaCy language object, does not need to be specified in the nlp.add_pipe call.

  • name (str) – name of the component. Can be optionally specified in the nlp.add_pipe call, using the name argument.

  • top_ngram_range (Tuple[int]) – range of n-grams to calculate the proportion of the top n-gram. Defaults to [2, 4].

  • top_ngram_min_count (int) – minimum number of times a n-gram must occur to be considered a top n-gram. Defaults to 3.

  • duplicate_n_gram_fraction_range (Tuple[int]) – range of n-grams to calculate the proportion of duplicate n-grams. Defaults to [5, 10].

  • vocab (Optional[Mapping]) – vocabulary to use for calculating the out-of-vocabulary ratio (oov_ratio). If None, will use the vocabulary of the spaCy model. Note, that small spaCy models do not have a vocabulary. The attribute will only be set if the vocabulary is not None or the spaCy model is medium or large.

  • force (bool) – whether to overwrite existing extensions. Defaults to True.

Returns:

the Quality component

Return type:

Callable[[Doc], Doc]
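The vocab parameter above drives the oov_ratio metric. As a rough sketch of what that ratio measures (a hypothetical standalone helper with a toy vocabulary, not the library's implementation, which uses the spaCy model's vocabulary):

```python
def oov_ratio(words, vocab):
    """Fraction of words not found in the given vocabulary."""
    if not words:
        return 0.0
    return sum(w not in vocab for w in words) / len(words)

# toy vocabulary for illustration; a real run uses the spaCy model's vocab
vocab = {"the", "cat", "sat"}
print(oov_ratio(["the", "cat", "blorp"], vocab))  # 1 of 3 words is OOV
```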

Example

>>> import spacy
>>> import textdescriptives as td
>>> nlp = spacy.blank("en")
>>> nlp.add_pipe("textdescriptives/quality")
>>> doc = nlp("This is a test")
>>> # extract quality metrics
>>> doc._.quality
>>> # check whether the document passed the quality thresholds
>>> doc._.passed_quality_check

Data Classes#

pydantic model textdescriptives.components.quality_data_classes.QualityThresholds[source]#

Thresholds for quality metrics.

Config:
  • extra: str = forbid

field alpha_ratio: Tuple[Optional[float], Optional[float]] = (0.7, None)#

A range for the alpha ratio. Default: (0.7, None), i.e. at least 70% of tokens contain at least one alphabetic character, but no upper limit. Note this is lowered from the original 0.8 to account for a different definition of word boundaries; e.g. in spaCy, punctuation is not part of a word.

field contains: Dict[str, bool] = {'lorem ipsum': False}#

A dictionary of strings and whether they should be contained in the document. Default: {‘lorem ipsum’: False}, i.e. the document should not contain the string ‘lorem ipsum’.

field doc_length: Tuple[Optional[float], Optional[float]] = (10, 100000)#

A Range for the document length. Default: (10, 100_000), i.e. between 10 and 100_000 words (spacy tokens).

field duplicate_line_chr_fraction: Tuple[Optional[float], Optional[float]] = (None, 0.2)#

A Range for the duplicate line character fraction. Default: (None, 0.2), i.e. no lower limit, but at most 20% of characters are duplicates.

field duplicate_ngram_chr_fraction: Dict[str, Tuple[Optional[float], Optional[float]]] = {'10': (None, 0.1), '5': (None, 0.15), '6': (None, 0.14), '7': (None, 0.13), '8': (None, 0.12), '9': (None, 0.11)}#

A dictionary of n-gram lengths and the allowed range for the duplicate n-gram character fraction. Default: {5: (None, 0.15), 6: (None, 0.14), 7: (None, 0.13), 8: (None, 0.12), 9: (None, 0.11), 10: (None, 0.1)}, i.e. no lower limit, but at most 15% of characters are duplicates for 5-grams, 14% for 6-grams, 13% for 7-grams, 12% for 8-grams, 11% for 9-grams and 10% for 10-grams.

field duplicate_paragraph_chr_fraction: Tuple[Optional[float], Optional[float]] = (None, 0.2)#

A Range for the duplicate paragraph character fraction. Default: (None, 0.2), i.e. no lower limit, but at most 20% of characters are duplicates.

field mean_word_length: Tuple[Optional[float], Optional[float]] = (3, 10)#

A Range for the mean word length. Default: (3, 10), i.e. between 3 and 10 characters.

field n_stop_words: Tuple[Optional[float], Optional[float]] = (2, None)#

A Range for the number of stop words. Default: (2, None), i.e. at least 2 stop words, but no upper limit.

field oov_ratio: Tuple[Optional[float], Optional[float]] = (None, 0.2)#

A range for the out-of-vocabulary ratio. Default: (None, 0.2) i.e. no lower limit, but at most 20% of words are out-of-vocabulary.

field proportion_bullet_points: Tuple[Optional[float], Optional[float]] = (None, 0.8)#

A range for the proportion of lines which start with a bullet point. Default: (None, 0.8), i.e. no lower limit, but at most 80% of lines start with a bullet point.

field proportion_ellipsis: Tuple[Optional[float], Optional[float]] = (None, 0.3)#

A range for the proportion of lines which end with an ellipsis. Default: (None, 0.3), i.e. no lower limit, but at most 30% of lines end with an ellipsis.

field symbol_to_word_ratio: Dict[str, Tuple[Optional[float], Optional[float]]] = {'#': (None, 0.1)}#

A dict of symbols and the allowed range for the symbol-to-word ratio, i.e. the ratio between symbol occurrence and word occurrence. Defaults to {'#': (None, 0.1)}, i.e. no lower limit, but at most a ratio of 0.1 between the number of hashtags and the number of words: if we have 100 words, the symbol should appear no more than 10 times. Symbols not in the dict are not checked.

field top_ngram_chr_fraction: Dict[str, Tuple[Optional[float], Optional[float]]] = {'2': (None, 0.2), '3': (None, 0.18), '4': (None, 0.16)}#

A dictionary of n-gram lengths and the allowed range for the top n-gram character fraction. Default: {2: (None, 0.2), 3: (None, 0.18), 4: (None, 0.16)}, i.e. no lower limit, but at most 20% of characters are contained within a duplicate for 2-grams, 18% for 3-grams and 16% for 4-grams.

pydantic model textdescriptives.components.quality_data_classes.QualityOutput[source]#

The output of the quality function.

Config:
  • extra: str = forbid

field alpha_ratio: ThresholdsOutput [Required]#

The thresholds output for the alpha ratio.

field contains: Dict[str, ThresholdsOutput] [Required]#

The thresholds output for the presence of strings.

field doc_length: ThresholdsOutput [Required]#

The thresholds output for the document length.

field duplicate_line_chr_fraction: ThresholdsOutput [Required]#

The thresholds output for the duplicate line character fraction.

field duplicate_ngram_chr_fraction: Dict[str, ThresholdsOutput] [Required]#

The thresholds output for the duplicate n-gram character fraction.

field duplicate_paragraph_chr_fraction: ThresholdsOutput [Required]#

The thresholds output for the duplicate paragraph character fraction.

field mean_word_length: ThresholdsOutput [Required]#

The thresholds output for the mean word length.

field n_stop_words: ThresholdsOutput [Required]#

The thresholds output for the number of stop words.

field oov_ratio: ThresholdsOutput [Required]#

The thresholds output for the out-of-vocabulary ratio.

field proportion_bullet_points: ThresholdsOutput [Required]#

The thresholds output for the proportion of lines starting with bullet points.

field proportion_ellipsis: ThresholdsOutput [Required]#

The thresholds output for the proportion of lines ending with ellipsis.

field symbol_to_word_ratio: Dict[str, ThresholdsOutput] [Required]#

The thresholds output for the symbol-to-word-ratio.

field top_ngram_chr_fraction: Dict[str, ThresholdsOutput] [Required]#

The thresholds output for the top n-gram character fraction.

to_flat_value_dict() Dict[str, Any][source]#

Creates a flat dictionary representation of the object to allow for easy conversion to a pandas DataFrame.

property passed: bool#

Returns: bool: Whether all thresholds have been passed.

pydantic model textdescriptives.components.quality_data_classes.ThresholdsOutput[source]#

An output containing three items: 1) a threshold, which is either an interval or an accepted boolean value; 2) a value, which is the value of the metric; 3) a boolean which is True if the value is within the thresholds.

Example

>>> t_out = ThresholdsOutput(threshold=(0, 2), value=2)
>>> t_out
ThresholdsOutput(value=2.0, passed=True, threshold=(0.0, 2.0))
>>> t_out.passed
True
Config:
  • extra: str = forbid

field threshold: Optional[Union[Tuple[Optional[float], Optional[float]], bool]] [Required]#
field value: Optional[float] [Required]#
property passed: Optional[bool]#

Return True if the value is within the thresholds.
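The interval semantics used by the thresholds on this page can be sketched as follows. This is an illustrative standalone re-implementation under the stated conventions ((lower, upper) with None meaning "no bound", a bool threshold meaning the value must equal that bool), not the library's code:

```python
from typing import Optional, Tuple, Union

Interval = Tuple[Optional[float], Optional[float]]

def within_threshold(
    value: Optional[float],
    threshold: Union[Interval, bool, None],
) -> Optional[bool]:
    """Return whether `value` satisfies `threshold`; None if not checkable."""
    if value is None or threshold is None:
        return None
    if isinstance(threshold, bool):
        # boolean thresholds, as used for the `contains` checks
        return bool(value) == threshold
    lower, upper = threshold
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True

print(within_threshold(2, (0, 2)))         # True, as in the ThresholdsOutput example
print(within_threshold(0.25, (None, 0.2))) # False: above the upper bound
```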