Extractor

The extractors collect the computed metrics from documents. extract_metrics is meant to be called on raw text, whereas extract_df and extract_dict work on spaCy documents (Doc) that a pipeline has already processed.

API

textdescriptives.extractors.extract_metrics(text: Union[str, List[str]], lang: Optional[str] = None, metrics: Optional[Iterable[str]] = None, spacy_model: Optional[str] = None, spacy_model_size: str = 'lg') → DataFrame

Extract metrics from a text or a list of texts to a Pandas dataframe.

Parameters:
  • text (Union[str, List[str]]) – A text or a list of texts.

  • lang (str, optional) – Language of the text. If lang is set and no spaCy model is provided, a spaCy model for that language is downloaded and used automatically. Defaults to None.

  • metrics (Iterable[str], optional) – Which metrics to extract. One or more of ["descriptive_stats", "readability", "dependency_distance", "pos_proportions", "coherence", "quality", "information_theory"]. If None, all metrics from textdescriptives are extracted. Defaults to None.

  • spacy_model (str, optional) – The spaCy model to use. If not set, one is downloaded based on lang. Defaults to None.

  • spacy_model_size (str, optional) – Size of the spaCy model to download. Defaults to 'lg'.

Returns:

DataFrame with a row for each text and a column for each metric.

Return type:

pd.DataFrame

textdescriptives.extractors.extract_df(docs: Union[Iterable[Doc], Doc], metrics: Optional[Union[List[str], str]] = None, include_text: bool = True) → DataFrame

Extract calculated metrics from a spaCy Doc object or a generator of Docs from nlp.pipe to a Pandas DataFrame.

Parameters:
  • docs (Union[Iterable[Doc], Doc]) – An iterable of spaCy Docs or a single Doc

  • metrics (Union[list[str], str], optional) – Which metrics to extract. One or more of ["descriptive_stats", "readability", "dependency_distance", "pos_proportions", "coherence", "quality", "information_theory"]. Defaults to None, in which case metrics are extracted for every pipeline component that has been set.

  • include_text (bool, optional) – Whether to add a column containing the text. Defaults to True.

Returns:

DataFrame with a row for each doc and a column for each metric.

Return type:

pd.DataFrame

textdescriptives.extractors.extract_dict(docs: Union[Iterable[Doc], Doc], metrics: Optional[Union[List[str], str]] = None, include_text: bool = True) → List[Dict[str, Any]]

Extract calculated metrics from a spaCy Doc or an iterable of Docs to a list of dictionaries.

Parameters:
  • docs (Union[Iterable[Doc], Doc]) – An iterable of spaCy Docs or a single Doc

  • metrics (Union[list[str], str, None], optional) – Which metrics to extract. One or more of ["descriptive_stats", "readability", "dependency_distance", "pos_proportions", "coherence", "quality", "information_theory"]. Defaults to None, in which case metrics are extracted for every pipeline component that has been set.

  • include_text (bool, optional) – Whether to add an entry containing the text. Defaults to True.

Returns:

A list with one dictionary per Doc, each holding the extracted metrics.

Return type:

List[Dict[str, Any]]