Extractor
The extractors are used to extract the features from the document. extract_metrics is meant to be called on raw texts, whereas extract_df and extract_dict work on spaCy documents (Doc).
API
- textdescriptives.extractors.extract_metrics(text: Union[str, List[str]], lang: Optional[str] = None, metrics: Optional[Iterable[str]] = None, spacy_model: Optional[str] = None, spacy_model_size: str = 'lg') -> DataFrame
Extract metrics from a text or a list of texts to a Pandas dataframe.
- Parameters:
text (Union[str, List[str]]) – A text or a list of texts.
lang (str, optional) – Language of the text. If lang is set and no spaCy model is provided, a spaCy model for that language is downloaded and used automatically. Defaults to None.
metrics (Iterable[str], optional) – Which metrics to extract. One or more of [“descriptive_stats”, “readability”, “dependency_distance”, “pos_proportions”, “coherence”, “quality”, “information_theory”]. If None, all metrics from textdescriptives are extracted. Defaults to None.
spacy_model (str, optional) – The spaCy model to use. If not set, one is downloaded based on lang. Defaults to None.
spacy_model_size (str, optional) – Size of the spaCy model to download. Defaults to 'lg'.
- Returns:
DataFrame with a row for each text and a column for each metric.
- Return type:
pd.DataFrame
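As an illustration of the parameters above, the following is a minimal sketch of calling extract_metrics on raw strings. The names KNOWN_METRICS, validate_metrics and metrics_from_raw_texts are hypothetical helpers introduced here, not part of textdescriptives, and the value "sm" for spacy_model_size is an assumption (the signature above only documents the default 'lg').

```python
from typing import Iterable, List, Optional

# Metric names as listed in the metrics parameter description above.
KNOWN_METRICS = (
    "descriptive_stats",
    "readability",
    "dependency_distance",
    "pos_proportions",
    "coherence",
    "quality",
    "information_theory",
)


def validate_metrics(metrics: Optional[Iterable[str]]) -> Optional[List[str]]:
    """Hypothetical helper: reject metric names that are not in the documented list."""
    if metrics is None:
        return None  # None means "extract all metrics"
    names = list(metrics)
    unknown = [name for name in names if name not in KNOWN_METRICS]
    if unknown:
        raise ValueError(f"Unknown metrics: {unknown}")
    return names


def metrics_from_raw_texts(texts: List[str], metrics: Optional[Iterable[str]] = None):
    """Hypothetical wrapper: run extract_metrics on raw strings, return the DataFrame."""
    # Imported lazily so validate_metrics works without textdescriptives installed.
    from textdescriptives.extractors import extract_metrics

    # lang is set and spacy_model is not, so a model is downloaded per the docs above.
    # Assumption: "sm" is an accepted size and selects a smaller model than 'lg'.
    return extract_metrics(
        text=texts,
        lang="en",
        metrics=validate_metrics(metrics),
        spacy_model_size="sm",
    )
```

Calling metrics_from_raw_texts(["A short text.", "Another one."], metrics=["readability"]) should give one row per input string. Checking the names first is a design choice of this sketch, so a typo fails before any model download starts.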
- textdescriptives.extractors.extract_df(docs: Union[Iterable[Doc], Doc], metrics: Optional[Union[List[str], str]] = None, include_text: bool = True) -> DataFrame
Extract calculated metrics from a spaCy Doc object or a generator of Docs from nlp.pipe to a Pandas DataFrame.
- Parameters:
docs (Union[Iterable[Doc], Doc]) – An iterable of spaCy Docs or a single Doc
metrics (Union[List[str], str], optional) – Which metrics to extract. One or more of [“descriptive_stats”, “readability”, “dependency_distance”, “pos_proportions”, “coherence”, “quality”, “information_theory”]. Defaults to None, in which case metrics are extracted for every pipeline component that has been set.
include_text (bool, optional) – Whether to add a column containing the text. Defaults to True.
- Returns:
DataFrame with a row for each doc and a column for each metric.
- Return type:
pd.DataFrame
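Because extract_df works on processed Docs, the metrics must already have been computed by pipeline components. The sketch below shows one way this might look. It assumes that components are registered under names of the form "textdescriptives/<metric>" and that the en_core_web_sm model is installed; component_names and docs_to_dataframe are hypothetical helpers, not library functions.

```python
from typing import Iterable, List


def component_names(metrics: Iterable[str]) -> List[str]:
    """Map metric names to pipeline component names.

    Assumption: components are registered as 'textdescriptives/<metric>'.
    """
    return [f"textdescriptives/{metric}" for metric in metrics]


def docs_to_dataframe(texts: List[str], metrics: List[str]):
    """Hypothetical wrapper: process texts with spaCy, then collect metrics with extract_df."""
    # Lazy imports keep component_names usable without spaCy installed.
    import spacy
    import textdescriptives  # noqa: F401  (importing registers the components)
    from textdescriptives.extractors import extract_df

    # Assumption: the en_core_web_sm model is already installed.
    nlp = spacy.load("en_core_web_sm")
    for name in component_names(metrics):
        nlp.add_pipe(name)

    # nlp.pipe yields Docs lazily; extract_df accepts that iterable directly.
    docs = nlp.pipe(texts)
    return extract_df(docs, metrics=metrics, include_text=False)
```

With include_text=False the returned frame holds only metric columns, which may be preferable when the texts are long.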
- textdescriptives.extractors.extract_dict(docs: Union[Iterable[Doc], Doc], metrics: Optional[Union[List[str], str]] = None, include_text: bool = True) -> List[Dict[str, Any]]
Extract calculated metrics from a spaCy Doc or an iterable of Docs to a list of dictionaries.
- Parameters:
docs (Union[Iterable[Doc], Doc]) – An iterable of spaCy Docs or a single Doc
metrics (Union[List[str], str, None], optional) – Which metrics to extract. One or more of [“descriptive_stats”, “readability”, “dependency_distance”, “pos_proportions”, “coherence”, “quality”, “information_theory”]. Defaults to None, in which case metrics are extracted for every pipeline component that has been set.
include_text (bool, optional) – Whether to add an entry containing the text. Defaults to True.
- Returns:
A list with one dictionary of extracted metrics per Doc.
- Return type:
List[Dict[str, Any]]
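A list of dictionaries is convenient when the metrics should be written out as JSON rather than kept in a DataFrame. The sketch below serializes such a list as JSON Lines. Both records_to_jsonl and doc_metrics_as_jsonl are hypothetical helpers, and the Doc passed in is assumed to come from a pipeline that already contains the relevant textdescriptives components.

```python
import json
from typing import Any, Dict, List, Optional


def records_to_jsonl(records: List[Dict[str, Any]]) -> str:
    """Serialize one dictionary per line (JSON Lines)."""
    # default=str is a fallback for values the json module cannot encode natively.
    return "\n".join(json.dumps(record, default=str) for record in records)


def doc_metrics_as_jsonl(doc, metrics: Optional[List[str]] = None) -> str:
    """Hypothetical wrapper: extract metrics from a processed Doc and serialize them."""
    # Imported lazily so records_to_jsonl works without textdescriptives installed.
    from textdescriptives.extractors import extract_dict

    records = extract_dict(doc, metrics=metrics, include_text=False)
    return records_to_jsonl(records)
```

Falling back to str is a deliberately lossy choice for this sketch; a stricter converter may be preferable if numeric types must be preserved exactly.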