Introductory Tutorial#

TextDescriptives lets you quickly and easily calculate a large variety of statistics and metrics from texts.

The package includes a number of components that make it easy to extract exactly the metrics you care about, and nothing more.
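
For example, you can check which metric groups your installed version can extract before choosing. A minimal sketch, assuming the package exposes the get_valid_metrics helper at the top level, as in recent versions:

import textdescriptives as td

# the set of extractable metric groups, e.g. {"readability", "dependency_distance", ...}
td.get_valid_metrics()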

This tutorial introduces some of the components available in TextDescriptives and shows how you can use them to quickly analyse a text corpus. For more information on the components, see the documentation.

Using TextDescriptives#

In this tutorial we’ll use TextDescriptives to get a quick overview of the SMS Spam Collection Data Set. The dataset contains 5,572 SMS messages labelled as either ham (legitimate messages) or spam.

To start, let’s load a dataset and get a bit familiar with it.

try:
    import textdescriptives
except ImportError:
    # install the package (with the tutorial extras) if it is missing
    !pip install "textdescriptives[tutorials]"
import textdescriptives as td
import spacy

from textdescriptives.utils import load_sms_data

df = load_sms_data()
df.head()
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
df["label"].value_counts()
label
ham     4825
spam     747
Name: count, dtype: int64

The easy way: Extract metrics with extract_metrics#

Let’s extract readability and dependency distance metrics using the extract_metrics function, which handles a lot of boilerplate spaCy code for us. If you haven’t used spaCy before, you can set the lang parameter to the language of your texts instead of specifying spacy_model; TextDescriptives will then automatically download and use an appropriate model for you.

# assumes you have downloaded the `en_core_web_sm` model
metrics = td.extract_metrics(
    text=df["message"],
    spacy_model="en_core_web_sm",
    metrics=["readability", "dependency_distance"],
)
# alternatively, you can specify the language and (optionally) the model size
# metrics = td.extract_metrics(text=df["message"], lang="en", spacy_model_size="sm", metrics=["readability", "dependency_distance"])

# join the metrics to the original dataframe to get the label
metrics_df = df.join(metrics.drop(columns=["text"]))
metrics_df.head()
label message flesch_reading_ease flesch_kincaid_grade smog gunning_fog automated_readability_index coleman_liau_index lix rix ... sentence_length_median sentence_length_std syllables_per_token_mean syllables_per_token_median syllables_per_token_std n_tokens n_unique_tokens proportion_unique_tokens n_characters n_sentences
0 ham Go until jurong point, crazy.. Available only ... 94.318333 1.760000 3.1291 2.666667 1.449833 4.162000 11.666667 0.333333 ... 5.0 2.357023 1.250000 1.0 0.433013 20 20 1.000000 92 3
1 ham Ok lar... Joking wif u oni... 105.090000 -0.653333 NaN 1.200000 -5.800000 -8.026667 3.000000 0.000000 ... 3.0 1.000000 1.166667 1.0 0.372678 6 6 1.000000 24 2
2 spam Free entry in 2 a wkly comp to win FA Cup fina... 97.654524 1.957143 3.1291 3.733333 3.927024 6.858571 23.619048 1.333333 ... 12.0 5.249339 1.178571 1.0 0.382993 28 24 0.857143 128 3
3 ham U dun say so early hor... U c already then say... 116.652500 -1.645000 NaN 2.200000 -4.550000 -3.541818 14.590909 0.500000 ... 5.5 0.500000 1.000000 1.0 0.000000 11 9 0.818182 39 2
4 ham Nah I don't think he goes to usf, he lives aro... 109.040000 1.280000 NaN 5.200000 1.373846 2.276923 13.000000 0.000000 ... 13.0 0.000000 1.000000 1.0 0.000000 13 12 0.923077 49 1

5 rows × 28 columns
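
extract_metrics also accepts a single string instead of a series of texts, which is handy for quick one-off checks. A minimal sketch, reusing the model from above:

# extract readability metrics for a single text; returns a one-row dataframe
td.extract_metrics(
    text="Free entry in 2 a wkly comp to win FA Cup final tkts!",
    spacy_model="en_core_web_sm",
    metrics=["readability"],
)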

The configurable way: Add pipes to spaCy#

If you want to use the components as part of a larger spaCy analysis, simply add them to an already initialized spaCy pipeline and use extract_df on the processed Doc objects.

To do this we will use the small English pipeline. If you haven’t downloaded it yet, it is simple to do:

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # if the model is not downloaded yet, download it
    !python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/readability")
nlp.add_pipe("textdescriptives/dependency_distance")
<textdescriptives.components.dependency_distance.DependencyDistance at 0x7f902ea58d50>

From now on, whenever we pass a document through the spaCy pipeline (nlp), TextDescriptives will add readability and dependency distance metrics to the document.
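
Once the components are added, the metrics are also stored directly on each processed Doc as custom extension attributes. A minimal sketch (the attribute names mirror the component names, per the TextDescriptives docs):

doc = nlp("Free entry in 2 a wkly comp to win FA Cup final tkts!")
# each component stores its metrics as a dict on the Doc
doc._.readability
doc._.dependency_distance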

Let’s pass the messages through the pipeline.

docs = nlp.pipe(df["message"])
# extract the metrics as a dataframe
metrics = td.extract_df(docs, include_text=False)
# join the metrics to the original dataframe
spacy_pipe_metrics_df = df.join(metrics, how="left")
spacy_pipe_metrics_df.head()
label message flesch_reading_ease flesch_kincaid_grade smog gunning_fog automated_readability_index coleman_liau_index lix rix ... sentence_length_median sentence_length_std syllables_per_token_mean syllables_per_token_median syllables_per_token_std n_tokens n_unique_tokens proportion_unique_tokens n_characters n_sentences
0 ham Go until jurong point, crazy.. Available only ... 94.318333 1.760000 3.1291 2.666667 1.449833 4.162000 11.666667 0.333333 ... 5.0 2.357023 1.250000 1.0 0.433013 20 20 1.000000 92 3
1 ham Ok lar... Joking wif u oni... 105.090000 -0.653333 NaN 1.200000 -5.800000 -8.026667 3.000000 0.000000 ... 3.0 1.000000 1.166667 1.0 0.372678 6 6 1.000000 24 2
2 spam Free entry in 2 a wkly comp to win FA Cup fina... 97.654524 1.957143 3.1291 3.733333 3.927024 6.858571 23.619048 1.333333 ... 12.0 5.249339 1.178571 1.0 0.382993 28 24 0.857143 128 3
3 ham U dun say so early hor... U c already then say... 116.652500 -1.645000 NaN 2.200000 -4.550000 -3.541818 14.590909 0.500000 ... 5.5 0.500000 1.000000 1.0 0.000000 11 9 0.818182 39 2
4 ham Nah I don't think he goes to usf, he lives aro... 109.040000 1.280000 NaN 5.200000 1.373846 2.276923 13.000000 0.000000 ... 13.0 0.000000 1.000000 1.0 0.000000 13 12 0.923077 49 1

5 rows × 28 columns

That’s it! Let’s confirm that the two dataframes are identical.

metrics_df.equals(spacy_pipe_metrics_df)
True

Exploratory Data Analysis#

With the metrics extracted, let’s do some quick exploratory data analysis to get a sense of the data.

import seaborn as sns

sns.boxplot(x="label", y="lix", data=metrics_df)
[Figure: box plot of lix score by label (ham vs. spam)]

Let’s run a quick test to see whether any of our metrics correlate strongly with the label.

# encode the label as a boolean
metrics_df["is_ham"] = metrics_df["label"] == "ham"
# compute the correlation between all metrics and the label
metrics_correlations = metrics.corrwith(metrics_df["is_ham"]).sort_values(
    key=abs, ascending=False
)
metrics_correlations[:10]
token_length_std              -0.409000
n_characters                  -0.408325
n_unique_tokens               -0.363069
coleman_liau_index            -0.325169
automated_readability_index   -0.322269
rix                           -0.304089
n_tokens                      -0.298789
token_length_mean             -0.294734
sentence_length_std           -0.290166
dependency_distance_std       -0.285822
dtype: float64

Those are some fairly strong correlations. Let’s plot a few of them!

# plot a kde plot for the top 3 metrics
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 3, figsize=(10, 5), sharey=False)
for i, metric in enumerate(metrics_correlations.index[:3]):
    sns.kdeplot(metrics_df, x=metric, hue="label", ax=ax[i])
[Figure: KDE plots of the three metrics most strongly correlated with the label, split by ham/spam]

Cool! We’ve now done a quick analysis of the SMS dataset and found that, for example, the standard deviation of token length, the number of characters, and the number of unique tokens are distributed differently between the genuine messages and the spam.

Next steps could be to continue the exploratory data analysis or to build a simple classifier using the extracted metrics.
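
As a starting point for the latter, here is a minimal sketch of such a classifier. It assumes scikit-learn is installed (it is not a dependency of this tutorial), and filling NaNs with 0 is a simplification:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# use the extracted metrics as features; some metrics (e.g. smog) are NaN for short texts
X = metrics.fillna(0)
y = metrics_df["is_ham"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")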