Introductory Tutorial#

TextDescriptives lets you quickly and easily calculate a large variety of statistics and metrics from texts.

The package includes a number of components that make it easy to extract exactly the metrics you care about, and nothing more.
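
For example, you can check which metric groups your installed version can extract before choosing. A minimal sketch, assuming the package exposes the get_valid_metrics helper at the top level, as in recent versions:

import textdescriptives as td

# the set of extractable metric groups, e.g. {"readability", "dependency_distance", ...}
td.get_valid_metrics()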

This tutorial introduces some of the components available in TextDescriptives and shows how you can use them to quickly analyse a text corpus. For more information on the components, see the documentation.

Using TextDescriptives#

In this tutorial we’ll use TextDescriptives to get a quick overview of the SMS Spam Collection Data Set. The dataset contains 5,572 SMS messages labelled as either ham (legitimate messages) or spam.

To start, let’s load a dataset and get a bit familiar with it.

try:
    import textdescriptives
except ImportError:
    # install the package (with the tutorial extras) if it is missing
    !pip install "textdescriptives[tutorials]"
import textdescriptives as td
import spacy

from textdescriptives.utils import load_sms_data

df = load_sms_data()
df.head()
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
df["label"].value_counts()
label
ham     4825
spam     747
Name: count, dtype: int64

The easy way: Extract metrics with extract_metrics#

Let’s extract readability and dependency distance metrics using the extract_metrics function, which handles a lot of boilerplate spaCy code for us. If you haven’t used spaCy before, you can set the lang parameter to the language of your texts instead of specifying spacy_model; TextDescriptives will then automatically download and use an appropriate model for you.

# assumes you have downloaded the `en_core_web_sm` model
metrics = td.extract_metrics(
    text=df["message"],
    spacy_model="en_core_web_sm",
    metrics=["readability", "dependency_distance"],
)
# alternatively, you can specify the language and (optionally) the model size
# metrics = td.extract_metrics(text=df["message"], lang="en", spacy_model_size="sm", metrics=["readability", "dependency_distance"])

# join the metrics to the original dataframe to get the label
metrics_df = df.join(metrics.drop(columns=["text"]))
metrics_df.head()
label message flesch_reading_ease flesch_kincaid_grade smog gunning_fog automated_readability_index coleman_liau_index lix rix ... sentence_length_median sentence_length_std syllables_per_token_mean syllables_per_token_median syllables_per_token_std n_tokens n_unique_tokens proportion_unique_tokens n_characters n_sentences
0 ham Go until jurong point, crazy.. Available only ... 94.318333 1.760000 3.1291 2.666667 1.449833 4.162000 11.666667 0.333333 ... 5.0 2.357023 1.250000 1.0 0.433013 20 20 1.000000 92 3
1 ham Ok lar... Joking wif u oni... 105.090000 -0.653333 NaN 1.200000 -5.800000 -8.026667 3.000000 0.000000 ... 3.0 1.000000 1.166667 1.0 0.372678 6 6 1.000000 24 2
2 spam Free entry in 2 a wkly comp to win FA Cup fina... 97.654524 1.957143 3.1291 3.733333 3.927024 6.858571 23.619048 1.333333 ... 12.0 5.249339 1.178571 1.0 0.382993 28 24 0.857143 128 3
3 ham U dun say so early hor... U c already then say... 116.652500 -1.645000 NaN 2.200000 -4.550000 -3.541818 14.590909 0.500000 ... 5.5 0.500000 1.000000 1.0 0.000000 11 9 0.818182 39 2
4 ham Nah I don't think he goes to usf, he lives aro... 109.040000 1.280000 NaN 5.200000 1.373846 2.276923 13.000000 0.000000 ... 13.0 0.000000 1.000000 1.0 0.000000 13 12 0.923077 49 1

5 rows × 28 columns
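
extract_metrics also accepts a single string instead of a series of texts, which is handy for quick one-off checks. A minimal sketch, reusing the model from above:

# extract readability metrics for a single text; returns a one-row dataframe
td.extract_metrics(
    text="Free entry in 2 a wkly comp to win FA Cup final tkts!",
    spacy_model="en_core_web_sm",
    metrics=["readability"],
)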

The configurable way: Add pipes to spaCy#

If you want to use the components as part of a larger spaCy analysis, simply add them to an already initialized spaCy pipeline and use extract_df on the processed Doc objects.

To do this we will use the small English pipeline. If you haven’t downloaded it yet, it is simple to do:

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # if the model is not downloaded yet, download it
    !python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/readability")
nlp.add_pipe("textdescriptives/dependency_distance")
<textdescriptives.components.dependency_distance.DependencyDistance at 0x7f902ea58d50>

From now on, whenever we pass a document through the spaCy pipeline (nlp), TextDescriptives will add readability and dependency distance metrics to the document.
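
Once the components are added, the metrics are also stored directly on each processed Doc as custom extension attributes. A minimal sketch (the attribute names mirror the component names, per the TextDescriptives docs):

doc = nlp("Free entry in 2 a wkly comp to win FA Cup final tkts!")
# each component stores its metrics as a dict on the Doc
doc._.readability
doc._.dependency_distance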

Let’s pass the messages through the pipeline.

docs = nlp.pipe(df["message"])
# extract the metrics as a dataframe
metrics = td.extract_df(docs, include_text=False)
# join the metrics to the original dataframe
spacy_pipe_metrics_df = df.join(metrics, how="left")
spacy_pipe_metrics_df.head()
label message flesch_reading_ease flesch_kincaid_grade smog gunning_fog automated_readability_index coleman_liau_index lix rix ... sentence_length_median sentence_length_std syllables_per_token_mean syllables_per_token_median syllables_per_token_std n_tokens n_unique_tokens proportion_unique_tokens n_characters n_sentences
0 ham Go until jurong point, crazy.. Available only ... 94.318333 1.760000 3.1291 2.666667 1.449833 4.162000 11.666667 0.333333 ... 5.0 2.357023 1.250000 1.0 0.433013 20 20 1.000000 92 3
1 ham Ok lar... Joking wif u oni... 105.090000 -0.653333 NaN 1.200000 -5.800000 -8.026667 3.000000 0.000000 ... 3.0 1.000000 1.166667 1.0 0.372678 6 6 1.000000 24 2
2 spam Free entry in 2 a wkly comp to win FA Cup fina... 97.654524 1.957143 3.1291 3.733333 3.927024 6.858571 23.619048 1.333333 ... 12.0 5.249339 1.178571 1.0 0.382993 28 24 0.857143 128 3
3 ham U dun say so early hor... U c already then say... 116.652500 -1.645000 NaN 2.200000 -4.550000 -3.541818 14.590909 0.500000 ... 5.5 0.500000 1.000000 1.0 0.000000 11 9 0.818182 39 2
4 ham Nah I don't think he goes to usf, he lives aro... 109.040000 1.280000 NaN 5.200000 1.373846 2.276923 13.000000 0.000000 ... 13.0 0.000000 1.000000 1.0 0.000000 13 12 0.923077 49 1

5 rows × 28 columns

That’s it! Let’s confirm that the two dataframes are identical.

metrics_df.equals(spacy_pipe_metrics_df)
True

Exploratory Data Analysis#

With the metrics extracted, let’s do some quick exploratory data analysis to get a sense of the data.

import seaborn as sns

sns.boxplot(x="label", y="lix", data=metrics_df)
[Figure: box plot of lix score by label (ham vs. spam)]

Let’s run a quick test to see whether any of our metrics correlate strongly with the label.

# encode the label as a boolean
metrics_df["is_ham"] = metrics_df["label"] == "ham"
# compute the correlation between all metrics and the label
metrics_correlations = metrics.corrwith(metrics_df["is_ham"]).sort_values(
    key=abs, ascending=False
)
metrics_correlations[:10]
token_length_std              -0.409000
n_characters                  -0.408325
n_unique_tokens               -0.363069
coleman_liau_index            -0.325169
automated_readability_index   -0.322269
rix                           -0.304089
n_tokens                      -0.298789
token_length_mean             -0.294734
sentence_length_std           -0.290166
dependency_distance_std       -0.285822
dtype: float64

Those are some fairly strong correlations. Let’s plot a few of them!

# plot a kde plot for the top 3 metrics
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 3, figsize=(10, 5), sharey=False)
for i, metric in enumerate(metrics_correlations.index[:3]):
    sns.kdeplot(metrics_df, x=metric, hue="label", ax=ax[i])
[Figure: KDE plots of the three metrics most strongly correlated with the label, split by ham/spam]

Cool! We’ve now done a quick analysis of the SMS dataset and found that, for example, the standard deviation of token length, the number of characters, and the number of unique tokens are distributed differently between the genuine messages and the spam.

Next steps could be to continue the exploratory data analysis or to build a simple classifier using the extracted metrics.
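
As a starting point for the latter, here is a minimal sketch of such a classifier. It assumes scikit-learn is installed (it is not a dependency of this tutorial), and filling NaNs with 0 is a simplification:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# use the extracted metrics as features; some metrics (e.g. smog) are NaN for short texts
X = metrics.fillna(0)
y = metrics_df["is_ham"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")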