Tutorials
To get started using the package, we recommend going through the tutorials in the order listed below. Each tutorial is also a Jupyter notebook which you can download and run locally.
- Introductory Tutorial
- Filtering corpora using Quality
- note that this can take a little while
- All of the dataset is available in the train split
- We can take a look at one of the examples:
- we can filter out these three datasets based on the “source”
- 1. Create a blank spaCy model with a sentencizer, as that’s the only component required for the quality metrics
- however, very long documents might be worth filtering out beforehand
- 2. Add the textdescriptives pipeline
- 3. Apply the pipeline to the legal documents
- 4. Filter out the documents that do not pass the quality
- first we apply the pipeline to the other domains
- extract alpha ratio:
- histogram
- add labels
- examine the first 100 tokens in the first document
- Scikit-learn Integration
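
As a rough illustration of the idea behind the "extract alpha ratio" and "filter out the documents" steps listed above, here is a minimal pure-Python sketch. It is not the package's implementation: the tutorial itself works through a spaCy pipeline, whereas this sketch splits on whitespace. The function names `alpha_ratio` and `filter_by_alpha_ratio`, the sample texts, and the `min_ratio=0.7` cut-off are all assumptions chosen for illustration.

```python
from typing import Iterable, List


def alpha_ratio(text: str) -> float:
    """Share of whitespace-separated tokens with at least one alphabetic character.

    Simplified stand-in: a real pipeline would use a proper tokenizer.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    alpha_tokens = sum(1 for tok in tokens if any(ch.isalpha() for ch in tok))
    return alpha_tokens / len(tokens)


def filter_by_alpha_ratio(texts: Iterable[str], min_ratio: float = 0.7) -> List[str]:
    """Keep only texts whose alpha ratio is at least `min_ratio` (illustrative threshold)."""
    return [t for t in texts if alpha_ratio(t) >= min_ratio]


# Hypothetical sample documents, not taken from the tutorial's dataset.
docs = [
    "The court ruled in favour of the plaintiff.",
    "12 34 56 78 --- ### 90",
    "Section 4.2 applies to 3 of the 5 cases.",
]

kept = filter_by_alpha_ratio(docs)
for text in docs:
    print(f"{alpha_ratio(text):.2f}  {text}")
```

Documents dominated by numbers or symbols score low and are dropped, while ordinary prose scores close to 1. The threshold is a judgment call that depends on the domain, which is why the tutorial compares the distribution across domains before filtering.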