textdescriptives 2.8.4 documentation

Getting Started

  • Installation
  • Quick Start
  • Using Specific Components
  • Available Attributes
  • Tutorials
    • Introductory Tutorial
    • Filtering corpora using Quality
    • note that this can take a little while
    • All of the dataset is available in the train split
    • We can take a look at one of the examples:
    • we can filter out these three datasets based on the “source”
    • 1. Create a blank spaCy model with a sentencizer, as that’s the only component required for the quality metrics
    • however, it might be worth filtering out very long documents beforehand
    • 2. Add the textdescriptives pipeline
    • 3. Apply the pipeline to the legal documents
    • 4. Filter out the documents that do not pass the quality check
    • first we apply the pipeline to the other domains
    • extract alpha ratio:
    • histogram
    • add labels
    • examine the first 100 tokens in the first document
    • Scikit-learn Integration
  • News and Changelog
  • Frequently Asked Questions

Components

  • Descriptive Statistics
  • Readability
  • Dependency Distance
  • Part-of-Speech Proportions
  • Quality
  • Coherence
  • Information Theory
  • Extractor
  • GitHub Repository