Text Preprocessing: Stop Words and Stemming
Before text can be vectorized for machine learning, it is usually cleaned: common words that carry little signal (stop words) are removed, and words are reduced to a root form (stemming or lemmatization). These steps shrink the vocabulary and often help models generalize, because inflected variants of a word collapse into a single feature.
Removing Stop Words
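Stop words such as "the", "is", "at", and "which" appear in nearly every document, so they do little to distinguish one document from another. Both NLTK and sklearn's CountVectorizer ship with built-in English stop word lists. The two lists are not identical, so results can differ slightly depending on which one is used.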
Stop Word Removal with NLTK and sklearn
<pre><code class="language-python">import nltk
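from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# One-time download of the NLTK stop word corpus
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))  # a set gives fast membership tests

text = "The quick brown fox jumps over the lazy dog"
tokens = text.lower().split()  # lowercase first: the NLTK list is lowercase
filtered = [w for w in tokens if w not in stop_words]
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

# Or let CountVectorizer lowercase, tokenize, and filter in one step
# (this uses sklearn's own English list, not NLTK's)
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform([text])
print(vec.get_feature_names_out())
# ['brown' 'dog' 'fox' 'jumps' 'lazy' 'quick']</code></pre>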
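One caveat with a plain `split()`: it leaves punctuation attached to tokens, so "dog," would not match "dog" and a stop word followed by a period would slip through. The following is a minimal sketch, using only the standard library, of a filter that tokenizes with a regular expression first. The tiny `STOP_WORDS` set and the `remove_stop_words` helper are illustrative stand-ins, not part of NLTK or sklearn; in practice one of the built-in lists would take their place.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only;
# swap in stopwords.words("english") or sklearn's list in real use.
STOP_WORDS = {"the", "is", "at", "which", "it", "or", "over"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, extract runs of letters/apostrophes, drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]


sentence = "Is it the lazy dog, or the fox?"
print(sentence.lower().split())     # punctuation stays attached: 'dog,' and 'fox?'
print(remove_stop_words(sentence))  # ['lazy', 'dog', 'fox']
```

CountVectorizer's default tokenizer already strips punctuation in a similar way, which is one reason to prefer it over manual splitting.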
Stemming and Lemmatization
Stemming applies heuristic suffix-stripping rules to cut a word down to a stem, which need not be a real word ("running" → "run", "studies" → "studi"). Lemmatization uses a vocabulary and morphological analysis to return the dictionary base form ("studies" → "study"), which gives more interpretable results at the cost of a dictionary lookup.
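To make the trade-off concrete, here is a small sketch using NLTK's PorterStemmer, which is rule-based and needs no corpus download. It groups words by stem to show how related but distinct words can collapse into a single stem that is not itself a word. The `group_by_stem` helper is a hypothetical name introduced only for this illustration.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words):
    """Map each Porter stem to the list of input words that produce it."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


groups = group_by_stem(["universe", "university", "universal"])
print(groups)
# {'univers': ['universe', 'university', 'universal']}
```

Whether this kind of conflation helps or hurts depends on the task: it reduces the vocabulary, but it also erases distinctions a model might need.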
Stemming with PorterStemmer
<pre><code class="language-python">from nltk.stem import PorterStemmer

# PorterStemmer is purely rule-based, so it needs no corpus download
stemmer = PorterStemmer()
words = ["running", "studies", "historical", "generously"]
print([stemmer.stem(w) for w in words])
# ['run', 'studi', 'histor', 'gener']</code></pre>
Lemmatization with WordNetLemmatizer
<pre><code class="language-python">import nltk
from nltk.stem import WordNetLemmatizer

# One-time download of the WordNet data the lemmatizer looks words up in
nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies", pos="v"))     # study
print(lemmatizer.lemmatize("historical", pos="a"))  # historical
# pos defaults to 'n' (noun); pass 'n', 'v', 'a', or 'r' to match the word's
# part of speech, otherwise verbs and adjectives may come back unchanged</code></pre>
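Putting the pieces together, the sketch below shows one way stop word removal and stemming can shrink a vocabulary before vectorizing. It assumes a scikit-learn version that exposes `vocabulary_` and `build_analyzer` on CountVectorizer; the `stemmed_analyzer` function and the two sample sentences are made up for illustration. It reuses CountVectorizer's own analyzer so that lowercasing, tokenizing, and stop word filtering all happen before the stemmer runs.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cats are running", "A cat runs"]  # made-up sample documents

# Baseline: stop word removal only
plain = CountVectorizer(stop_words="english")
plain.fit(docs)

# Reuse sklearn's built-in analyzer (lowercase, tokenize, drop stop words),
# then stem each surviving token.
base_analyzer = CountVectorizer(stop_words="english").build_analyzer()
stemmer = PorterStemmer()


def stemmed_analyzer(doc):
    """Hypothetical helper: sklearn's default analysis followed by stemming."""
    return [stemmer.stem(token) for token in base_analyzer(doc)]


stemmed = CountVectorizer(analyzer=stemmed_analyzer)
stemmed.fit(docs)

print(sorted(plain.vocabulary_))    # ['cat', 'cats', 'running', 'runs']
print(sorted(stemmed.vocabulary_))  # ['cat', 'run']
```

Passing a callable as `analyzer` replaces sklearn's whole preprocessing chain, which is why the stop word filtering is done inside `base_analyzer` instead of through the `stop_words` argument of the second vectorizer.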