Introduction To Removing Stop Words Using NLTK in NLP
In this article we discuss removing stop words using NLTK. When analyzing text, we want to weigh the importance of each word in a sentence, but English has many filler words that appear very frequently, such as “and”, “the”, “is”, “at”, “are”, and “a”. When doing statistics on text, these words introduce a lot of noise because they occur far more often than other words. That is why we want to remove them. Such words are called “stop words”: words you typically filter out before doing any statistical analysis.
NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from the text data in a smart and efficient manner.
But there’s no standard list of stop words that is appropriate for all applications. The list of words to ignore can vary depending on your application.
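Because the right list depends on the task, it is common to start from a base list and then add or remove entries. Here is a minimal sketch using plain Python sets; the small hand-rolled base list is only a stand-in for NLTK's full list, and the example words are hypothetical:

```python
# A small stand-in stop-word list (NLTK's real list is much longer).
base_stop_words = {"a", "an", "the", "is", "and", "not"}

# Copy before editing so the base list stays intact.
custom_stop_words = set(base_stop_words)

# For sentiment analysis, "not" flips meaning, so we may want to keep it.
custom_stop_words.discard("not")

# Add domain-specific filler words that carry no signal in our corpus.
custom_stop_words.update({"please", "thanks"})

print(sorted(custom_stop_words))
```

The same `discard`/`update` calls work on a set built from `stopwords.words("english")` once it is loaded.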
Tutorial
NLTK ships a predefined list of stop words as part of its corpus data, which has to be downloaded once:
import nltk
nltk.download("stopwords")
Once downloaded, the stop words can be imported directly from the nltk.corpus package:
from nltk.corpus import stopwords
print(stopwords.words("english"))
Output:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
import nltk  # word_tokenize below also needs tokenizer data: nltk.download("punkt")

# Build a set from NLTK's predefined stop word list for fast membership tests.
stop_words = set(stopwords.words("english"))
sentence = 'Backgammon is one of the oldest known board games.'
words = nltk.word_tokenize(sentence)
print(words)
Output:
['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']
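It is worth noting why nltk.word_tokenize is used instead of a plain str.split(): split() only cuts on whitespace, so punctuation stays glued to words and 'games.' would never match a vocabulary entry. A quick comparison:

```python
sentence = 'Backgammon is one of the oldest known board games.'

# str.split() only cuts on whitespace, so the final period stays
# attached to the last word.
split_tokens = sentence.split()
print(split_tokens)
# ['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games.']

# nltk.word_tokenize, used above, instead yields 'games' and '.' as
# separate tokens, so 'games' can be matched on its own.
```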
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)
Output:
['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']
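One pitfall worth noting: the membership test is case-sensitive, so a capitalized stop word like 'The' at the start of a sentence would not be removed. Lower-casing each token before the check is a common fix; here is a sketch with a small stand-in stop list (not NLTK's full list):

```python
stop_words = {"the", "is", "of"}  # stand-in for set(stopwords.words("english"))

words = ['The', 'game', 'is', 'one', 'of', 'the', 'oldest', '.']

# Compare the lower-cased token, but keep the original casing in the output.
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # ['game', 'one', 'oldest', '.']
```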
Conclusion
In this article, we looked at what stop words are in NLP and how to remove them with NLTK's predefined list. Filtering out these low-information words keeps only the words that matter for analysis and also reduces the size of the data.