Stemming and lemmatization are both techniques used in natural language processing (NLP) to reduce words to their base or root forms. They are commonly applied in text preprocessing tasks to simplify words and improve the efficiency and accuracy of various NLP tasks, such as text classification, information retrieval, and sentiment analysis.
Let's go through examples of stemming and lemmatization using a few words.
Stemming Example: Stemming involves removing prefixes and suffixes from words to get their root form. Here's an example using the Porter stemming algorithm:
Original Words:
Stemmed Words:
In this example, the stemming algorithm has removed the "-ing" suffix from all three words to obtain their root forms.
Lemmatization Example: Lemmatization considers the grammatical context and tries to convert words to their base dictionary form. Here's an example using lemmatization with part of speech tagging:
Original Words:
Lemmatized Words:
In this example, the lemmatization process has recognized the part of speech of each word (verb in this case) and converted them to their base forms.
To illustrate the difference further, let's consider a word that behaves differently under stemming and lemmatization due to its part of speech:
Original Word:
Stemmed Word (using Porter Stemming):
Lemmatized Word (considering as adjective):
In this example, stemming doesn't modify the word "better," while lemmatization, when considering it as an adjective, converts it to the base form "good."
Keep in mind that both stemming and lemmatization are used to simplify words, but lemmatization typically considers linguistic information like part of speech, making it more accurate for many applications. Stemming is simpler and faster but can sometimes yield non-real words. The choice between these techniques depends on the specific NLP task and the trade-off between simplicity and accuracy.