Stemming and Lemmatization


Stemming and lemmatization are both techniques used in natural language processing (NLP) to reduce words to their base or root forms. They are commonly applied in text preprocessing tasks to simplify words and improve the efficiency and accuracy of various NLP tasks, such as text classification, information retrieval, and sentiment analysis.

Let's go through examples of stemming and lemmatization using a few words.

Stemming Example: Stemming involves removing prefixes and suffixes from words to get their root form. Here's an example using the Porter stemming algorithm:

Original Words:

  1. Running
  2. Jumps
  3. Swimming

Stemmed Words:

  1. Running -> Run
  2. Jumps -> Jump
  3. Swimming -> Swim

In this example, the stemming algorithm has removed the "-ing" suffix from all three words to obtain their root forms.

Lemmatization Example: Lemmatization considers the grammatical context and tries to convert words to their base dictionary form. Here's an example using lemmatization with part of speech tagging:

Original Words:

  1. Running
  2. Jumps
  3. Swimming

Lemmatized Words:

  1. Running -> Run
  2. Jumps -> Jump
  3. Swimming -> Swim

In this example, the lemmatization process has recognized the part of speech of each word (verb in this case) and converted them to their base forms.

To illustrate the difference further, let's consider a word that behaves differently under stemming and lemmatization due to its part of speech:

Original Word:

  1. Better

Stemmed Word (using Porter Stemming):

  1. Better -> Better

Lemmatized Word (considering as adjective):

  1. Better -> Good

In this example, stemming doesn't modify the word "better," while lemmatization, when considering it as an adjective, converts it to the base form "good."

Keep in mind that both stemming and lemmatization are used to simplify words, but lemmatization typically considers linguistic information like part of speech, making it more accurate for many applications. Stemming is simpler and faster but can sometimes yield non-real words. The choice between these techniques depends on the specific NLP task and the trade-off between simplicity and accuracy.

Stemming and Lemmatization


Enroll Now

  • Python Programming
  • Machine Learning