A stemming algorithm is a technique used in Information Retrieval (IR) and in other applications of Natural Language Processing (NLP). It removes suffixes from a word to obtain a stem, or base form, that can be matched easily against databases or documents (Jurafsky 2000). Its use rests on the premise that two words with the same stem have very similar semantic content. Because a word can appear in many variants (derivatives, inflected forms, changes of gender and number, and other phenomena), it is advisable to group all the variants under a common stem. Applications that do not take these effects into account may have difficulty comparing queries with documents, or may find a word's frequency count dispersed across its variants.

The Improved Snowball Spanish Stemming Algorithm
This work is based on the Spanish stemming algorithm published by the Snowball project (Snowball 1999). The algorithm starts by extracting regions from a word and labeling them RV and R2. RV is defined as the region of the word that starts after the third letter, or null if no such region exists. Defining R2 requires first defining R1. R1 is the region after the first non-vowel following a vowel, or null if no such non-vowel exists. For example, in the word precios, the first non-vowel following a vowel is the c. Therefore, R1 is ios. Similarly, in the word bellísimo, the first non-vowel following a vowel is the first l, so R1 is lísimo.

R2, on the other hand, is the region that starts after the first non-vowel following a vowel in R1, or null if no such non-vowel exists. In the first example, R2 is null, since no letter follows the s, which is the first non-vowel following a vowel in R1. In the second example, R2 is imo.
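
The R1 and R2 definitions above can be sketched in a few lines of code. The following is a minimal illustration in Python, chosen here only for brevity; it is not the implementation described in this work. The function names region_after_vc and r1_r2 are hypothetical, the vowel set (including accented vowels and ü) is an assumption taken from the Snowball Spanish stemmer, and the null region is represented as an empty string.

```python
# Assumed vowel set for Spanish (accented vowels and ü included).
VOWELS = set("aeiouáéíóúü")


def region_after_vc(text):
    """Return the part of `text` after the first non-vowel that follows a vowel.

    Returns the empty string (the "null" region) if no such non-vowel exists.
    """
    for i in range(1, len(text)):
        if text[i] not in VOWELS and text[i - 1] in VOWELS:
            return text[i + 1:]
    return ""


def r1_r2(word):
    """Compute (R1, R2) for `word`: R2 is the same rule applied inside R1."""
    word = word.lower()
    r1 = region_after_vc(word)
    r2 = region_after_vc(r1)
    return r1, r2


if __name__ == "__main__":
    for w in ("precios", "bellísimo"):
        print(w, r1_r2(w))
    # precios ('ios', '')
    # bellísimo ('lísimo', 'imo')
```

Both results agree with the worked examples in the text. The sketch assumes that accented letters are stored as single precomposed characters; input in a decomposed Unicode form would need to be normalized first.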
