Stemming example

In this example, a user wants stemming only for the words take, ride and walk.

  • The acceptable forms of take are take, taken, taking.
  • The acceptable forms of walk are walk, walking, walked.
  • The acceptable forms of ride are ride, ridden, riding.

The stem dictionary (.STM file) might contain the following setup:

  • tak 1 (words with this stem will use the first suffix option)
  • rid 1
  • walk 2 (words with this stem will use the second suffix option)

The suffix dictionary (.SUF file) would contain:

  • e, en, ing, den (first suffix option, used by tak and rid)
  • ing, ed (second suffix option, used by walk)

Based on the stem and suffix dictionaries:

  • take, taken, taking would result in tak.
  • walk, walking, walked would result in walk.
  • ride, ridden, riding would result in rid.
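
The lookup described above can be sketched in Python. This is only an illustration under stated assumptions, not IR Expert's actual implementation: the in-memory layout of the two dictionaries, the function name stem_word, and the rule that a bare stem such as walk matches without any suffix are all hypothetical.

```python
# Hypothetical in-memory form of the .STM file: stem -> 1-based suffix option.
STEMS = {"tak": 1, "rid": 1, "walk": 2}

# Hypothetical in-memory form of the .SUF file: one suffix list per option.
SUFFIXES = [
    ["e", "en", "ing", "den"],  # option 1
    ["ing", "ed"],              # option 2
]


def stem_word(word, stems=STEMS, suffixes=SUFFIXES):
    """Return the stem if word is a known stem plus an allowed suffix.

    Words that match no stem/suffix combination are returned unchanged.
    """
    lowered = word.lower()
    for stem, option in stems.items():
        if not lowered.startswith(stem):
            continue
        ending = lowered[len(stem):]
        # Assumption: a bare stem (for example "walk") matches with no suffix.
        if ending == "" or ending in suffixes[option - 1]:
            return stem
    return word


print(stem_word("Taken"))    # tak
print(stem_word("walked"))   # walk
print(stem_word("ridden"))   # rid
print(stem_word("talking"))  # talking (no matching stem, left unchanged)
```

Because the suffix list is chosen by the stem's index alone, any stem that points to option 1 accepts every suffix in that list.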

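This configuration is not perfect. Because tak and rid use the same suffix index, each stem accepts every suffix in the first option. For example, IR Expert would also change the non-word takden (tak + den) to tak, and riden (rid + en) to rid.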

You could change the configuration so that the stem dictionary (.STM file) contained:

  • tak 1
  • rid 3 (words with this stem will use the third suffix option)
  • walk 2

and the suffix dictionary (.SUF file) contained:

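  • e, en, ing (first suffix option, used by tak)
  • ing, ed (second suffix option, used by walk)
  • e, den, ing (third suffix option, used by rid)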
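
One way to check a configuration like this is to list every word form each stem would accept. The following Python sketch does that for the revised files; the in-memory layout and the helper name accepted_forms are assumptions made for illustration and are not part of IR Expert.

```python
def accepted_forms(stems, suffixes):
    """Map each stem to the word forms its suffix option allows."""
    return {
        stem: [stem + suffix for suffix in suffixes[option - 1]]
        for stem, option in stems.items()
    }


# Hypothetical in-memory form of the revised .STM and .SUF contents.
revised_stems = {"tak": 1, "rid": 3, "walk": 2}
revised_suffixes = [
    ["e", "en", "ing"],   # option 1
    ["ing", "ed"],        # option 2
    ["e", "den", "ing"],  # option 3
]

forms = accepted_forms(revised_stems, revised_suffixes)
for stem, words in forms.items():
    print(stem, "->", ", ".join(words))
# tak -> take, taken, taking
# rid -> ride, ridden, riding
# walk -> walking, walked
```

With tak and rid on separate suffix options, each stem produces only the forms listed as acceptable at the start of this example.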

Note: Setting up these language support files takes a considerable amount of time. It should be undertaken only by someone who is fluent in the language and knowledgeable about word components and pronunciation.

Related topics

IR Expert
How IR Expert evaluates documents for relevance
IR Expert file descriptions
Normals dictionary
Stem dictionary
Stop word file
Suffix file
Creating an IR file