
IR Expert file descriptions

The following table summarizes information about the IR Expert files.

Description: stop words
Purpose: Required user file maintained by the Service Manager administrator. Contains words that have little or no value to the information retrieval process. For example, prepositions are stop words. You can add or delete stop words as necessary. Changes take effect when you restart Service Manager and regenerate the indexes.
Naming convention: [ir_languagefiles_path]language.stp, where ir_languagefiles_path and language correspond to start-up parameters.

Description: stem dictionary
Purpose: Required system file for languages other than English and German. Contains word stems from which derivative words are formed, allowing IR Expert to match closely related words. Maintained exclusively by IR Expert.
Naming convention: [ir_languagefiles_path]language.stm, where ir_languagefiles_path and language correspond to start-up parameters.

Description: suffix dictionary
Purpose: Required system file for languages other than English and German. Contains suffix templates used in stemming. Maintained exclusively by IR Expert.
Naming convention: [ir_languagefiles_path]language.suf, where ir_languagefiles_path and language correspond to start-up parameters.

Description: normals dictionary
Purpose: Required if the language employs special keyboard characters. You can add or delete normalization characters as necessary. Changes take effect when you restart Service Manager and regenerate the indexes. The excerpt below shows a typical normalization file. The first two characters of each line become substitutions for the following character or comma-separated characters (in decimal notation).

ae 132,142
oe 148,153
ss 225
ue 129,154

Naming convention: [ir_languagefiles_path]language.nor, where ir_languagefiles_path and language correspond to start-up parameters.

Normals dictionary

The normals dictionary, [ir_languagefiles_path]language.NOR, is used only when the language contains characters that IR Expert transforms into other characters. For example, in German, IR Expert replaces umlauted characters with two-letter equivalents, so “ä” becomes “ae”. Normalizing characters in this way can make it easier to set up the stem (.STM) and suffix (.SUF) dictionary files.
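
The substitution rule can be illustrated with a short Python sketch. It is not IR Expert's implementation: the function names are hypothetical, and it assumes the decimal codes in the excerpt are code page 437 values, which this topic does not state.

```python
# Sketch only: applies a normals (.nor) file to a word.
# ASSUMPTION: decimal codes are code page 437 values (not confirmed by the documentation).

NOR_EXCERPT = """\
ae 132,142
oe 148,153
ss 225
ue 129,154
"""

def parse_normals(text, codepage="cp437"):
    """Map each special character to the substitution at the start of its line."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        substitute, codes = line.split(None, 1)
        for code in codes.split(","):
            # Turn the decimal code into the character it represents.
            char = bytes([int(code)]).decode(codepage)
            table[char] = substitute
    return table

def normalize(word, table):
    """Replace every listed character; leave all other characters unchanged."""
    return "".join(table.get(ch, ch) for ch in word)

table = parse_normals(NOR_EXCERPT)
print(normalize("Käse", table))    # Kaese
print(normalize("Straße", table))  # Strasse
```

Under that code page assumption, both the lowercase and uppercase forms of a character map to the same lowercase substitution, because each line lists both codes.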

Stem dictionary

The stem dictionary, [ir_languagefiles_path]language.STM, contains stems. A stem is the part of a term that is used in the IR indices. IR Expert considers each word to have a stem (defined in the .STM file) and many possible suffixes (defined in the .SUF file). For example, for the words go and going, “go” is the stem and “ing” is the suffix. Each entry in the .STM file consists of the stem word, followed by a blank, followed by an index into the suffix file (.SUF). For example, the entry for go might be “go 1”. The index indicates which suffix values are acceptable for the stem word.

Stemming example

In this example, a user wants stemming only for the words take, ride and walk.

  • The acceptable forms of take are take, taken, taking.
  • The acceptable forms of walk are walk, walking, walked.
  • The acceptable forms of ride are ride, ridden, riding.

The stem dictionary (.STM file) might contain the following setup:

  • tak 1 (words with this stem will use the first suffix option)
  • rid 1
  • walk 2 (words with this stem will use the second suffix option)

The suffix dictionary (.SUF file) would contain:

  • e, en, ing, den
  • ing, ed

Based on the stem and suffix dictionaries:

Take, taken, taking would result in tak.
Walk, walking, walked would result in walk.
Ride, ridden, riding would result in rid.
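
The lookup described above can be sketched in Python. This is an illustration, not IR Expert's algorithm: the names STEMS, SUFFIXES, and find_stem are hypothetical, and the sketch assumes that a word equal to its stem (such as walk) is accepted without a suffix, which the results above imply.

```python
# Sketch only: stem lookup driven by a stem dictionary and a suffix dictionary.

# Stem dictionary: stem -> 1-based index of a line in the suffix dictionary.
STEMS = {"tak": 1, "rid": 1, "walk": 2}

# Suffix dictionary: one list of acceptable suffixes per line.
SUFFIXES = [
    ["e", "en", "ing", "den"],  # line 1
    ["ing", "ed"],              # line 2
]

def find_stem(word, stems, suffixes):
    """Return the stem for word, or None if no stem and suffix combination matches."""
    word = word.lower()
    for stem, index in stems.items():
        if not word.startswith(stem):
            continue
        ending = word[len(stem):]
        # ASSUMPTION: the bare stem is valid even when no empty suffix is listed.
        if ending == "" or ending in suffixes[index - 1]:
            return stem
    return None

for word in ("Take", "taken", "taking", "Walk", "walking", "walked",
             "Ride", "ridden", "riding"):
    print(word, "->", find_stem(word, STEMS, SUFFIXES))
```

A word with no matching stem, such as talk, returns None in this sketch and would be left unstemmed.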

These files are not perfect. Because tak and rid share the same suffix index, every suffix on that line is accepted for both stems. For example, IR Expert would also reduce a non-word such as takden to tak.

You could change the configuration so that the stem dictionary (.STM file) contained:

  • tak 1
  • rid 3 (words with this stem will use the third suffix option)
  • walk 2

and the suffix dictionary (.SUF file) contained:

  • e, en, ing
  • ing, ed
  • e, den, ing
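
The effect of giving rid its own suffix line can be checked with a small Python comparison. The accepts function and the two configuration tuples are hypothetical names for this sketch, and takden and riden are made-up non-words used only to probe each setup.

```python
# Sketch only: compare which word forms each configuration accepts.

def accepts(word, stems, suffixes):
    """True if word is a known stem followed by a suffix from that stem's line."""
    for stem, index in stems.items():
        if word.startswith(stem) and word[len(stem):] in suffixes[index - 1]:
            return True
    return False

# First setup: tak and rid share suffix line 1.
shared = ({"tak": 1, "rid": 1, "walk": 2},
          [["e", "en", "ing", "den"], ["ing", "ed"]])

# Revised setup: rid uses its own suffix line 3.
separate = ({"tak": 1, "rid": 3, "walk": 2},
            [["e", "en", "ing"], ["ing", "ed"], ["e", "den", "ing"]])

for word in ("taken", "ridden", "takden", "riden"):
    print(word, accepts(word, *shared), accepts(word, *separate))
```

Both setups accept taken and ridden. Only the first setup accepts takden and riden, because the shared line offers every suffix to both stems.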

Note: Setting up these language support files requires a considerable amount of time. Only someone who is fluent in the language and knowledgeable about word components and pronunciation should undertake it.

Stop word file

The stop word file, [ir_languagefiles_path]language.stp, contains a list of words that occur too frequently to be worth indexing, because indexing them would hinder the identification of relevant documents. Edit this list as you would a text file. Type each word on a separate line.

The words in the file go through a stemming process, which eliminates the need to specify all the forms for the word. For example, in English, you do not have to enter both "go" and "going" into the stop words file because the stemming algorithm changes "going" to "go." The only word that must be entered in the .STP file is the word "go."
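
The following Python sketch shows why a single entry covers several word forms. The toy_stem function is a hypothetical stand-in that only strips a trailing "ing"; it is not IR Expert's English stemming algorithm, and the stop word list shown is an invented sample.

```python
# Sketch only: stop word filtering where both the list and the text are stemmed.

def toy_stem(word):
    """Hypothetical stand-in for the real stemmer: strips a trailing 'ing'."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 4:
        return word[:-3]
    return word

# Sample .stp content, one word per line (invented for this sketch).
STP_CONTENT = "go\nthe\nof\n"

# Stem each stop word once, so that every form of the word is covered.
stop_stems = {toy_stem(line.strip())
              for line in STP_CONTENT.splitlines() if line.strip()}

def index_terms(text):
    """Return the words whose stem is not a stop word."""
    return [w for w in text.lower().split() if toy_stem(w) not in stop_stems]

print(index_terms("the printer keeps going offline"))
# ['printer', 'keeps', 'offline']
```

Here "going" is dropped even though only "go" appears in the list, because both reduce to the same stem.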

Suffix file

The suffix file, [ir_languagefiles_path]language.SUF, contains a series of lines, each of which is a list of valid suffix values. The .STM file indicates which line in the .SUF file lists the possible suffixes for any given stem word. For example, the suffixes for the stem go might be “ing”, “es”, or “ne”.
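
As a rough illustration of how the two files relate, the Python sketch below parses sample file contents and lists the word forms allowed for go. The exact file syntax is an assumption: the sketch treats each .SUF line as a comma-separated list and each .STM line as a stem, a blank, and a 1-based line number. The function names are hypothetical.

```python
# Sketch only: resolve a stem's suffix line from sample file contents.
# ASSUMPTION: .SUF lines are comma-separated; .STM lines are "<stem> <line number>".

SUF_TEXT = "ing, es, ne\n"
STM_TEXT = "go 1\n"

def parse_suf(text):
    """Return one list of suffixes per non-empty line."""
    return [[s.strip() for s in line.split(",")]
            for line in text.splitlines() if line.strip()]

def parse_stm(text):
    """Return a dict mapping each stem to its 1-based suffix line number."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            stem, index = line.split()
            entries[stem] = int(index)
    return entries

suffix_lines = parse_suf(SUF_TEXT)
stems = parse_stm(STM_TEXT)

# Look up the suffix line for "go" and build the word forms it allows.
forms = ["go" + suffix for suffix in suffix_lines[stems["go"] - 1]]
print(forms)  # ['going', 'goes', 'gone']
```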

 
