Language Attribute

Last Updated: July 8, 2026

Identifies the language that the rulebase applies to. When used as an attribute on a rule, it applies to that rule and its child rules.

The language only really affects the leaf nodes of a ruletree (i.e. text rules and expression rules). Specifying a language restricts text matching to tokens in a document that have been determined to be in that language. If a stem attribute is specified, the language must have a valid stemmer; otherwise an error is raised and the rulebase is not loaded.

The artificial (and annoying) use of the language to select between the available stemming algorithms has been changed.

However, the old mechanism is still supported for backwards compatibility.

The problem with the old mechanism was that specifying (for example) language=“en2” for a rulebase meant that stem matches would only happen if the document itself was submitted as being in language “en2”. If “en” or “en1” was specified for the request, no stem matches would occur.

This has now been changed so that the language and stemmer are defined independently for the rulebase. When a document is processed, all appropriate stemmer algorithms for the language are used.

This means that the stemmer variant may now be set separately for each rulebase (or in fact for a particular rule), and this is handled transparently.

There are two mechanisms for specifying which stemmer algorithm should be used: one for backwards compatibility, and the new way.

The new way is simply to use the stem=“N” attribute to specify which stemmer variant you wish to use:

...
<rulebase language="english">
...
<text data="stemming" stem="2"/>
<text data="may" stem="3"/>
<text data="be" stem="1"/>
...
</rulebase>

This makes entries in the rulenet for stemmer variants 1, 2 and 3 for the English language. When an English document is processed, each token is stemmed with stemmer variants 1, 2 and 3 (Porter, Marathon and morphological) and each stemmed token is checked against the appropriate search tree. (NB a stem=“1” rule will never be checked against a stem=“3” token, so matching works correctly.)

Functionality changed in version [7.12]

Applies to

[7.12] The language attribute is valid on any rule / node and is inherited by that node's children.

[Previous versions] The language attribute is a special case and does not apply to any operators. It is an attribute of the rulebase node, which is the parent node for a rulebase.
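
As a sketch of the 7.12 behaviour, the fragment below puts a language attribute on an individual rule rather than on the rulebase node. It assumes that a language set on a rule takes precedence over the one set on the rulebase for that rule and its children, which this page implies but does not state outright. The data values are illustrative and the surrounding rule structure is elided.

```xml
<rulebase language="english">
  <!-- other rule structure elided -->
  <!-- no language attribute here, so this rule uses the rulebase language (English) -->
  <text data="house"/>
  <!-- assumption: a language set on the rule itself applies to this rule and any children -->
  <text data="maison" language="french"/>
</rulebase>
```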

[7.12] Values

* Any valid language name or ISO 639-1 language code

* “Any”, a non-ISO language value that allows rules to be defined which are checked for a match against any document sent for classification, whatever its language.

(For backwards compatibility, en1 and en2 are still supported even though they are not ISO codes; see the note below for details.)

When using language packs, only languages for which a valid language pack is installed are supported (or language “any”).
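
The accepted value forms can be illustrated with three alternative rulebase nodes. These are separate fragments, not one file. The lowercase name “french” is an assumption made by analogy with the “english” example earlier on this page, and the case of “any” follows one of the two spellings used here.

```xml
<!-- full language name -->
<rulebase language="french">
  <!-- rules elided -->
</rulebase>

<!-- ISO 639-1 code, presumably equivalent to the name above -->
<rulebase language="fr">
  <!-- rules elided -->
</rulebase>

<!-- special non-ISO value: rules are checked against documents in any language -->
<rulebase language="any">
  <!-- rules elided -->
</rulebase>
```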

Pre [7.12] Values

“en” - English
“en1” - English (Stemmer 1) - default for stemming if no language is specified
“en2” - English morphological stemmer
“en3” - English morphological and derivational stemmer
“fr” - French
“it” - Italian
“de” - German
“es” - Spanish
“nl” - Dutch
“pt” - Portuguese
“da” - Danish
“no” - Norwegian
“sv” - Swedish

Standard Mode

The stemmer variants for English are:

0 - No stem
1 - Original Porter algorithm (snowball)
2 - Modified Porter algorithm (Marathon stemmer)
3 - Morphological stemmer
4 - Morphological and derivational (NB without part-of-speech determination, derivational stemming is not very useful in practice)

For the other languages that were previously supported in standard mode (French, Italian, German, Spanish, Dutch, Portuguese, Danish, Norwegian, Swedish):

0 - No stem
1 - Porter algorithm

For all other languages:

0 - No stem only

The rulebase will generate an error and will not be loaded if it asks for a stem variant that is not supported.
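
To illustrate the error case in standard mode, the hypothetical rulebase below asks for stemmer variant 2 in French, for which the list above gives only variants 0 and 1. Under the rule just stated, such a rulebase would be expected to fail to load. The exact error message is not given in this reference, and the data values are illustrative.

```xml
<rulebase language="french">
  <!-- rules elided -->
  <!-- supported: variant 1 (Porter) is available for French in standard mode -->
  <text data="maison" stem="1"/>
  <!-- not supported: French has no variant 2 in standard mode, so loading is expected to fail -->
  <text data="jardin" stem="2"/>
</rulebase>
```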

Language Pack Mode

When using a language pack, only those languages that have been installed on the machine are allowed (or the special language code “Any”).

A major advantage of a language pack is that it provides part-of-speech determination for the particular language (along with other advanced language-specific functionality). This allows a derivational stemmer (sometimes called a lemmatiser) to work effectively, which avoids the majority of the problems shown by algorithmic stemmers. For this reason the language packs (currently) support only a single stemmer per language.

When using a language pack, asking for stem=“2” means the same as stem=“1” (i.e. it is not considered an error). This supports easy swapping between language pack mode and standard mode without regenerating the rules.
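
A short sketch of what this means in practice: the rulebase below is written for standard mode, where English has several stemmer variants. Under a language pack it would be expected to load unchanged, with stem=“2” treated like stem=“1”. How values above 2 are handled in language pack mode is not stated here, so the sketch avoids them. The data values are illustrative.

```xml
<rulebase language="english">
  <!-- rules elided -->
  <!-- standard mode: Porter (variant 1); language pack mode: the pack's single stemmer -->
  <text data="stemming" stem="1"/>
  <!-- standard mode: Marathon (variant 2); language pack mode: treated the same as stem="1" -->
  <text data="matching" stem="2"/>
</rulebase>
```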

Pre [7.12] Function

Identifies the language that the rulebase will apply to. Applying a language to a rulebase has a dual purpose. Firstly it specifies the stemmer which is used for any stemming operations. Secondly it identifies the language to which the rule actually applies.

When a document is classified, if a language is specified for the document then only rules from rulebases specifying that same language are used to classify it.

Backwards Compatibility

To support existing rulenets and current practice, the new stemmer variant functionality has a backwards compatibility mode. This mode is switched on at the start of processing for every rulebase file (NB per file, not just per pak) and is only switched off if the new mechanism is obviously being used.

When processing in backwards compatibility mode, the default stemmer variant may be set by the language identifier. In this mode stem=“1” means “use the default stemmer variant” rather than “use stemmer variant 1”.

For example :-

...
<rulebase language="en2">
...
<text data="word" stem="1"/>
...

In this case stem=“1” means use the stemmer variant that was previously used to support language=“en2”.

NB This may be particularly important when a language is not specified by the rulebase but is retrieved from the configuration.

NB2 Since stemmer variant 0 means “no stem”, there is an annoying discrepancy: the stemmer variant used by “en2” is actually stemmer variant 3.

The backwards compatibility mode is exited when either of two conditions is met:

* 1] a stem=“n” attribute on the rulebase node
* 2] a stem=“n” attribute on a rule node where n is > 1

Generally the first mechanism is preferred, since it does not run the risk of misinterpreting a stem=“1” attribute that appears near the start of the rulebase, before backwards compatibility mode has been switched off. For example:

...
<rulebase language="english" stem="1" >
...
<text data="word1" stem="1"/>
<text data="word2" stem="3"/>
...

In this case word1 is processed using stemmer variant 1 (Porter) as required. It would be difficult to construct the misinterpreted case except by mistake:

...
<rulebase language="en2">
...
<text data="word1" stem="1"/>
<text data="word2" stem="3"/>
<text data="word3" stem="1"/>
...

In this case word1 would be processed using stemmer variant 3 (the variant used for the old language=“en2”) but word3 would use stemmer variant 1, because the stem=“3” on word2 switches off backwards compatibility mode. However, this is unlikely to happen, since using stem=“3” in the rulebase should mean that the rulebase also uses the new language naming, e.g.

...
<rulebase language="english">
...
<text data="word1" stem="1"/>
<text data="word2" stem="3"/>
<text data="word3" stem="1"/>
...

In this case word1 would use stemmer variant 1 as expected. Since the default stemmer variant has been set to 1, it does not matter that word1 is processed in backwards compatibility mode.
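
Putting these rules together, an old-style rulebase that relied on language=“en2” could be rewritten in the new style roughly as follows. This is a sketch, assuming that the old “en2” stemmer corresponds to stemmer variant 3 as noted above. What a rulebase-level stem value does beyond switching off backwards compatibility mode is not spelled out on this page, so each rule names its variant explicitly. The data value is illustrative.

```xml
<!-- old style, for comparison: rulebase language="en2" containing text data="word" stem="1" -->
<rulebase language="english" stem="3">
  <!-- the stem attribute on the rulebase node switches off backwards compatibility mode -->
  <!-- rules elided -->
  <!-- stem="3" explicitly names the morphological stemmer that "en2" used to select -->
  <text data="word" stem="3"/>
</rulebase>
```

With this form, stem matches no longer depend on the document being submitted as “en2”.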

Semaphore Classification Server Rulebase Reference