Variant Generators for Rulebase Generation
- Last Updated: May 13, 2026
When publishing data to rulebases, you will usually want to generate variants of the labels in the model. For instance, you may wish to expand acronyms so that they match the range of formats present in your document set, or you may want to extend the range of company suffixes that should be used as evidence (plc, ltd, inc, gmbh, etc.). Adding these to the model itself would be cumbersome and would make it much less readable, so during publishing it is possible to add a range of variant generators to populate this data automatically.
It is possible to limit the labels to which each variant generator applies (see the section on filtering below), and it is also possible to set a variant generator to replace, rather than add to, the labels already present. This latter option is set using the boolean flag “replaceLabels”. If the flag is set to true and a variant is generated, then the form from which the variant is derived will not be output during rulebase generation. So if you have a hyphen handler and present it with “baseball-player”, then only “baseball player” will be output. If you want both “baseball-player” and “baseball player” to be output, ensure that this flag is set to false (the default value). Further variants will not be generated from forms that are marked as not to be output.
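As an illustration of the flag, the following is a minimal sketch. It assumes the RegExVariantGenerator described later on this page and uses a hypothetical bean id, “hyphenToSpaceReplacer”. It turns a hyphen into a space and, because “replaceLabels” is true, suppresses the hyphenated original.

```xml
<!-- Hypothetical bean id; the class and property names are those used elsewhere on this page -->
<bean id="hyphenToSpaceReplacer" class="com.smartlogic.publisher.variantgeneration.RegExVariantGenerator">
  <property name="replacementMap">
    <map>
      <!-- Search for a hyphen and substitute a single space -->
      <entry key="-" value=" " />
    </map>
  </property>
  <!-- true: the hyphenated source form is not output alongside the variant -->
  <property name="replaceLabels" value="true" />
</bean>
```

Under these assumptions, “baseball-player” would be expected to be published only as “baseball player”. Setting the flag to false would keep both forms.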
Variant generators can be configured to run globally (the default out of the box configuration) or can be configured to run on a Config Set by Config Set basis. For the latter, add the “variantGenerators” property and/or the “postVariantGenerationProcessors” and their lists to the Config Set directly. When these are added to the Config Set directly, they will override the default settings found (typically) in the ConfigurationSets.xml file.
Labels that are handled in variant generators do not allow use of wildcard symbols. Instead of considering the wildcard as part of the phrase, the generator splits the wildcard token into its own phrase rule, separating it from the rest of the label.
If you want to use wildcards to catch more general patterns on labels, then the label must not appear in the variants file and just be handled by the model and publishing template. (See data for more information on wildcard use in labels.)
The Regular Expression Variant Generator
The regular expression variant generator is the most flexible of all the variant generators. It looks for a specified search expression in the input form; if a match is found, it is replaced by the specified replacement expression. The replacement expression can include references back to groups within the search expression.
If you wish to carry out just one search/replace operation, you can use the searchExpression and replaceExpression properties of the bean to specify the transformation that should take place.
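A single-operation configuration might look like the sketch below. The bean id “colourToColor” and both expressions are illustrative assumptions; only the class and the two property names come from the description above. The filtering example later on this page uses list-valued “searchExpressions” and “replaceExpressions” properties instead, so it is worth confirming which form your Publisher version accepts.

```xml
<!-- Hypothetical bean id and expressions -->
<bean id="colourToColor" class="com.smartlogic.publisher.variantgeneration.RegExVariantGenerator">
  <!-- Pattern to look for in each label -->
  <property name="searchExpression" value="colour" />
  <!-- Text substituted for the match -->
  <property name="replaceExpression" value="color" />
</bean>
```

With this in place, a label such as “colour chart” would be expected to gain the variant “color chart”.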
If you wish to carry out multiple search/replace operations, you can use the replacementMap property of the bean to specify the mapping from searchExpression to replacementExpression, for example the bean:
<bean id="RegExHandler" class="com.smartlogic.publisher.variantgeneration.RegExVariantGenerator">
  <property name="replacementMap">
    <map>
      <entry key="cheese" value="pickle" />
      <entry key="^([A-Za-z]+) ([A-Za-z]+)$" value="$2 $1" />
      <entry key="([A-Za-z]+) ([0-9]+)" value="$1$2" />
    </map>
  </property>
  <property name="replaceLabels" value="false" />
  <property name="caseSensitive" value="false" />
  <property name="recursionLimit" value="0" />
  <property name="includeIntermediate" value="false" />
</bean>
will create new variants using three search replace operations.
- The first will generate a new variant for every label that contains the word “cheese”; in the new variant the word “cheese” is replaced with the word “pickle”. Because the generation is case insensitive (“caseSensitive” is set to false in the example above), the word “Cheese” will also be replaced by “pickle”. If you need to preserve case, set “caseSensitive” to true and specify each case explicitly.
- The second takes all labels that consist of precisely two words (where the words contain only alphabetical characters) and generates a new variant in which the two words are swapped. “Progress Semaphore” will produce the additional variant “Semaphore Progress”.
- The third generates a new variant for every label that contains a word followed by a space followed by a number, with the word and number concatenated, so “Windows 2010” will generate the new variant “Windows2010”.
The additional tuning parameters are:
- replaceLabels - as above, should new variants replace the original values (true) or be created in addition to them (false)? (default: false)
- caseSensitive - should the pattern matching be case sensitive? (default: true)
- recursionLimit - how many times should the regular expression be applied? Sometimes the output from a regular expression replacement will match the regular expression again; this setting limits how often a new variant is generated from such a subsequent replacement. Note that this can lead to confusion. In the second example above (the swapping of the words), if the regular expression is applied an even number of times then the output will be the same as the input, and so no effect will be seen. (default: 5)
- includeIntermediate - if the regular expression is applied recursively, should variants be generated for the intermediate values? (default: true)
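To make the recursion settings concrete, here is a sketch under stated assumptions. The bean id “trailingSuffixStripper” and the pattern are hypothetical, and the sketch assumes that a “recursionLimit” of 2 allows the expression to be applied a second time to its own output. The pattern removes one trailing company suffix per application.

```xml
<!-- Hypothetical bean id and pattern -->
<bean id="trailingSuffixStripper" class="com.smartlogic.publisher.variantgeneration.RegExVariantGenerator">
  <property name="replacementMap">
    <map>
      <!-- Strip a single trailing suffix: "Acme Ltd plc" becomes "Acme Ltd" -->
      <entry key="^(.+) (plc|ltd|inc)$" value="$1" />
    </map>
  </property>
  <property name="replaceLabels" value="false" />
  <!-- Match "Ltd", "LTD" and "ltd" alike -->
  <property name="caseSensitive" value="false" />
  <!-- Assumed meaning: the expression may be re-applied to its own output, up to two applications -->
  <property name="recursionLimit" value="2" />
  <!-- Keep the intermediate form "Acme Ltd" as a variant as well -->
  <property name="includeIntermediate" value="true" />
</bean>
```

For the label “Acme Ltd plc”, the first application yields “Acme Ltd” and the second yields “Acme”. With “includeIntermediate” set to false, only the final form would be expected as a new variant.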
As part of the regular expression operation, whitespace will be normalized and leading and trailing spaces removed after any substitution.
Like all the handlers, the RegExVariantGenerator can be configured to run globally, or on a Config Set by Config Set basis. To run globally, add the bean to a file imported by the Publisher config (e.g. the RulebaseStructure.xml or the ConfigurationSets.xml file), and add the reference to the bean in the list of “variantGenerators” in ConfigurationSets.xml. For example, the ConfigurationSets.xml file could look like this:
...
  <property name="variantGenerators">
    <list>
      <ref bean="characterEscapingVariantGenerator" />
      <ref bean="hyphenHandler" />
      <ref bean="andHandler" />
      <ref bean="bracketHandler" />
      <ref bean="punctuationHandler" />
      <ref bean="acronymHandler" />
      <ref bean="RegExHandler" />
    </list>
  </property>
  <property name="postVariantGenerationProcessors">
    <list>
      <bean class="com.smartlogic.publisher.preprocessing.WordTypeProcessor" />
      <ref bean="preclusionProcessor" />
    </list>
  </property>
</bean>
<bean id="RegExHandler" class="com.smartlogic.publisher.variantgeneration.RegExVariantGenerator">
  <property name="replacementMap">
    <map>
      <entry key="cheese" value="pickle" />
      <entry key="^([A-Za-z]+) ([A-Za-z]+)$" value="$2 $1" />
      <entry key="([A-Za-z]+) ([0-9]+)" value="$1$2" />
    </map>
  </property>
  <property name="replaceLabels" value="false" />
  <property name="caseSensitive" value="false" />
  <property name="recursionLimit" value="0" />
  <property name="includeIntermediate" value="true" />
</bean>
...
To set the RegExVariantGenerator on individual Config Sets, the entire bean can be put in the Config Set’s list of variantGenerators:
<property name="variantGenerators">
  <list>
    <ref bean="characterEscapingVariantGenerator" />
    <ref bean="hyphenHandler" />
    <ref bean="andHandler" />
    <ref bean="bracketHandler" />
    <ref bean="punctuationHandler" />
    <ref bean="acronymHandler" />
    <bean id="RegExHandler" class="com.smartlogic.publisher.variantgeneration.RegExVariantGenerator">
      <property name="replacementMap">
        <map>
          <entry key="cheese" value="pickle" />
          <entry key="^([A-Za-z]+) ([A-Za-z]+)$" value="$2 $1" />
          <entry key="([A-Za-z]+) ([0-9]+)" value="$1$2" />
        </map>
      </property>
      <property name="replaceLabels" value="false" />
      <property name="caseSensitive" value="false" />
      <property name="recursionLimit" value="0" />
      <property name="includeIntermediate" value="false" />
    </bean>
  </list>
</property>
Note: once you add the “variantGenerators” property to the Config Set, it will override the default setting found in the ConfigurationSets.xml file. Therefore, be sure to include in the list all of the variant generators you want applied to the Config Set.
The following variant generators are built upon the RegExVariantGenerator in order to give a set of useful handlers that can be configured directly without knowing anything about regular expressions:
HyphenHandler
This replaces all “-” characters with spaces for one generated variant, and with nothing for another variant. Therefore “bee-keeper” will generate the two variants “bee keeper” and “beekeeper”. It has one additional property that can be updated: “charactersToReplace” is a list of the strings to replace (default values “-”, “—”, “–”).
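If the default character list needs adjusting, one option is Spring bean definition inheritance, sketched below. This assumes that the Publisher configuration is a standard Spring beans file, that the stock “hyphenHandler” bean is visible in the same context, and that “charactersToReplace” accepts a list of string values. The bean id “asciiHyphenHandler” is hypothetical.

```xml
<!-- Hypothetical child bean: inherits the class and remaining settings from the existing hyphenHandler bean -->
<bean id="asciiHyphenHandler" parent="hyphenHandler">
  <property name="charactersToReplace">
    <list>
      <!-- Only the plain hyphen is handled; the two longer dash defaults are dropped -->
      <value>-</value>
    </list>
  </property>
</bean>
```

To take effect, the new bean would need to replace the “hyphenHandler” reference in the relevant “variantGenerators” list.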
AndHandler
This replaces all instances of the word “and” with an ampersand for one variant, and replaces all ampersands with “ and ” for another variant. Therefore “Marks & Spencer” will generate “Marks and Spencer”, and “Marks and Spencer” will generate “Marks & Spencer”.
PrefixHandler
This removes all instances of the configured set of “prefixesToStrip” at the start of the label along with any following non-word characters - there must be at least one of these. So if “pre” is defined as a prefix to strip, then “pre-school” will generate “school”. However “preschool” will not generate a variant.
SingleQuoteHandler
This replaces all instances of the smart quotation characters ‘ and ’ with the plain single quotation character '.
DoubleQuoteHandler
This replaces all instances of the smart quotation characters “ and ” with the plain double quotation character ".
SuffixHandler
This removes all instances of the configured set of “suffixesToStrip” at the end of the label along with any leading non-word characters - there must be at least one of these. So if “PLC” is defined as a suffix to strip, then “Progress Semaphore PLC” will generate the variant “Progress Semaphore”.
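Because these handlers are built on the RegExVariantGenerator, their behaviour can be approximated with explicit expressions. The sketch below is not the handlers' actual implementation; the bean id “prefixSuffixSketch” and both patterns are assumptions chosen to reproduce the “pre” and “PLC” examples given for the PrefixHandler and SuffixHandler.

```xml
<!-- Hypothetical bean id and patterns; an approximation, not the handlers' own expressions -->
<bean id="prefixSuffixSketch" class="com.smartlogic.publisher.variantgeneration.RegExVariantGenerator">
  <property name="replacementMap">
    <map>
      <!-- Prefix "pre" followed by at least one non-word character: "pre-school" gives "school" -->
      <entry key="^pre\W+(.+)$" value="$1" />
      <!-- At least one non-word character followed by the suffix "PLC": "Progress Semaphore PLC" gives "Progress Semaphore" -->
      <entry key="^(.+?)\W+PLC$" value="$1" />
    </map>
  </property>
  <property name="replaceLabels" value="false" />
  <property name="caseSensitive" value="true" />
</bean>
```

Neither pattern matches “preschool”, since no non-word character follows the prefix, which mirrors the PrefixHandler behaviour.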
PunctuationHandler
This replaces every punctuation character in the label with a space. It takes one property, “punctuationCharacters”, which is a single string containing all the punctuation characters to be replaced. (default: “,”)
Other variant handlers not based on the regular expression handler are as follows:
AcronymHandler
This will generate variants in which any block of two or more upper case letters is replaced in one variant by “.” separated letters, and in the other by “. ” separated letters. Therefore “Progress Semaphore PLC” will generate “Progress Semaphore P.L.C.” and “Progress Semaphore P. L. C.”.
DiacriticRemovalHandler
This will generate variants in which any diacritic marks in the label are removed. For instance, “Karel Čapek” will generate “Karel Capek”.
CharacterEscapingVariantGenerator
This is a special handler. It references the Semaphore settings on the label in Knowledge Model Management and applies the selected choice. Note that the default value of the setting is configured on this handler in the file. The handler details are actually located in the includes/ModelInterface.xml file as that is where all the Semaphore Settings handlers are defined.
BracketHandler
The BracketHandler will generate variants where the original label contains parentheses. There are a number of configuration parameters available:
- openingBracketCharacters (default “([{<”) - the set of characters that are to be treated as the opening of a parenthetic clause
- closingBracketCharacters (default “)]}>”) - the set of characters that are to be treated as the closing of a parenthetic clause
- removeGlossBrackets (default true) - should a variant of the label be generated which is the label with the bracket characters removed?
- removeGlossEntirely (default true) - should a variant of the label be generated which is the label with the bracket characters and the text within them removed?
- keepGlossClauses (default false) - should variants of the label be generated that consist only of the contents of bracket sets (if there are multiple bracket sets in a label, then multiple variants will be generated here)?
- keepNonGlossClauses (default false) - should variants of the label be generated that consist of the label fragments separated by bracketed clauses?
For instance, for the label “This (is) a (label)” the following variants can be generated:
- removeGlossBrackets - generate “This is a label”
- removeGlossEntirely - generate “This a”
- keepGlossClauses - generate “is” and “label”
- keepNonGlossClauses - generate “This” and “a”
Each of these four switches can be selected individually.
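A configuration using these switches might look like the following sketch. It relies on Spring bean definition inheritance so that the class of the stock “bracketHandler” bean need not be restated. That the stock bean is visible in the same context, that the character sets are plain string properties, and the bean id “glossStrippingBracketHandler” are all assumptions.

```xml
<!-- Hypothetical child bean: inherits its class from the existing bracketHandler bean -->
<bean id="glossStrippingBracketHandler" parent="bracketHandler">
  <!-- Treat only round and square brackets as parenthetic delimiters -->
  <property name="openingBracketCharacters" value="([" />
  <property name="closingBracketCharacters" value=")]" />
  <!-- "This (is) a (label)" gives "This is a label" -->
  <property name="removeGlossBrackets" value="true" />
  <!-- "This (is) a (label)" gives "This a" -->
  <property name="removeGlossEntirely" value="true" />
  <!-- Do not emit "is" and "label" on their own -->
  <property name="keepGlossClauses" value="false" />
  <!-- Do not emit "This" and "a" on their own -->
  <property name="keepNonGlossClauses" value="false" />
</bean>
```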
Examples of all of these handlers are configured in the includes/RulebaseStructure.xml configuration file. However, by default only the characterEscapingVariantGenerator, the hyphenHandler, the bracketHandler, the andHandler, the punctuationHandler and the acronymHandler are included in the all terms collection set configuration.
Misspelling variant handler
If you have a number of words in your model that are often misspelt in presented documents, then you can use the misspelling variant handler to define a set of misspellings for any word. The variant handler is defined as follows:
<bean id="misspellingVariantGenerator" class="com.smartlogic.publisher.variantgeneration.filebased.MisspellingVariantGenerator">
  <property name="csvFileName" value="resources/variants/commonMisspellings.txt" />
  <property name="mapFirstOnly" value="false" />
  <property name="caseSensitive" value="false" />
</bean>
The csvFileName points to a csv file that contains in column 1 a word that might appear in the model. The following columns contain misspellings of that word. When generating rules, variants of the label will be generated with each word replaced by any of its misspellings. If “mapFirstOnly” is set to “false”, then any occurrence of any word in the set will be replaced by all the other words in the set.
If the expression match is to be case insensitive then set the caseSensitive property to be “false”. If the caseSensitive property is not present, the default value is “true”.
Because the number of variants generated is the product of the number of misspellings for each word contained in the label, there is a property “replacementsLimit” that sets the maximum number of misspellings that will be generated. By default this is set to 100.
Another parameter, “appliesToLanguage”, expects a single language code. If this is set, then misspellings will only be generated for the corresponding language. If you want misspellings implemented across a range of languages, create one file and one variant generator for each language and add each to the list of variant generators.
For file-based handlers, the first row of the CSV file may not be read. Put dummy text in the first row, or skip it.
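The per-language setup described above might be sketched as follows. The bean ids, the file names and the language codes “en” and “fr” are hypothetical, and the exact code format expected by “appliesToLanguage” is an assumption. Each file would hold a correct spelling in column 1 followed by its misspellings in the remaining columns, beneath a dummy first row.

```xml
<!-- Hypothetical English generator -->
<bean id="misspellingVariantGeneratorEn" class="com.smartlogic.publisher.variantgeneration.filebased.MisspellingVariantGenerator">
  <property name="csvFileName" value="resources/variants/commonMisspellings_en.txt" />
  <property name="mapFirstOnly" value="false" />
  <property name="caseSensitive" value="false" />
  <!-- Lower the cap on generated misspellings from the default of 100 -->
  <property name="replacementsLimit" value="20" />
  <!-- Only generate misspellings for labels in this language -->
  <property name="appliesToLanguage" value="en" />
</bean>

<!-- Hypothetical French generator with its own file -->
<bean id="misspellingVariantGeneratorFr" class="com.smartlogic.publisher.variantgeneration.filebased.MisspellingVariantGenerator">
  <property name="csvFileName" value="resources/variants/commonMisspellings_fr.txt" />
  <property name="mapFirstOnly" value="false" />
  <property name="caseSensitive" value="false" />
  <property name="replacementsLimit" value="20" />
  <property name="appliesToLanguage" value="fr" />
</bean>
```

Both beans would then be referenced from the “variantGenerators” list.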
Expression Replacement Variant Generator
If you have a number of phrases that need to be replaced in a number of terms with other values, you can use the ExpressionReplacementVariantGenerator. The variant handler is defined as follows:
<bean id="phraseVariantGenerator" class="com.smartlogic.publisher.variantgeneration.filebased.ExpressionReplacementVariantGenerator">
  <property name="csvFileName" value="config/${model.name}/PhraseVariants.txt" />
  <property name="mapFirstOnly" value="false" />
  <property name="caseSensitive" value="false" />
</bean>
Other than “replacementsLimit” which is not available here, all parameters are the same as for the misspellingVariantGenerator.
For file-based handlers, the first row of the CSV file may not be read. Put dummy text in the first row, or skip it.
Applying variant generation to particular label types
If you want to apply variant generation to only certain label types, then you can apply a filter.
There are three filters that can be applied. Only labels that pass through all three filters will have variants generated for them. The filters are set by setting the appropriate property on the variant generator.
- appliesToLabelTypes - list of the types of labels that should be included. The values acceptable are “prefLabel” to only include preferred labels, “altLabel” to only include alternative labels (including all sub-types), or “prefLabel|altLabel” to include both label types. Also, alternative label sub-types can be specified and separated by a “|”. If the property is not present, the default value is “prefLabel”.
- appliesToRelationshipTypes - list of the relationship names that should be included if alternative labels are included according to the label type filter. This takes a | or ^ separated list of relationship names, ^ means exclude this relationship type, | means include.
- appliesToWordTypes - list of the word types that should be included (in many cases this will not be necessary as, for instance, the acronym handler will only create variants for acronyms anyway). Again, this is a | or ^ separated list of word types. See word types for more details of this field.
As an example, it is possible that a model contains people’s details including a full name. These full names may be provided in the form “<first name> <last name>” or in the format “<last name>, <first name>”. If they are provided in the latter format then we would like to look for evidence of the form “<first name> <last name>”. To do this, we don’t need to add the labels to the model, we can use the Regular Expression variant generator to generate them at publish time.
An example configuration for this would be:
<bean id="regexHandler" class="com.smartlogic.publisher.variantgeneration.RegExVariantGenerator">
  <property name="searchExpressions">
    <list>
      <value>(\w*), (\w*)</value>
    </list>
  </property>
  <property name="replaceExpressions">
    <list>
      <value>$2 $1</value>
    </list>
  </property>
</bean>
This looks for any pair of words with a comma immediately after the first and replaces them with the words in reverse order and no comma.
This is a fairly general variant generator that may generate a number of strange evidence terms from labels that are not full name labels. Therefore, to restrict it to just full name alternative labels, we add the properties:
<property name="appliesToLabelTypes" value="altLabel" />
<property name="appliesToRelationshipTypes" value="Full name" />
where “Full name” is the name of the relationship between the concept and its full name labels.
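Putting the two fragments together, the complete bean might read as follows. This simply combines the configuration shown above, assuming the filter properties sit alongside the expression lists within the same bean.

```xml
<bean id="regexHandler" class="com.smartlogic.publisher.variantgeneration.RegExVariantGenerator">
  <!-- Match "<last name>, <first name>" -->
  <property name="searchExpressions">
    <list>
      <value>(\w*), (\w*)</value>
    </list>
  </property>
  <!-- Emit "<first name> <last name>" -->
  <property name="replaceExpressions">
    <list>
      <value>$2 $1</value>
    </list>
  </property>
  <!-- Restrict the generator to alternative labels attached via the "Full name" relationship -->
  <property name="appliesToLabelTypes" value="altLabel" />
  <property name="appliesToRelationshipTypes" value="Full name" />
</bean>
```

With this configuration, an alternative label such as “Smith, John” attached through the “Full name” relationship would be expected to gain the variant “John Smith”, while preferred labels and other alternative labels are left alone.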