Regex Attribute
- Last Updated: May 13, 2026
In template mode, a category rule does not simply return its “name” attribute when it scores above the threshold. Instead, the text is extracted from descendant rule(s) that have the CAPTURE attribute set.
For each data phrase in the document, this optional regex replacement is applied (see here for a synopsis of the regex syntax used).
The syntax is taken from sed: a rule of the form s/A/B/ substitutes B for all occurrences of A in the captured value. Note that sed options (/iG etc.) are not currently supported, but may be added in the future if required.
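As a rough illustration of the s/A/B/ form, the following Python sketch parses such a rule and applies it to a value. The helper name apply_sed_replacement is hypothetical, and Python's re module is only a stand-in here: the regex engine used by the product may differ in syntax details.

```python
import re


def apply_sed_replacement(rule: str, value: str) -> str:
    """Apply a sed-style 's/A/B/' rule to value, replacing every match of A.

    Hypothetical helper for illustration only; it is not part of the product.
    """
    # Only the plain s/A/B/ form is accepted, with no trailing option flags,
    # mirroring the restriction described in the text.
    if not rule.startswith("s/") or not rule.endswith("/"):
        raise ValueError(f"unsupported rule: {rule!r}")
    # Split on '/' characters that are not escaped with a backslash.
    parts = re.split(r"(?<!\\)/", rule)
    # A valid rule splits into ['s', A, B, ''].
    if len(parts) != 4 or parts[0] != "s" or parts[3] != "":
        raise ValueError(f"unsupported rule: {rule!r}")
    # Turn escaped delimiters back into plain slashes.
    pattern = parts[1].replace("\\/", "/")
    replacement = parts[2].replace("\\/", "/")
    # re.sub replaces all non-overlapping matches of the pattern.
    return re.sub(pattern, replacement, value)


print(apply_sed_replacement("s/A/B/", "ABBA"))
print(apply_sed_replacement("s/1[6789][0-9][0-9].*/Too Early/", "1993-12-01"))
```

A rule with trailing flags, such as s/A/B/g, is rejected by this sketch rather than silently accepted.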
The grouping of the extracted phrase ranges into distinct firings (with the appropriate foreach count) happens after this regex replacement step, so the replacement may be used to merge found data into a single firing where appropriate.
NB: If the extracted phrase range crosses a paragraph boundary, remember that a paragraph separator “.\n\n” is present in the data passed to the regex search/replace. When this data is written to the XML output it becomes part of an XML attribute (the value attribute on the META node), where \n is not valid, so it is removed. This has been a source of confusion when writing regexes: your search pattern needs to take account of these \n characters (and either leave or remove them, as appropriate) even though they do not appear in the output.
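The paragraph separator pitfall can be sketched in Python as follows. The sample text and the helper as_attribute_value are made up for illustration, and Python's re module stands in for the product's regex engine, so treat this as an approximation of the behaviour described rather than a reproduction of it.

```python
import re

# Assumed captured data: a phrase range crossing a paragraph boundary,
# so it contains the ".\n\n" separator.
captured = "first paragraph.\n\nsecond paragraph"


def as_attribute_value(data: str) -> str:
    """Mimic the output step: newlines are dropped from the attribute value."""
    return data.replace("\n", "")


# What would be visible in the value attribute of the META node.
shown = as_attribute_value(captured)

# A pattern written from what the output shows does not match the real data,
# because the data still contains the two newlines at this point.
written_from_output = re.sub(r"paragraph\.second", "paragraph, second", captured)

# A pattern that accounts for the separator matches and can replace it.
separator_aware = re.sub(r"\.\n\n", ", ", captured)

print(repr(shown))
print(repr(written_from_output))
print(repr(separator_aware))
```

The first substitution leaves the data unchanged, which is the confusing case: the pattern looks correct when compared with the output, but the newlines are still present when the regex runs.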
Applies to
Values
- Regex replacement to apply to captured data
- e.g. “s/A/B/” replaces A with B
Examples
The following:
<category class="DATES" foreach="1" weight="50" template="1" regex="s/1[6789][0-9][0-9].*/Too Early/">
<expression type="date" capture="1" />
</category>
Evaluated against the following document fragment:
On the morning of Wednesday 21st June 2012 I updated this documentation.
I'm not quite sure on what date it was originally written but am guessing that it was some time in Q2 2009.
On the 3/1/2011 it had been updated which is another way of saying 1st March 2011 or possibly 1/3/2011 if using Advanced Language Packs.
We should ignore 1/12/1993 and 1/3/1932 since we don't care about early dates
Would return:
...
<META name="DATES" value="2011-01-03" score="0.50" />
<META name="DATES" value="2011-03-01" score="0.75" />
<META name="DATES" value="2012-06-21" score="0.50" />
<META name="DATES" value="Too Early" score="0.75" />
...
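The merging of the two early dates into one firing can be approximated in Python. The list of normalised values below is an assumption read off the output above (one entry per date mention that appears there), and the sketch only counts occurrences per value; it does not attempt to reproduce the scoring.

```python
import re
from collections import Counter

# Assumed normalised values for the date mentions that appear in the output.
captured = [
    "2012-06-21",  # Wednesday 21st June 2012
    "2011-01-03",  # 3/1/2011
    "2011-03-01",  # 1st March 2011
    "2011-03-01",  # 1/3/2011
    "1993-12-01",  # 1/12/1993
    "1932-03-01",  # 1/3/1932
]

# The replacement runs first: any value containing a year from 1600 to 1999
# has everything from that year onwards replaced by "Too Early".
rewritten = [re.sub(r"1[6789][0-9][0-9].*", "Too Early", v) for v in captured]

# Grouping happens afterwards, so both early dates fall into the same group.
firings = Counter(rewritten)

for value, count in sorted(firings.items()):
    print(value, count)
```

Because grouping follows the replacement, the two distinct early dates end up as a single “Too Early” value with a count of two.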
Note that the regex in this example is only intended to show how distinct values may be merged into a single firing. If you actually wanted to remove dates before 2000, using a slice restriction on the <expression> would be better than merging the earlier dates with a regular expression:
<category class="DATES" foreach="1" weight="50" template="1" >
<expression type="date" capture="1" data="[2000-01-01:]"/>
</category>