DATA
- Last Updated: May 13, 2026
Specifies the text that a text rule looks for in a document. The text may use wildcards.
Applies to
Values
- “Text” - Text to search for a match in the document
- “*?#~^” - Text including wildcards
- “{expression}” - valid expression enclosed in {}
- “\X” - escapes X to avoid special interpretation of X
- Stemming is also turned on for the labels. (You can turn stemming off in the publishing templates or in the individual label settings.)
and/or
- The label is also handled by a variant generator (see publisher.configurations.variantgenerators).
Valid wildcards are:
| Wildcard character | Matches | Example |
|---|---|---|
| * | 0-any number of characters | “a*b” matches any word beginning with a and ending with b, including “ab” |
| * | 1-any if on its own or surrounded by whitespace | “*” matches any token that isn’t blank |
| ? | a single character | “a?b” matches any 3-letter word beginning with a and ending with b |
| ~ | a single numeral | “~~/~~” matches 2 digits, a forward slash and 2 digits, e.g. “12/34” |
| # | 0-any numerals | “#/#” matches any number of digits separated by a forward slash, e.g. “1234/” |
| # | 1-any if on its own or surrounded by whitespace | “#” matches “1” or “123” but not “A”; “A#” matches “A123”, “A1” and “A” since # is 0-any in this case |
| ^ | a single uppercase character | “^*” matches any word beginning with an uppercase character |
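The wildcard table can be read as a translation into regular expressions. The following is a minimal Python sketch of that reading, not Semaphore's implementation: the helper names `wildcard_to_regex` and `matches` are hypothetical, matching is assumed to be case-sensitive and applied to one document token at a time, and "uppercase" is assumed to mean ASCII A-Z.

```python
import re

# Hypothetical mapping that models the wildcard table; not a Semaphore API.
_WILDCARDS = {
    "*": ".*",      # 0-any number of characters
    "?": ".",       # a single character
    "~": "[0-9]",   # a single numeral
    "#": "[0-9]*",  # 0-any numerals
    "^": "[A-Z]",   # a single uppercase character (ASCII only: an assumption)
}


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate one data token into a regex for a single document token."""
    # On their own, * and # must match at least one character / numeral.
    if pattern == "*":
        return re.compile(r".+")
    if pattern == "#":
        return re.compile(r"[0-9]+")
    parts = [_WILDCARDS.get(ch, re.escape(ch)) for ch in pattern]
    return re.compile("".join(parts))


def matches(pattern: str, token: str) -> bool:
    """True if the whole token is covered by the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(token) is not None
```

With this sketch, `matches("a*b", "ab")` and `matches("A#", "A")` are true, while `matches("#", "A")` and `matches("a?b", "ab")` are false, in line with the examples in the table.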
Valid expressions are:
| Expression | Purpose | Notes |
|---|---|---|
| {field_start} | matches the start of a field | The field to match may be given by a (possibly inherited) FIELD attribute |
| {field_end} | matches the end of a field | Again may be given by field attribute |
| {document_start} | matches the start of the document | Is actually the start of the first body field in the document (Introduced Semaphore 4.0) |
| {document_end} | matches the end of the document | Is the end of the last body field in the document |
| {sentence_start} | matches the first word of every sentence | (Introduced in Semaphore 4.0) (*) |
| {sentence_end} | matches the last word of every sentence | (Introduced in Semaphore 4.0) (*) |
| {paragraph_start} | matches the first word of every paragraph | (Introduced in Semaphore 4.0) (*) |
| {paragraph_end} | matches the last word of every paragraph | (Introduced in Semaphore 4.0) (*) |
| {[POS_TAG]} | matches any token with the given Part of Speech tag | See the documentation on parts of speech for more detailed information |
| {[POS_TAG]:TEXT} | adds a POS tag restriction to a given text match | The text match may use wildcards (Introduced in Semaphore 4.0) |
| {} | empty set match | always matches anywhere in the document |
| {skip} | inserts a skip child rule | a skip is an optional match for a token at that point in the data |
| {skip:X} | inserts a skip child with count [X] | X must be numeric; {skip:2} is identical to {skip} {skip} |
Escape character \
\* matches ‘*’ in the document rather than acting as a wildcard
\\ matches ‘\’ in the document
\{ matches ‘{’ in the document, etc.
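The escape rule (a backslash removes the special meaning of the character that follows it) can be sketched as a single pass over the data string. This is an illustrative Python sketch only: `split_escapes` is a hypothetical helper, and leaving a trailing lone backslash unescaped is an assumption rather than documented behaviour.

```python
from typing import List, Tuple


def split_escapes(data: str) -> List[Tuple[str, bool]]:
    """Return (character, escaped) pairs for a data string.

    escaped=True means the character was preceded by a backslash and must be
    matched literally; escaped=False means it keeps any special meaning.
    """
    out: List[Tuple[str, bool]] = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == "\\" and i + 1 < len(data):
            # Backslash consumes itself and marks the next character literal.
            out.append((data[i + 1], True))
            i += 2
        else:
            # Ordinary character (or a trailing lone backslash: assumption).
            out.append((ch, False))
            i += 1
    return out
```

A matcher built on top of this would treat only the pairs with `escaped=False` as candidates for wildcard or expression handling.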
(*) NB the sentence/paragraph start/end matches behave differently from the field start/end matches. They select the first/last word in the sentence/paragraph rather than selecting a hidden token denoting the start/end of the field. This is because sentences and paragraphs are actually zones (determined by a zoner) in the document.
One of the significant differences between zones and fields is that zones are purely annotations on the document and may be arbitrarily complex and overlap, unlike fields, which are required to be in a tree structure (think XML). Fields can thus have an explicit start/stop token in the document which may be selected. For a zone there is no such token, so the first word is selected instead. This may make these expressions slightly more complex to use when trying to find a sequence at the start/end of a sentence/paragraph: {sentence_start} may already select the first word of the sequence, so a strict sequence may fail. Typically, to solve this issue, you use a loose sequence (one that is allowed to repeat terms) or an intersection rule.
Examples
<text data="more"/>
Finds any occurrences of the word more in the document
<text data="m?re"/>
Finds the words more/mare/mire etc
<text data="m*re"/>
Finds the words more/mare/mire/mitre etc
<text data="~~~~"/>
Finds any 4 digit numbers in the document
<text data="{field_start}Date"/>
Finds the word Date if it is at the start of a field (see FIELD for more information about fields).
<intersection>
<text data="{sentence_start}" />
<text data="Lorem"/>
</intersection>
Finds the word Lorem if it is at the start of a sentence. Note the use of the <intersection> rule because sentence_start is equivalent to the starting token in the sentence. The sentence_end, paragraph_start and paragraph_end expressions will work similarly.
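The reason an intersection works here can be sketched with sets of token positions. The Python below is only a model under stated assumptions: the zoner output is represented as hypothetical inclusive `(start, end)` token spans, and both helper functions are invented for illustration.

```python
from typing import List, Set, Tuple


def sentence_start_matches(sentences: List[Tuple[int, int]]) -> Set[int]:
    """Positions selected by {sentence_start}: the first word of each zone."""
    return {start for start, _end in sentences}


def text_matches(tokens: List[str], word: str) -> Set[int]:
    """Positions where a plain text rule for `word` would match."""
    return {i for i, tok in enumerate(tokens) if tok == word}


tokens = ["Lorem", "ipsum", "dolor", ".", "Sit", "Lorem", "amet", "."]
sentences = [(0, 3), (4, 7)]  # hypothetical zoner output

# Both rules must select the same token, so intersect the two position sets.
hits = sentence_start_matches(sentences) & text_matches(tokens, "Lorem")
print(sorted(hits))  # [0]
```

Only the first "Lorem" survives, because the second one (position 5) is not the first word of its sentence.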
<text data="a {skip} test" />
Expands to:
<phrase>
<text data="a" />
<skip count="1" />
<text data="test" />
</phrase>
Using the data attribute on rules other than <text>
In these cases the data attribute is parsed and, for each token found, a child text rule is appended to the rule.
<phrase data="This is a phrase" />
is expanded to
<phrase>
<text data="This"/>
<text data="is"/>
<text data="a"/>
<text data="phrase"/>
</phrase>
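The expansion can be sketched in a few lines of Python using the standard `xml.etree.ElementTree` module. This is a simplified model, not Semaphore's parser: `expand_data` is a hypothetical helper, tokens are assumed to be separated by whitespace, any existing children of the rule are ignored, and only the {skip} and {skip:X} expressions are recognised.

```python
import re
import xml.etree.ElementTree as ET

# Matches {skip} or {skip:N}; group 1 holds N when present.
_SKIP = re.compile(r"\{skip(?::(\d+))?\}")


def expand_data(rule: ET.Element) -> ET.Element:
    """Return a copy of `rule` with its data attribute expanded into children."""
    attrs = dict(rule.attrib)
    data = attrs.pop("data", "")
    expanded = ET.Element(rule.tag, attrs)
    for token in data.split():  # whitespace tokenisation is an assumption
        skip = _SKIP.fullmatch(token)
        if skip:
            # {skip} defaults to count 1; {skip:N} carries its own count.
            ET.SubElement(expanded, "skip", {"count": skip.group(1) or "1"})
        else:
            ET.SubElement(expanded, "text", {"data": token})
    return expanded


phrase = expand_data(ET.fromstring('<phrase data="This is a {skip:2} phrase" />'))
print(ET.tostring(phrase, encoding="unicode"))
```

The result is a `<phrase>` element with no data attribute and one child per token, with the skip expression turned into a `<skip>` child.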
NB Rules other than those listed above are also allowed to have the data attribute, which is handled in exactly the same manner. However, in most cases the resulting behaviour is not very useful.
So for example
<max data="A B" />
would be expanded to
<max>
<text data="A" />
<text data="B" />
</max>
However, since there is no weight specified on the child text rules, each will score either 100 or 0 depending on whether A or B is in the document. So applying a max rule will score 100 if A or B is in the document and 0 otherwise.
What this calculation does is much more obvious if you write it as (the equivalent)
<any data="A B" />
For this reason, these other rules are not listed in this documentation. If you do find a case where using one of the unlisted rules with the data attribute is useful, please update this documentation with the use case.
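The scoring argument for the max example can be checked with a small Python model. It assumes only what is stated above (an unweighted text rule scores 100 on a match and 0 otherwise, and max takes the highest child score); the function names are hypothetical.

```python
from typing import Iterable, Set


def text_score(term: str, document_tokens: Set[str]) -> int:
    """Unweighted text rule: 100 if the term is in the document, else 0."""
    return 100 if term in document_tokens else 0


def max_score(terms: Iterable[str], document_tokens: Set[str]) -> int:
    """Model of a max rule: the highest score among its child text rules."""
    return max(text_score(t, document_tokens) for t in terms)


print(max_score(["A", "B"], {"A", "C"}))  # 100, because A is present
print(max_score(["A", "B"], {"C"}))       # 0, neither A nor B is present
```

Because every child can only contribute 100 or 0, the maximum carries no more information than "did any child match", which is why the any form reads more clearly.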
Use of the empty set {} match
The main use of the empty set is to express optionality. This can normally be done using other constructions, but using {} is often clearer.
Consider
This is some text with a long phrase in it
Suppose we want to match “a long phrase” and also “a very long phrase”, but not “a too long phrase”, i.e. we want “very” to be matched optionally. The simplest, most obvious solution is:
<any>
<text data="a very long phrase" />
<text data="a long phrase" />
</any>
However in some cases it is more convenient to maintain a list of optional matches like “very” in “component” form rather than a simple list of valid phrases.
<any label="_:optional_terms" >
<text data="very" />
</any>
<phrase>
<text data="a" />
<link label="_:optional_terms" />
<text data="long" />
<text data="phrase" />
</phrase>
However, this will not match “a long phrase”, since there is no match for the any child. Using the empty set match here allows this construction to match:
<any label="_:optional_terms" >
<text data="very" />
<text data="{}" />
</any>
<phrase>
<text data="a" />
<link label="_:optional_terms" />
<text data="long" />
<text data="phrase" />
</phrase>
This matches “a very long phrase” and “a long phrase” but not “a too long phrase”, as required.
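The effect of the empty set match can be modelled as a phrase whose slots each hold a list of alternatives, where one alternative may match without consuming a token. The Python below is a sketch under that assumption: `EMPTY`, `phrase_matches_at` and `phrase_matches` are hypothetical names, and tokens are split on whitespace.

```python
from typing import Optional, Sequence

EMPTY = None  # stands in for the {} empty set match in this sketch


def phrase_matches_at(tokens: Sequence[str], pos: int,
                      slots: Sequence[Sequence[Optional[str]]]) -> bool:
    """True if the slots match consecutive tokens starting at `pos`."""
    if not slots:
        return True
    for alternative in slots[0]:
        if alternative is EMPTY:
            # {} matches here without consuming a token.
            if phrase_matches_at(tokens, pos, slots[1:]):
                return True
        elif pos < len(tokens) and tokens[pos] == alternative:
            if phrase_matches_at(tokens, pos + 1, slots[1:]):
                return True
    return False


def phrase_matches(text: str, slots) -> bool:
    """True if the phrase matches anywhere in the whitespace-split text."""
    tokens = text.split()
    return any(phrase_matches_at(tokens, i, slots)
               for i in range(len(tokens) + 1))


optional_terms = ["very", EMPTY]  # plays the role of _:optional_terms
slots = [["a"], optional_terms, ["long"], ["phrase"]]
```

Dropping `EMPTY` from `optional_terms` reproduces the earlier problem: the phrase then requires “very” and no longer matches “a long phrase”.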