PHRASE
- Last Updated: May 13, 2026
The phrase rule identifies a consecutive sequence of words.
By default a phrase ignores any punctuation within a sentence but cannot cross a sentence boundary.
Use the PUNCTUATION attribute to alter this behaviour.
Attributes
- _KEY
- DATA (introduced in Semaphore 3.5)
- LABEL
- FOREACH (introduced in Semaphore 3.5)
- NOT
- PUNCTUATION
- SCALE
- WEIGHT
Children
Since Semaphore 3.5 the DATA attribute is supported for this rule. The provided text is parsed and each token found is appended as a child TEXT rule. This was originally intended simply as a convenience to save typing. However, keeping the phrase in a single DATA attribute also avoids problems where the tokenisation of the text is not easy to determine by hand, so it has become the preferred approach.
When a DATA attribute is used for a PHRASE rule, the PHRASE and TEXT rules become interchangeable. Strictly speaking, a TEXT rule only matches a single word/token, but if its data contains multiple words/tokens it is automatically rewritten as a PHRASE rule with the appropriate children. The two rules are therefore effectively the same.
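As an illustration of the rewriting described above, the following fragments would be expected to behave identically. This is a sketch only: the phrase "United States" is an arbitrary choice, and the expanded form assumes the active tokenisation splits it into exactly two tokens, which may not hold for every language pack or mode.

```xml
<!-- A TEXT rule whose data holds more than one token ... -->
<text data="United States" />

<!-- ... is treated like the single-attribute PHRASE form ... -->
<phrase data="United States" />

<!-- ... which in turn is expanded into explicit children
     (assuming the text tokenises into two words) -->
<phrase>
  <text data="United" />
  <text data="States" />
</phrase>
```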
More complex behaviour can be requested by writing the child rules explicitly. This used to be heavily used to handle preclusion within rules. For example, with the document fragment
Joe Biden, the current Vice President of the United States, and Dr. Jill Biden visit Finland, Russia and Moldova
we could get a false positive match for “President of the United States”, because that term is contained within the longer term “Vice President of the United States”. This could be handled by being explicit about the child rules:
<phrase>
  <text data="Vice" not="1" />
  <text data="President" />
  <text data="of" />
  <text data="the" />
  <text data="United" />
  <text data="States" />
</phrase>
However, it has become far more common to use INTERSECTION rules to implement preclusion, since this avoids the difficulty of parsing the phrase at publish time. An alternative would be
<intersection>
  <phrase data="President of the United States" />
  <any not="1">
    <phrase data="Vice President of the United States" />
  </any>
</intersection>
Breaking a phrase into its component parts, whilst obvious in the above case, can be much trickier and in some cases depends on the specific details of the tokenisation used. Since the tokenisation used by CS may be altered at runtime (for example by using advanced language packs or standard mode), it is easier to avoid the complexity entirely by giving CS both phrases in full in the rules XML and letting it parse them as it sees fit, rather than attempting to second-guess the tokenisation.
Typically, writing the rules to say "find this phrase but ignore any occurrences where it is fully contained within this other phrase" is more reliable than attempting to break the phrase into its component words at publish time. The old behaviour is still fully available for backwards compatibility but is less often used, so PHRASE and TEXT rules are almost always interchangeable now.
Example
This example shows how a phrase rule with a data attribute may be written with child rules. In this case the tokenisation is clear but, as mentioned above, in other cases it is not. To avoid problems, the advice is to use the data attribute and let CS parse the phrase as it sees fit, rather than trying to guess the tokenisation when writing the rules or, when using rule templates, during publish.
<phrase data="this is a phrase" />
is equivalent to
<phrase>
  <text data="this" />
  <text data="is" />
  <text data="a" />
  <text data="phrase" />
</phrase>
Example 2
Using the data attribute and with non-default punctuation handling
<phrase punctuation="ignore_in_paragraph" data="across sentence" />
will fire with a document like
This is test data across. Sentence boundary.
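For contrast, the same phrase written without the PUNCTUATION attribute is sketched below. Assuming the default behaviour described at the top of this page (a phrase cannot cross a sentence boundary), this rule would not be expected to fire against the document above, because a full stop separates "across" from "Sentence".

```xml
<!-- Default punctuation handling: punctuation inside a sentence is
     ignored, but the phrase may not span two sentences -->
<phrase data="across sentence" />
```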
Example 3
The following document fragment
Jean-Claude Trichet announced today a rise of 1/2 point in interest rates.
In a separate intervention the governor of the European Central Bank announced that the institution
will keep a firm handle on inflation.
Evaluated against the following rulebase
<phrase weight="80">
  <text data="Jean-Claude" />
  <text data="Trichet" />
</phrase>
Will fire with a score of 80.
NB: in the above case the first <text> rule may be changed automatically by CS into a phrase rule (this depends on how hyphenated terms are parsed, which can change at run time).
Again, fewer problems occur if you use the equivalent
<phrase weight="80" data="Jean-Claude Trichet" />
where possible.