EXTRACT
- Last Updated: May 13, 2026
The EXTRACT rule defines a point in the ruletree where information extracted from the document is output by CS, in much the same way that the CATEGORY rule does for categorisation output.
If the score for the extraction rule is greater than the threshold then CS will add any extracted information to the output for classification.
Score calculation
The score is the combined score of its children.
Evidence calculation
The evidence is the union of all its children’s evidence.
Attribute information
- None required on the rule itself. However, at least some of the following attributes must be set on this rule's children for the extraction to make any sense:
- EXTRACT_EVIDENCE
- EXTRACT_GROUP
- EXTRACT_GROUP_KEY
- EXTRACT_NAME
- EXTRACT_REGEX
- EXTRACT_TAGS
Children
Any rule other than those restricted to a specific parent
Essentially the EXTRACT rule and its children define the complete set of rules for the particular extraction.
The EXTRACT rule is generally used for clarity of expression, but from the perspective of CS it is simply a COMBINE rule with the EXTRACT attribute set. This means that if the EXTRACT rule has multiple direct children, it aggregates their scores in the same way as a bare COMBINE rule would.
What does it do?
The extract rule (or the equivalent EXTRACT attribute) defines a point in the ruletree calculation (known as the extraction site) where extracted data may be added to the output from CS if the score at this point is above the threshold.
Together with its child rules it defines what should be extracted from the document.
Why this syntax?
The EXTRACT rule and its related attributes were introduced with Semaphore 4.0.
It is intended as a replacement for the now deprecated TEMPLATE attribute on a CATEGORY rule.
With the old syntax a category could be marked as a template - in this case the name for the category was to be extracted from the document rather than given by the name attribute. The rule which defined what exactly should be considered for extraction was marked with a CAPTURE attribute.
This limited the extraction to a single class (given as an attribute on the category rule).
The new syntax allows the equivalent of the class to be defined at the point (site) where the CAPTURE attribute used to be placed. This allows multiple classes of extraction to be handled, and more importantly grouped, within a single extraction tree.
Huh? - just show me
Possibly the easiest way of explaining extraction is to start with a very simple example and then to increase the complexity.
Consider finding all people mentioned in a document - we could do that with the following ruletree
<extract>
<expression type="PERSON" extract_name="PERSON_NAME"/>
</extract>
CS will run the appropriate zoner to identify “PERSON” zones in the document. This ruletree simply asks for every part of the document that has been zoned as a PERSON's name to be extracted under the name “PERSON_NAME”, with the text found in the document as the value.
So applied to
Jean-Claude Trichet announced today a rise of 1/2 point in interest rates.
In a separate intervention the governor of the European Central Bank announced that the
institution will keep a firm handle on inflation.
Will result in the following being part of the CS output
...
<META name="PERSON_NAME" value="Jean-Claude Trichet" score="1.00" />
...
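A consumer of this output typically wants the extracted values as plain data. The following is a minimal Python sketch of reading them back, not part of Semaphore itself. It assumes the META elements sit somewhere inside a well-formed XML response; the `<RESULT>` wrapper and the `extracted_values` helper are hypothetical and exist only so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# The META line from the example output, wrapped in a hypothetical <RESULT>
# root element so that the fragment is a well-formed XML document.
cs_output = """
<RESULT>
  <META name="PERSON_NAME" value="Jean-Claude Trichet" score="1.00" />
</RESULT>
""".strip()


def extracted_values(xml_text, name):
    """Return (value, score) pairs for every META element with the given name."""
    root = ET.fromstring(xml_text)
    return [
        (meta.get("value"), float(meta.get("score")))
        for meta in root.iter("META")
        if meta.get("name") == name
    ]


people = extracted_values(cs_output, "PERSON_NAME")
print(people)
```

Because the lookup is keyed on the `name` attribute, the same helper works for any `extract_name` used in a ruletree.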
As mentioned above, the EXTRACT rule is equivalent to a COMBINE rule with the EXTRACT attribute set, so this could be written as
<combine extract="1">
<expression type="PERSON" extract_name="PERSON_NAME"/>
</combine>
And in this case since the score and evidence of a COMBINE rule with a single child is exactly the score and evidence of that child we could rewrite this as
<expression extract="1" type="PERSON" extract_name="PERSON_NAME"/>
In general though having the extraction site at the same point as the naming site in the ruletree isn’t too useful since often rules in-between are providing the filtering of the evidence we require.
It is rare that you want every person's name extracted. Often you are only interested in names which occur in a particular context, so we can flesh out the above with some extra rules:
<extract>
<sentence>
<any>
<text data="works for" />
<text data="employed by" />
</any>
<text data="Smartlogic Ltd" />
<expression type="PERSON" extract_name="SMARTLOGIC_EMPLOYEE" />
</sentence>
</extract>
We end up finding only those people whose names are in the same sentence as “Smartlogic Ltd” and either “works for” or “employed by”.
So running this on
Joe Bloggs works for Smartlogic Ltd alongside his brother, Fred Bloggs. Joe's wife, Wilma Bloggs, works for a competitor.
Will extract
...
<META name="SMARTLOGIC_EMPLOYEE" value="Fred Bloggs" score="1.00"/>
<META name="SMARTLOGIC_EMPLOYEE" value="Joe Bloggs" score="1.00"/>
...
So we have ended up excluding “Wilma Bloggs” who would have been identified as a person in the document.
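The filtering logic of this ruletree can be imitated in a few lines of Python. This is only a toy illustration of the idea, not how CS evaluates rules: scores and evidence are ignored, the sentence splitter is naive, and the `KNOWN_PEOPLE` list is a hypothetical stand-in for the PERSON zoner.

```python
import re

# Hypothetical stand-in for the PERSON zoner: a fixed list of known names.
KNOWN_PEOPLE = ["Joe Bloggs", "Fred Bloggs", "Wilma Bloggs"]
TRIGGERS = ["works for", "employed by"]  # the children of the <any> rule
COMPANY = "Smartlogic Ltd"               # the fixed <text> rule


def smartlogic_employees(document):
    """Collect people named in a sentence that also has a trigger and the company."""
    found = set()
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", document):
        has_trigger = any(trigger in sentence for trigger in TRIGGERS)
        if has_trigger and COMPANY in sentence:
            found.update(p for p in KNOWN_PEOPLE if p in sentence)
    return sorted(found)


doc = (
    "Joe Bloggs works for Smartlogic Ltd alongside his brother, Fred Bloggs. "
    "Joe's wife, Wilma Bloggs, works for a competitor."
)
employees = smartlogic_employees(doc)
print(employees)
```

The second sentence contains “works for” but not “Smartlogic Ltd”, so nothing from it is collected, which mirrors the exclusion of “Wilma Bloggs” above.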
Being so specific about the company name may or may not be what is required. It would be perfectly possible to replace the “Smartlogic Ltd” text rule with an expression of type “ORGANIZATION”, which will also have been zoned in the document, and extract it under the name “COMPANY”, so we have
<extract>
<sentence>
<any>
<text data="works for" />
<text data="employed by" />
</any>
<expression type="ORGANIZATION" extract_name="COMPANY" />
<expression type="PERSON" extract_name="EMPLOYEE" />
</sentence>
</extract>
Which applied to the same document gives:
...
<META name="COMPANY" value="Smartlogic Ltd" score="1.00"/>
<META name="EMPLOYEE" value="Fred Bloggs" score="1.00"/>
<META name="EMPLOYEE" value="Joe Bloggs" score="1.00"/>
...
Which works just fine on this simple document snippet - however if we make the document a little more complex
Joe Bloggs works for Smartlogic Ltd alongside his brother, Fred Bloggs. Joe's wife, Wilma Bloggs, works for Acme Widgets Inc.
We get
...
<META name="COMPANY" value="Acme Widgets Inc" score="1.00"/>
<META name="COMPANY" value="Smartlogic Ltd" score="1.00"/>
<META name="EMPLOYEE" value="Fred Bloggs" score="1.00"/>
<META name="EMPLOYEE" value="Joe" score="1.00"/>
<META name="EMPLOYEE" value="Joe Bloggs" score="1.00"/>
<META name="EMPLOYEE" value="Wilma Bloggs" score="1.00"/>
...
At this point the output no longer tells us which person works for which company.
To retain this information we can adjust the ruletree so that the extracted values are grouped by the sentence containing the company name:
<extract>
<sentence extract_group="COMPANY" >
<any>
<text data="works for" />
<text data="employed by" />
</any>
<expression type="ORGANIZATION" extract_name="COMPANY" />
<expression type="PERSON" extract_name="EMPLOYEE" />
</sentence>
</extract>
So we get the following results
...
<META name="COMPANY" value="" score="1.00">
<META name="COMPANY" value="Smartlogic Ltd" score="1.00"/>
<META name="EMPLOYEE" value="Fred Bloggs" score="1.00"/>
<META name="EMPLOYEE" value="Joe Bloggs" score="1.00"/>
</META>
<META name="COMPANY" value="" score="1.00">
<META name="COMPANY" value="Acme Widgets Inc" score="1.00"/>
<META name="EMPLOYEE" value="Joe" score="1.00"/>
<META name="EMPLOYEE" value="Wilma Bloggs" score="1.00"/>
</META>
...
This is much more useful as output, since it retains the important grouping information established in the document.
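To show why the grouped form is easier to consume, here is a minimal Python sketch that turns the nested META output into a company-to-employees mapping. As before, the `<RESULT>` wrapper and the `employees_by_company` helper are hypothetical, added only so that the fragment parses as a standalone document; the META elements are copied from the results above.

```python
import xml.etree.ElementTree as ET

# Grouped output from the example, wrapped in a hypothetical <RESULT> root.
cs_output = """
<RESULT>
  <META name="COMPANY" value="" score="1.00">
    <META name="COMPANY" value="Smartlogic Ltd" score="1.00"/>
    <META name="EMPLOYEE" value="Fred Bloggs" score="1.00"/>
    <META name="EMPLOYEE" value="Joe Bloggs" score="1.00"/>
  </META>
  <META name="COMPANY" value="" score="1.00">
    <META name="COMPANY" value="Acme Widgets Inc" score="1.00"/>
    <META name="EMPLOYEE" value="Joe" score="1.00"/>
    <META name="EMPLOYEE" value="Wilma Bloggs" score="1.00"/>
  </META>
</RESULT>
""".strip()


def employees_by_company(xml_text):
    """Map each company to the employees extracted within the same group."""
    root = ET.fromstring(xml_text)
    result = {}
    # Direct children of the root are the group wrappers (empty value).
    for group in root.findall("META"):
        if group.get("name") != "COMPANY":
            continue
        members = group.findall("META")  # values nested inside this group
        companies = [m.get("value") for m in members if m.get("name") == "COMPANY"]
        employees = [m.get("value") for m in members if m.get("name") == "EMPLOYEE"]
        for company in companies:
            result.setdefault(company, []).extend(employees)
    return result


grouped = employees_by_company(cs_output)
print(grouped)
```

The partial name “Joe” is carried through under “Acme Widgets Inc” exactly as it appears in the output; removing that kind of noise is a job for the ruletree rather than for the consumer.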
Making this example more like a real-world extraction would require several further steps, which would start to obscure the syntax that this document is trying to show.
For further information see Fact Extraction Use Cases, which has a more fully worked use case for employee / company extraction.