EXTRACT_USING attributes
- Last Updated: May 13, 2026
Several attributes that modify the text extracted by EXTRACT_EVIDENCE.
Applies to
There is no restriction on which rule these attributes may be applied to, other than that the rule must have an EXTRACT_EVIDENCE (or equivalent) attribute.
Attribute Name
- extract_using_document_order - sorts extractions by position in the document rather than in lexical (alphabetical) order
- extract_using_stemmed_form - replaces extracted tokens from the document with their stem or lemma form
- extract_using_postags - replaces extracted tokens with their determined Part of Speech
- extract_using_normalised_zones - replaces any zone fully included in extraction with the normalised value for that zone
- extract_using_sibling_extraction_names - replaces any part of the extraction with the name of any overlapping sibling extraction
- extract_using_sibling_extraction_values - replaces any part of the extraction with the value of any overlapping sibling extraction
Values
- “1” - Apply the given attribute
- “0” - Don’t apply the given attribute (default)
Description
Each of these extract_using attributes modifies the text that is extracted from the document.
These attributes apply only to EXTRACT_EVIDENCE extractions. This can be confusing when the EXTRACT_NAME attribute is used: EXTRACT_NAME behaves as EXTRACT_EVIDENCE on all rules except EXPRESSION rules, where it behaves as EXTRACT_NORMALISED, and in that case the extract_using attributes do not apply.
NB: EXTRACT_REGEX could be considered an extract_using attribute, but it is not restricted to EXTRACT_EVIDENCE: the regex is applied to any extraction value.
Details
extract_using_document_order
Extracts evidence using the document position rather than lexical order.
<sentence extract="1" extract_group="sentence">
<any extract_name="words_alphabetical" >
<text data="*" extract_name="words" extract_using_document_order="1" />
</any>
</sentence>
This groups by sentence and extracts all the words in each sentence twice: once in document order and once in the default alphabetical order. Note that with document order a repeated word produces multiple extractions, because each occurrence has a different position in the document.
curl --form-string sandbox="<sentence extract='1' extract_group='sentence'><any extract_name='words_alphabetical'><text data='*' extract_name='words' extract_using_document_order='1' /></any></sentence>" -F body="This is the sentence in order. This is more with more repeated." localhost:5058
Assuming a CS is running on localhost:5058, this gives:
<META name="sentence" value="" score="1.00">
<META name="words" value="This" score="1.00"/>
<META name="words" value="is" score="1.00"/>
<META name="words" value="the" score="1.00"/>
<META name="words" value="sentence" score="1.00"/>
<META name="words" value="in" score="1.00"/>
<META name="words" value="order" score="1.00"/>
<META name="words" value="." score="1.00"/>
<META name="words_alphabetical" value="." score="1.00"/>
<META name="words_alphabetical" value="This" score="1.00"/>
<META name="words_alphabetical" value="in" score="1.00"/>
<META name="words_alphabetical" value="is" score="1.00"/>
<META name="words_alphabetical" value="order" score="1.00"/>
<META name="words_alphabetical" value="sentence" score="1.00"/>
<META name="words_alphabetical" value="the" score="1.00"/>
</META>
<META name="sentence" value="" score="1.00">
<META name="words" value="This" score="1.00"/>
<META name="words" value="is" score="1.00"/>
<META name="words" value="more" score="1.00"/>
<META name="words" value="with" score="1.00"/>
<META name="words" value="more" score="1.00"/>
<META name="words" value="repeated" score="1.00"/>
<META name="words" value="." score="1.00"/>
<META name="words_alphabetical" value="." score="1.00"/>
<META name="words_alphabetical" value="This" score="1.00"/>
<META name="words_alphabetical" value="is" score="1.00"/>
<META name="words_alphabetical" value="more" score="1.00"/>
<META name="words_alphabetical" value="repeated" score="1.00"/>
<META name="words_alphabetical" value="with" score="1.00"/>
</META>
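The difference between the two orderings can be mimicked outside CS. The following is a minimal Python sketch, not CS code. It assumes that the default lexical order is a plain code-point sort with duplicates collapsed, which is consistent with the output above (punctuation first, then capitalised words, then lowercase words). The token list is typed by hand rather than produced by a real tokenizer.

```python
# Illustrative only: tokens are hand-typed from the second example sentence.
tokens = ["This", "is", "more", "with", "more", "repeated", "."]

# Document order: one entry per occurrence, so "more" appears twice.
document_order = list(tokens)

# Assumed lexical order: unique values sorted by code point.
lexical_order = sorted(set(tokens))

print(document_order)
print(lexical_order)
```

Under these assumptions the document-order list keeps both occurrences of "more", while the lexical list contains it once.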
extract_using_stemmed_form
Replaces extracted tokens from the document with their stem or lemma form
<text data="I was here" extract_name="test" extract_using_stemmed_form="1" extract="1" />
This replaces “was” in the extraction with its lemmatised form:
<META name="test" value="I be here" score="1.00" />
This may be useful, particularly for verbs, for collating multiple uses of the same verb.
curl --form-string sandbox="<text data='I {V}' extract_name='test' extract_using_stemmed_form='1' extract='1' extract_assign_foreach_score='test' weight='0.5'/>" -F body="I was here but now I am there" localhost:5058
In this case (assuming you have a CS running on localhost with language packs installed) you would get a single extraction of “I be” from the two occurrences:
<META name="test" value="I be" score="0.75" />
...
Note that the score is 0.75: the foreach score assignment has been applied twice (once per occurrence) to a base weight of 0.5.
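The page does not spell out how the foreach score accumulates. One combination that reproduces the 0.75 figure is a probabilistic OR, in which each occurrence closes a fraction `weight` of the remaining gap to 1.0. The sketch below is an assumption made for illustration (the function name `foreach_score` is hypothetical), not a statement of how CS actually computes scores.

```python
def foreach_score(weight: float, occurrences: int) -> float:
    """Assumed combination, equivalent to 1 - (1 - weight) ** occurrences."""
    score = 0.0
    for _ in range(occurrences):
        # Each hit adds `weight` of whatever is left between score and 1.0.
        score += weight * (1.0 - score)
    return score


print(f"{foreach_score(0.5, 1):.2f}")  # 0.50
print(f"{foreach_score(0.5, 2):.2f}")  # 0.75
```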
extract_using_postags
Very similar to extract_using_stemmed_form, except that it replaces each token in the extraction with its determined PoS tag.
curl --form-string sandbox="<text data='I {V}' extract_name='test' extract_using_postags='1' extract='1' extract_assign_foreach_score='test' weight='0.5'/>" -F body="I was here but now I am there" localhost:5058
Since the PoS tags for “was” and “am” differ (unlike their lemmatised forms), these produce separate extractions:
<META name="test" value="{PRONPERS}{VBPAST}" score="0.50" />
<META name="test" value="{PRONPERS}{VBPRES}" score="0.50" />
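Why the lemmatised form collapses the two matches while the PoS form keeps them apart can be shown by grouping on the rewritten value. This is a rough Python sketch: the `LEMMA` and `POS` dictionaries are hypothetical stand-ins for what a language pack would determine, and the two matches are typed in by hand rather than found by a rule engine.

```python
from collections import Counter

# Hypothetical lookup tables standing in for the language pack.
LEMMA = {"I": "I", "was": "be", "am": "be"}
POS = {"I": "PRONPERS", "was": "VBPAST", "am": "VBPRES"}

# The two matches of 'I {V}' in "I was here but now I am there".
matches = [["I", "was"], ["I", "am"]]

# Group matches by the value each attribute would produce.
by_lemma = Counter(" ".join(LEMMA[t] for t in m) for m in matches)
by_pos = Counter("".join("{" + POS[t] + "}" for t in m) for m in matches)

print(dict(by_lemma))
print(dict(by_pos))
```

Both matches share the key "I be", so they collate into one extraction, whereas the PoS keys differ and stay separate.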
extract_using_normalised_zones
Replaces any zone fully included in extraction with the normalised value for that zone
This allows a larger area of extraction to use normalised zone values when applicable rather than just using extract_normalised_form on an expression.
curl --form-string sandbox="<sentence extract_name='sentence' extract='1' ><expression type='DATE' extract_name='DATE' /> </sentence>" -F body="I was here on the 1st September 1982" localhost:5058
Assuming you have a CS with discovery of dates enabled on localhost, this extracts both the containing sentence and the date itself. By default, extract_name on an expression extracts the normalised form, giving:
<META name="DATE" value="1982-09-01" score="1.00" />
<META name="sentence" value="I was here on the 1st September 1982" score="1.00" />
By using extract_using_normalised_zones we can replace the date in the sentence with its ISO 8601 normalised form.
curl --form-string sandbox="<sentence extract_name='sentence' extract='1' extract_using_normalised_zones='1' ><expression type='DATE' extract_name='DATE' /> </sentence>" -F body="I was here on the 1st September 1982" localhost:5058
Giving
<META name="DATE" value="1982-09-01" score="1.00" />
<META name="sentence" value="I was here on the 1982-09-01 " score="1.00" />
...
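The "fully included" condition can be pictured as a span containment check. The sketch below is a simplified Python model, not the CS implementation: the function name, the tuple layout and the character offsets are all assumptions, zones are assumed not to overlap each other, and whitespace handling in real output may differ.

```python
def apply_normalised_zones(document, extraction, zones):
    """Return the extracted text with fully contained zones normalised.

    extraction: (start, end) offsets into document, end exclusive.
    zones: list of (start, end, normalised_value) in document offsets.
    """
    ext_start, ext_end = extraction
    out = document[ext_start:ext_end]
    # Replace right to left so earlier offsets stay valid.
    for z_start, z_end, normalised in sorted(zones, reverse=True):
        if ext_start <= z_start and z_end <= ext_end:  # fully included
            s, e = z_start - ext_start, z_end - ext_start
            out = out[:s] + normalised + out[e:]
    return out


document = "I was here on the 1st September 1982"
surface = "1st September 1982"
z_start = document.index(surface)
zones = [(z_start, z_start + len(surface), "1982-09-01")]

whole = apply_normalised_zones(document, (0, len(document)), zones)
partial = apply_normalised_zones(document, (0, z_start + 3), zones)
print(whole)
print(partial)
```

In this model an extraction that only partly covers the date (the second call) is left unchanged.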
extract_using_sibling_extraction_names
Replaces any part of the extraction with the name of any overlapping sibling extraction
As with normalised zone values, you may replace overlapping extractions. The replacement uses sibling extractions rather than only descendant extractions, which lets us choose whether the replacement extraction is itself extracted (by definition, any descendant extraction will itself be extracted at the extract=“1” rule site).
curl --form-string sandbox="<sentence extract_name='sentence' extract='1' extract_using_sibling_extraction_names='1' > <expression type='DATE' extract_name='DATE' /></sentence>" -F body="I was here on the 1st September 1989" localhost:5058
Here we are using the descendant extraction called “DATE”, so we get:
<META name="DATE" value="1989-09-01" score="1.00" />
<META name="sentence" value="I was here on the [DATE]" score="1.00"/>
This attribute may seem to be of little use, but the main envisaged use case is redacting information in extractions (alongside using the REDACT rule to redact information in the feedback of the document itself). In this case we tag the date as REDACTED. Note that this makes it a sibling extraction: there is no EXTRACT_TAGS in the extraction tree, so the tagged phrase range simply travels up the tree and is treated as a sibling of our sentence extraction.
curl --form-string sandbox="<sentence extract_name='sentence' extract='1' extract_using_sibling_extraction_names='1' > <expression type='DATE' tag='REDACTED' /></sentence>" -F body="I was here on the 1st September 1989" localhost:5058
Giving us
<META name="sentence" value="I was here on the [REDACTED]" score="1.00"/>
This omits the extraction of the DATE itself (complete with its value), which occurred with the descendant extraction.
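The redaction effect amounts to swapping any overlapping span for a bracketed name. The following Python sketch is only a model of that idea: the function name and the offset-based span representation are hypothetical, sibling spans are assumed not to overlap each other, and nothing here reflects CS internals.

```python
def replace_with_names(text, siblings):
    """siblings: list of (start, end, name) offsets into text, end exclusive."""
    out = text
    # Replace right to left so earlier offsets stay valid.
    for start, end, name in sorted(siblings, reverse=True):
        # Overlap test: the sibling covers at least one character of text.
        if start < len(text) and end > 0:
            s, e = max(start, 0), min(end, len(text))
            out = out[:s] + "[" + name + "]" + out[e:]
    return out


sentence = "I was here on the 1st September 1989"
surface = "1st September 1989"
start = sentence.index(surface)
redacted = replace_with_names(sentence, [(start, start + len(surface), "REDACTED")])
print(redacted)  # I was here on the [REDACTED]
```

The original date text no longer appears anywhere in the resulting value.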
extract_using_sibling_extraction_values
Replaces any part of the extraction with the value of any overlapping sibling extraction
This is very similar to replacing with the name of an overlapping sibling extraction, except that you replace with the value of the sibling extraction rather than its name.
Using both the name and the value for redaction doesn’t really make sense (the value is included, so it isn’t redacted). However, it is useful for marking up the value as a particular type, which lets systems such as Text Analytics analyse documents by type, e.g. “Find all people in documents which are near a date”.
curl --form-string sandbox="<sentence extract_name='sentence' extract='1' extract_using_sibling_extraction_values='1' extract_using_sibling_extraction_names='1' > <expression type='DATE' extract_name='DATE' /><expression type='PERSON' extract_name='PERSON'/></sentence>" -F body="Mr Joe Bloggs was here on the 1st September 1989" localhost:5058
Here the DATE and the PERSON in the extracted sentence are replaced with [DATE{value}] and [PERSON{value}]. Without extract_using_sibling_extraction_values you would just get the generic [DATE] and [PERSON]. Whether you want the actual value in the replacement depends on the use case.
<META name="DATE" value="1989-09-01" score="1.00"/>
<META name="PERSON" value="Joe Bloggs" score="1.00"/>
<META name="sentence" value="Mr [PERSON{Joe Bloggs}] was here on the [DATE{1989-09-01}]" score="1.00"/>
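The [NAME{value}] markup can be reproduced with simple string replacement. This is an illustrative Python sketch under stated assumptions: the function `mark_up` and the (name, surface text, value) triples are hypothetical, each surface string is assumed to occur once in the sentence, and the code does not claim to mirror how CS builds the value.

```python
def mark_up(text, extractions, with_values=True):
    """extractions: list of (name, surface_text, value) triples."""
    # Locate each surface string, then replace right to left so that
    # earlier offsets stay valid.
    spans = sorted(
        ((text.index(surface), name, surface, value)
         for name, surface, value in extractions),
        reverse=True,
    )
    out = text
    for start, name, surface, value in spans:
        label = f"[{name}{{{value}}}]" if with_values else f"[{name}]"
        out = out[:start] + label + out[start + len(surface):]
    return out


sentence = "Mr Joe Bloggs was here on the 1st September 1989"
extractions = [
    ("DATE", "1st September 1989", "1989-09-01"),
    ("PERSON", "Joe Bloggs", "Joe Bloggs"),
]
typed = mark_up(sentence, extractions)
generic = mark_up(sentence, extractions, with_values=False)
print(typed)
print(generic)
```

The `with_values=False` call corresponds to using names only, giving the generic [PERSON] and [DATE] placeholders.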