EXTRACT_USING attributes

Last Updated: May 13, 2026

A set of attributes that modify the text extracted by EXTRACT_EVIDENCE.

Applies to

These attributes may be applied to any rule, provided the rule also has an EXTRACT_EVIDENCE (or equivalent) attribute.

Attribute Name

extract_using_document_order - sorts extractions by position in the document rather than in lexical (alphabetical) order
extract_using_stemmed_form - replaces extracted tokens from the document with their stem or lemma form
extract_using_postags - replaces extracted tokens with their determined part-of-speech (PoS) tag
extract_using_normalised_zones - replaces any zone fully included in extraction with the normalised value for that zone
extract_using_sibling_extraction_names - replaces any part of the extraction with the name of any overlapping sibling extraction
extract_using_sibling_extraction_values - replaces any part of the extraction with the value of any overlapping sibling extraction

Values

“1” - Apply the given attribute
“0” - Don’t apply the given attribute (default)

Description

Each of these extract_using attributes modifies the text that is extracted from the document.

These extract_using attributes apply only to EXTRACT_EVIDENCE extractions. This can be confusing when the EXTRACT_NAME attribute is used: EXTRACT_NAME behaves as EXTRACT_EVIDENCE on every rule except EXPRESSION rules, where it behaves as EXTRACT_NORMALISED, and in that case the extract_using attributes do not apply.

NB: EXTRACT_REGEX could be considered an extract_using attribute. However, it is not restricted to EXTRACT_EVIDENCE: the regex is applied to any extraction value.

Details

extract_using_document_order

Extracts evidence using the document position rather than lexical order.

<sentence extract="1" extract_group="sentence">
<any extract_name="words_alphabetical" >
<text data="*" extract_name="words" extract_using_document_order="1" />
</any>
</sentence>

This groups by sentence and extracts every word in each sentence twice: once in document order (words) and once in the default alphabetical order (words_alphabetical). Note that document order produces multiple extractions when a word is repeated, because each occurrence has a different position in the document.

curl --form-string sandbox="<sentence extract='1' extract_group='sentence'><any extract_name='words_alphabetical'><text data='*' extract_name='words' extract_using_document_order='1' /></any></sentence>" -F body="This is the sentence in order. This is more with more repeated." localhost:5058

Assuming a Classification Server (CS) is running on localhost:5058, this gives:

<META name="sentence" value="" score="1.00">
    <META name="words" value="This" score="1.00"/>
    <META name="words" value="is" score="1.00"/>
    <META name="words" value="the" score="1.00"/>
    <META name="words" value="sentence" score="1.00"/>
    <META name="words" value="in" score="1.00"/>
    <META name="words" value="order" score="1.00"/>
    <META name="words" value="." score="1.00"/>
    <META name="words_alphabetical" value="." score="1.00"/>
    <META name="words_alphabetical" value="This" score="1.00"/>
    <META name="words_alphabetical" value="in" score="1.00"/>
    <META name="words_alphabetical" value="is" score="1.00"/>
    <META name="words_alphabetical" value="order" score="1.00"/>
    <META name="words_alphabetical" value="sentence" score="1.00"/>
    <META name="words_alphabetical" value="the" score="1.00"/>
</META>
<META name="sentence" value="" score="1.00">
    <META name="words" value="This" score="1.00"/>
    <META name="words" value="is" score="1.00"/>
    <META name="words" value="more" score="1.00"/>
    <META name="words" value="with" score="1.00"/>
    <META name="words" value="more" score="1.00"/>
    <META name="words" value="repeated" score="1.00"/>
    <META name="words" value="." score="1.00"/>
    <META name="words_alphabetical" value="." score="1.00"/>
    <META name="words_alphabetical" value="This" score="1.00"/>
    <META name="words_alphabetical" value="is" score="1.00"/>
    <META name="words_alphabetical" value="more" score="1.00"/>
    <META name="words_alphabetical" value="repeated" score="1.00"/>
    <META name="words_alphabetical" value="with" score="1.00"/>
</META>
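
The difference between the two orderings in the output above can be sketched outside the server. The following is a minimal Python illustration, not the Classification Server's implementation: the tokenizer is a simplifying assumption, and the function names are hypothetical. It assumes that the default ordering collapses identical values and sorts them by code point, which is consistent with "." and "This" appearing before the lowercase words.

```python
import re

def tokenize(sentence: str) -> list[str]:
    # Words and single punctuation marks; a simplification,
    # not the server's actual tokenizer.
    return re.findall(r"\w+|[^\w\s]", sentence)

def document_order(tokens: list[str]) -> list[str]:
    # Every occurrence is kept, because each one has its own position.
    return list(tokens)

def lexical_order(tokens: list[str]) -> list[str]:
    # Identical values collapse into one, then sort by code point
    # (so "." and uppercase letters come before lowercase letters).
    return sorted(set(tokens))

tokens = tokenize("This is more with more repeated.")
print(document_order(tokens))
print(lexical_order(tokens))
```

In this sketch "more" appears twice in the document-order list and once in the lexical list, mirroring the second sentence of the output above.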

extract_using_stemmed_form

Replaces extracted tokens from the document with their stem or lemma form

   <text data="I was here" extract_name="test" extract_using_stemmed_form="1" extract="1" />

This replaces “was” in the extraction with its lemmatised form:

  <META name="test" value="I be here" score="1.00" />

This may be useful, particularly for verbs, for collating multiple uses of the same verb.

curl --form-string sandbox="<text data='I {V}' extract_name='test' extract_using_stemmed_form='1' extract='1' extract_assign_foreach_score='test' weight='0.5'/>" -F body="I was here but now I am there" localhost:5058

In this case (assuming you have a CS with language packs running on localhost) you get a single extraction of “I be” from the two occurrences:

 <META name="test" value="I be" score="0.75" />
 ...

Note that the score is 0.75: the foreach score is applied twice (once per occurrence) to a base weight of 0.5.
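
This page does not spell out the arithmetic behind 0.75. One combination that reproduces it is a probabilistic OR, where each occurrence removes a share of the remaining distance to 1.0. The sketch below is an assumption made for illustration only, with a hypothetical function name; it is not confirmed as the server's scoring formula.

```python
def combine(scores: list[float]) -> float:
    # ASSUMPTION: scores accumulate as 1 - product(1 - s).
    # This reproduces the 0.75 shown above but is not taken from
    # the Classification Server documentation.
    remaining = 1.0
    for s in scores:
        remaining *= (1.0 - s)
    return 1.0 - remaining

# Two occurrences ("I was", "I am"), each contributing the base weight 0.5.
print(combine([0.5, 0.5]))
```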

extract_using_postags

Very similar to extract_using_stemmed_form, except that it replaces each token in the extraction with its determined PoS tag.

curl --form-string sandbox="<text data='I {V}' extract_name='test' extract_using_postags='1' extract='1' extract_assign_foreach_score='test' weight='0.5'/>" -F body="I was here but now I am there" localhost:5058

Since the PoS tags for “was” and “am” differ (unlike their lemmatised forms), the two occurrences produce separate extractions:

 <META name="test" value="{PRONPERS}{VBPAST}" score="0.50" />
 <META name="test" value="{PRONPERS}{VBPRES}" score="0.50" />
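
The reason the stemmed form yields one extraction while the PoS tags yield two can be illustrated by grouping the two matches by their rendered value. This is a minimal sketch with hand-written annotations: the lemmas and tags are copied from the outputs on this page, while the data layout and function names are hypothetical and do not represent the server's internals.

```python
from collections import defaultdict

# Hypothetical annotations for the two matches of 'I {V}':
# each token is (surface form, lemma, PoS tag).
MATCHES = [
    [("I", "I", "PRONPERS"), ("was", "be", "VBPAST")],
    [("I", "I", "PRONPERS"), ("am", "be", "VBPRES")],
]

def stemmed_value(match) -> str:
    # Lemmas joined with spaces, as in "I be".
    return " ".join(lemma for _, lemma, _ in match)

def postag_value(match) -> str:
    # Tags wrapped in braces, as in "{PRONPERS}{VBPAST}".
    return "".join("{" + tag + "}" for _, _, tag in match)

def group(matches, render) -> dict[str, int]:
    # Count how many matches share each rendered extraction value.
    counts = defaultdict(int)
    for match in matches:
        counts[render(match)] += 1
    return dict(counts)

print(group(MATCHES, stemmed_value))
print(group(MATCHES, postag_value))
```

Both matches render to the same stemmed value, so they fall into one group, whereas the tag rendering keeps them apart.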

extract_using_normalised_zones

Replaces any zone fully included in extraction with the normalised value for that zone

This allows a larger area of extraction to use normalised zone values when applicable rather than just using extract_normalised_form on an expression.

curl --form-string sandbox="<sentence extract_name='sentence' extract='1' ><expression type='DATE' extract_name='DATE' /> </sentence>" -F body="I was here on the 1st September 1982" localhost:5058

Assuming you have a CS with discovery of dates enabled on localhost, this extracts both the containing sentence and the date itself. By default, extract_name on an expression extracts the normalised form, giving:

 <META name="DATE" value="1982-09-01" score="1.00" />
 <META name="sentence" value="I was here on the 1st September 1982" score="1.00" />

By using extract_using_normalised_zones we can replace the date in the sentence with its ISO 8601 normalised form.

curl --form-string sandbox="<sentence extract_name='sentence' extract='1' extract_using_normalised_zones='1' ><expression type='DATE' extract_name='DATE' /> </sentence>" -F body="I was here on the 1st September 1982" localhost:5058

Giving:

 <META name="DATE" value="1982-09-01" score="1.00" />
 <META name="sentence" value="I was here on the 1982-09-01 " score="1.00" />
 ...
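
Conceptually, the replacement amounts to swapping a character span of the extracted text for the zone's normalised value. The sketch below shows that idea in Python under stated assumptions: the zone offsets and the normalised value are supplied by hand (the server discovers them itself), the function name is hypothetical, and the sketch does not try to reproduce the exact whitespace of the server output.

```python
def apply_normalised_zones(text: str, zones: list[tuple[int, int, str]]) -> str:
    # zones: non-overlapping (start, end, normalised_value) spans that lie
    # fully inside the text. Hypothetical structure for illustration.
    parts = []
    cursor = 0
    for start, end, value in sorted(zones):
        parts.append(text[cursor:start])  # untouched text before the zone
        parts.append(value)               # normalised value replaces the zone
        cursor = end
    parts.append(text[cursor:])           # untouched text after the last zone
    return "".join(parts)

sentence = "I was here on the 1st September 1982"
raw_date = "1st September 1982"
start = sentence.index(raw_date)
zones = [(start, start + len(raw_date), "1982-09-01")]
print(apply_normalised_zones(sentence, zones))
```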

extract_using_sibling_extraction_names

Replaces any part of the extraction with the name of any overlapping sibling extraction

As with normalised zone values, you may replace overlapping extractions. The replacement uses sibling extractions rather than only descendant extractions, which lets us control whether the replacing extraction is itself extracted. (By definition, any descendant extraction is itself extracted at the extract=“1” rule site.)

curl --form-string sandbox="<sentence extract_name='sentence' extract='1' extract_using_sibling_extraction_names='1' > <expression type='DATE' extract_name='DATE' /></sentence>" -F body="I was here on the 1st September 1989" localhost:5058

Here we are using the descendant extraction called “DATE”, so we get:

   <META name="DATE" value="1989-09-01" score="1.00" />
   <META name="sentence" value="I was here on the [DATE]" score="1.00"/>

This attribute may seem to be of little purpose. However, the main use case envisioned is redacting information in the extractions (alongside using the REDACT rule to redact information in the feedback of the document itself). So in this case we tag the date as REDACTED. Note that this makes it a sibling extraction: because there is no EXTRACT_TAGS in the extraction tree, the tagged phrase range simply travels up the tree and is treated as a sibling of our sentence extraction.

curl --form-string sandbox="<sentence extract_name='sentence' extract='1' extract_using_sibling_extraction_names='1' > <expression type='DATE' tag='REDACTED' /></sentence>" -F body="I was here on the 1st September 1989" localhost:5058

Giving us:

<META name="sentence" value="I was here on the [REDACTED]" score="1.00"/>

This time the DATE itself (complete with its value) is not extracted, unlike in the descendant extraction example.

extract_using_sibling_extraction_values

Replaces any part of the extraction with the value of any overlapping sibling extraction

This is very similar to using the name of an overlapping sibling extraction, except that the replacement uses the value of the extraction rather than its name.

Using both the name and the value for a redaction does not really make sense (the value is included, so nothing is redacted). It does, however, have a use case in marking up the value as a particular type. That can be useful for systems such as Text Analytics that analyse documents by type, e.g. “Find all people in documents which are near a date”.

curl --form-string sandbox="<sentence extract_name='sentence' extract='1' extract_using_sibling_extraction_values='1' extract_using_sibling_extraction_names='1' > <expression type='DATE' extract_name='DATE' /><expression type='PERSON' extract_name='PERSON'/></sentence>" -F body="Mr Joe Bloggs was here on the 1st September 1989" localhost:5058

This replaces the DATE and the PERSON in the extracted sentence with [DATE{value}] and [PERSON{value}]. Without extract_using_sibling_extraction_values you would just get the generic [DATE] and [PERSON]. Whether you want the actual value in the replacement depends on the use case.

<META name="DATE" value="1989-09-01" score="1.00"/>
<META name="PERSON" value="Joe Bloggs" score="1.00"/>
<META name="sentence" value="Mr [PERSON{Joe Bloggs}] was here on the [DATE{1989-09-01}]" score="1.00"/>
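
The two replacement styles shown on this page, [NAME] and [NAME{value}], can be sketched as a span substitution over the sentence. This is an illustration only: the sibling spans are supplied by hand, the helper names are hypothetical, and only the two combinations demonstrated above are modelled (names alone, and names together with values).

```python
def replace_siblings(text, siblings, with_values=False):
    # siblings: non-overlapping (start, end, name, value) spans.
    # with_values=False mimics names only; True mimics names plus values.
    parts, cursor = [], 0
    for start, end, name, value in sorted(siblings):
        label = f"[{name}{{{value}}}]" if with_values else f"[{name}]"
        parts.append(text[cursor:start])  # text before the sibling span
        parts.append(label)               # bracketed marker replaces the span
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)

def span(text, raw, name, value):
    # Locate the raw phrase and attach the extraction name and value.
    start = text.index(raw)
    return (start, start + len(raw), name, value)

sentence = "Mr Joe Bloggs was here on the 1st September 1989"
siblings = [
    span(sentence, "Joe Bloggs", "PERSON", "Joe Bloggs"),
    span(sentence, "1st September 1989", "DATE", "1989-09-01"),
]
print(replace_siblings(sentence, siblings))
print(replace_siblings(sentence, siblings, with_values=True))
```

Renaming the DATE span to REDACTED in this sketch gives the redaction form from the earlier example, where the marker hides the value entirely.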

Semaphore Classification Server Rulebase Reference