DISTINCT
- Last Updated: May 13, 2026
New in Semaphore 5.10.1
Rewrites phrase ranges to remove any overlap.
Score calculation
Scores the weight of the rule if any child phrase ranges are found.
Evidence calculation
The evidence is all the phrase ranges from its children, with any overlapping part occurring only once.
Attribute information
- any attribute
- DISTINCTTYPE
- TYPE
- WEIGHT - the score if any child phrase ranges are found
Children restrictions
Any rule other than those restricted to a specific parent
The DISTINCT rule is similar to the UNION rule. The difference is that UNION merges 2 overlapping phrase ranges into a single phrase range, whereas DISTINCT keeps 2 phrase ranges (unless the 2nd phrase range is fully contained rather than simply overlapping), with the overlapping part occurring in only one of them.
You can choose which phrase range will contain the overlap by using the DISTINCTTYPE attribute.
Example
<distinct>
  <text data="A B C" />
  <text data="B C D" />
</distinct>
Suppose the text contains the sequence A B C D.
Here 2 phrase ranges are passed up to the rule: “A B C” and “B C D”. They overlap on “B C”. The default behaviour of DISTINCT is to keep the overlap in the rightmost phrase range (left means towards the beginning of the document), so the resulting phrase ranges are “A” and “B C D”. With a UNION rule used instead, the result would be a single phrase range, “A B C D”.
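The difference between the two rules can be sketched outside Semaphore with a few lines of Python. This is only an illustration under stated assumptions: phrase ranges are modelled as half-open (start, end) token indices, the names `union`, `distinct` and `keep_overlap` are hypothetical and are not Semaphore identifiers, and the treatment of a fully contained range (dropped here) is a guess at the behaviour the text alludes to. The actual DISTINCTTYPE values are not listed in this page, so `"right"` and `"left"` are placeholders.

```python
def union(ranges):
    """Merge overlapping (start, end) ranges, as the text describes for UNION."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start < merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def distinct(ranges, keep_overlap="right"):
    """Trim ranges so that every token is covered by at most one range."""
    out = []
    for start, end in sorted(ranges):
        if not out or start >= out[-1][1]:
            out.append((start, end))      # no overlap with the previous range
            continue
        prev_start, prev_end = out[-1]
        if end <= prev_end:
            continue                      # fully contained: assumed to be dropped
        if keep_overlap == "right":
            # The later range keeps the overlap; the earlier one is cut short.
            out.pop()
            if start > prev_start:
                out.append((prev_start, start))
            out.append((start, end))
        else:
            # The earlier range keeps the overlap; the later one starts after it.
            out.append((prev_end, end))
    return out


tokens = ["A", "B", "C", "D"]
ranges = [(0, 3), (1, 4)]                 # "A B C" and "B C D"


def phrases(rs):
    """Render token ranges back into text for display."""
    return [" ".join(tokens[s:e]) for s, e in rs]


print(phrases(union(ranges)))                           # ['A B C D']
print(phrases(distinct(ranges)))                        # ['A', 'B C D']
print(phrases(distinct(ranges, keep_overlap="left")))   # ['A B C', 'D']
```

With the default setting the sketch reproduces the “A” and “B C D” result described above, while the union version collapses everything into “A B C D”.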
A reasonable question is why this rule exists in addition to UNION. The motivating case was extraction grouping where repeated anchors are both matched and extracted. Grouping with a UNION did not give the grouping that was wanted. Consider the following rule:
<extract>
  <sequence extract_group="Section" punctuation="ignore_all" type="text">
    <text data="Section #" extract_name="Section number"/>
    <skip count="100?" extract_name="Section text"/>
    <any>
      <text data="Section #"/>
      <text data="Appendices"/>
    </any>
  </sequence>
</extract>
Applied to the following text:
Section 1
This is some text for the section.
Section 2
This is some more text for the next section.
Section 3
This is yet some more text for the next and next section.
Appendices
The problem here is that the sequences found from “Section 1” to “Section 2” and from “Section 2” to “Section 3” overlap. In many cases the overlap wouldn’t matter, but here we want the “anchor” for each section extracted and grouped. Since we are grouping by the <sequence>’s phrase ranges, the tagged “Section 2” also occurs in the first sequence, and so it is grouped with the first section as well as with its own:
<META name="Section" value="" score="1.00">
  <META name="Section number" value="Section 1" score="1.00"/>
  <META name="Section number" value="Section 2" score="1.00"/>
  <META name="Section text" value="This is some text for the section." score="1.00"/>
</META>
<META name="Section" value="" score="1.00">
  <META name="Section number" value="Section 2" score="1.00"/>
  <META name="Section number" value="Section 3" score="1.00"/>
  <META name="Section text" value="This is some more text for the next section." score="1.00"/>
</META>
<META name="Section" value="" score="1.00">
  <META name="Section number" value="Section 3" score="1.00"/>
  <META name="Section text" value="This is yet some more text for the next and next section." score="1.00"/>
</META>
Applying a UNION above the <sequence>, which was the first thought, doesn’t help here: it would run all the <sequence> matches into a single phrase range and thus group everything together, which is not what is wanted.
Adding a DISTINCT rule above the SEQUENCE and using that rule as the grouping phrase range solves the problem:
<extract>
  <distinct extract_group="Section">
    <sequence punctuation="ignore_all" type="text">
      <text data="Section #" extract_name="Section number"/>
      <skip count="100?" extract_name="Section text"/>
      <any>
        <text data="Section #"/>
        <text data="Appendices"/>
      </any>
    </sequence>
  </distinct>
</extract>
Moving the overlap to the right (towards the end of the document) gives us the wanted grouping and extractions:
<META name="Section" value="" score="1.00">
  <META name="Section number" value="Section 1" score="1.00"/>
  <META name="Section text" value="This is some text for the section." score="1.00"/>
</META>
<META name="Section" value="" score="1.00">
  <META name="Section number" value="Section 2" score="1.00"/>
  <META name="Section text" value="This is some more text for the next section." score="1.00"/>
</META>
<META name="Section" value="" score="1.00">
  <META name="Section number" value="Section 3" score="1.00"/>
  <META name="Section text" value="This is yet some more text for the next and next section." score="1.00"/>
</META>
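To see why the grouping changes, the section example can be modelled roughly in Python. This is a sketch under assumptions, not a description of how Semaphore works internally: the document is treated as a list of lines rather than tokens, each sequence match is a half-open (start, end) range of line indices, and the helper names `keep_overlap_right` and `group` are hypothetical. Extractions are listed in document order, which may differ from the order in the META output.

```python
# Line-level model of the sample document (illustrative only).
lines = [
    "Section 1",
    "This is some text for the section.",
    "Section 2",
    "This is some more text for the next section.",
    "Section 3",
    "This is yet some more text for the next and next section.",
    "Appendices",
]

# (line index, extract_name) for each tagged item.
extractions = [
    (0, "Section number"), (1, "Section text"),
    (2, "Section number"), (3, "Section text"),
    (4, "Section number"), (5, "Section text"),
]

# Each sequence runs from one anchor up to and including the next anchor,
# so neighbouring matches share the anchor line.
sequence_ranges = [(0, 3), (2, 5), (4, 7)]


def keep_overlap_right(ranges):
    """Trim overlapping ranges so the overlap stays with the later range."""
    out = []
    for start, end in sorted(ranges):
        if out and start < out[-1][1]:
            if end <= out[-1][1]:
                continue                  # fully contained: assumed dropped
            prev_start = out.pop()[0]
            if start > prev_start:
                out.append((prev_start, start))
        out.append((start, end))
    return out


def group(ranges, items):
    """Collect the extractions that fall inside each grouping range."""
    return [
        [(name, lines[pos]) for pos, name in items if start <= pos < end]
        for start, end in ranges
    ]


def section_numbers(groups):
    """Keep only the 'Section number' values of each group."""
    return [[v for n, v in g if n == "Section number"] for g in groups]


by_sequence = group(sequence_ranges, extractions)
by_distinct = group(keep_overlap_right(sequence_ranges), extractions)

print(section_numbers(by_sequence))
# [['Section 1', 'Section 2'], ['Section 2', 'Section 3'], ['Section 3']]
print(section_numbers(by_distinct))
# [['Section 1'], ['Section 2'], ['Section 3']]
```

Grouping by the raw sequence ranges pulls each following anchor into the previous group, whereas grouping by the trimmed ranges leaves one anchor per group, which mirrors the two META outputs shown above.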