Anchoring extractors and facts

Last Updated: July 8, 2026

Typically, an anchor is a concept with alternative labels that is matched against the content purely to mark the beginning, the end, or some mid-point of an extractor’s sequence. You can use anchors anywhere in the sequence to make it more specific and more likely to match only the correct context in the content. Use anchors with caution, however: not all documents will contain the same anchors, and some will contain none at all. Where documents do contain the anchor but word it differently, one workaround is to add the variants as alternative labels on the anchor concept. For highly variable content, this typically results in long lists of alternative labels.

The anchors discussed so far are simple textual ones. However, we can also “anchor” facts themselves by applying anchoring constraints to them, for example that the fact must start or end a sentence, paragraph, or document. Only facts that are both of the correct type and in the correct location are then extracted. The same anchoring constraints can also be applied to actual anchors.
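To make the idea of a positional constraint concrete, here is a minimal Python sketch. It is not Semaphore code or FACTS syntax: the function names, the `"sentence_start"` constraint name, and the naive sentence splitter are all assumptions made for illustration.

```python
import re


def sentence_spans(text):
    """Return (start, end) character offsets of sentences, using a naive splitter."""
    spans = []
    for m in re.finditer(r"[^.!?]+[.!?]?", text):
        raw = m.group()
        stripped = raw.strip()
        if not stripped:
            continue
        # Move the start past any leading whitespace left by the previous sentence.
        start = m.start() + (len(raw) - len(raw.lstrip()))
        spans.append((start, start + len(stripped)))
    return spans


def matches_at(text, label, constraint=None):
    """Find case-insensitive matches of `label`; optionally keep only sentence-initial ones.

    `constraint` is a hypothetical name: None (no constraint) or "sentence_start".
    """
    sentence_starts = {start for start, _ in sentence_spans(text)}
    hits = []
    for m in re.finditer(re.escape(label), text, flags=re.IGNORECASE):
        if constraint == "sentence_start" and m.start() not in sentence_starts:
            continue
        hits.append((m.start(), m.end()))
    return hits


text = "Total due: 40 USD. Please pay the total due by Friday."
unconstrained = matches_at(text, "total due")
constrained = matches_at(text, "total due", constraint="sentence_start")
```

Without the constraint the label matches in both sentences; with it, only the sentence-initial occurrence survives, which is the filtering effect described above.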

The opposite case also arises: there is text that we are not interested in matching and simply want to ignore. In the framework, we do this by skipping over it.
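The interplay between an anchor, skipped text, and the value of interest can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the FACTS implementation: `extract_after_anchor`, its `max_skip` parameter, and the token-based matching are all hypothetical.

```python
import re


def extract_after_anchor(text, anchor_labels, max_skip=3):
    """Return the first number found within `max_skip` skipped tokens after an anchor label."""
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    results = []
    for label in anchor_labels:
        label_tokens = label.lower().split()
        n = len(label_tokens)
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] != label_tokens:
                continue
            # Skip over up to `max_skip` uninteresting tokens, then take the first number.
            window = tokens[i + n:i + n + max_skip + 1]
            for tok in window:
                if tok.isdigit():
                    results.append(tok)
                    break
    return results


text = "Invoice total is now 250 euros. The total of 3 items."
values = extract_after_anchor(text, ["invoice total", "amount due"])
```

Here the words "is now" are skipped, and the number 3 is ignored because it does not follow one of the anchor labels. Adding more alternative labels widens what the sequence can match, at the cost of a longer list to maintain.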

There are four main types of anchor currently:

In Semaphore, textual matching in normal classification works by generating textual rules from the labels of a concept. This is still true in the FACTS framework. Such concepts can either be represented as a single element in the extractor’s sequence, or you can link out from an element to a taxonomy of concepts that you wish to use. In that case, all the concepts in that taxonomy, from the concept you point at downwards, are looked for as anchors at that element’s position in the extractor’s sequence. Both individual and taxonomic concepts can have alternative labels.
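The "from the concept you point at downwards" behaviour amounts to collecting labels over a subtree. The following Python sketch shows that idea only; the `Concept` class, its field names, and the naive substring matching are assumptions and do not reflect how Semaphore stores taxonomies or generates rules.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Hypothetical stand-in for a taxonomy concept."""
    pref_label: str
    alt_labels: list = field(default_factory=list)
    narrower: list = field(default_factory=list)


def anchor_labels(concept):
    """Collect preferred and alternative labels for a concept and everything below it."""
    labels = [concept.pref_label, *concept.alt_labels]
    for child in concept.narrower:
        labels.extend(anchor_labels(child))
    return labels


def find_anchor(text, concept):
    """Return the labels from the subtree that occur in the text (naive substring match)."""
    lowered = text.lower()
    return [label for label in anchor_labels(concept) if label.lower() in lowered]


currency = Concept(
    "currency",
    alt_labels=["money"],
    narrower=[
        Concept("US dollar", alt_labels=["USD"]),
        Concept("euro", alt_labels=["EUR"]),
    ],
)
found = find_anchor("Amount payable: 40 USD", currency)
```

Pointing the element at `currency` makes every narrower concept's labels eligible as the anchor. A real matcher would respect word boundaries; substring matching is used here only to keep the sketch short.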

If the anchor we want to use follows a pattern for which listing every variant would be prohibitive, such as anchors that contain numbers, we can use wildcard anchors. These use simple wildcard variables to match text, including letters and numbers.
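As a rough illustration of how a wildcard anchor saves enumerating every variant, the Python sketch below translates a wildcard pattern into a regular expression. The wildcard symbols are invented for this example (`#` for a run of digits, `@` for a run of letters) and are not the FACTS wildcard syntax.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a hypothetical wildcard pattern: '#' = digits, '@' = letters, rest literal."""
    parts = []
    for ch in pattern:
        if ch == "#":
            parts.append(r"\d+")
        elif ch == "@":
            parts.append(r"[A-Za-z]+")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def find_wildcard_anchors(text, pattern):
    """Return every substring of the text that matches the wildcard pattern."""
    return [m.group() for m in wildcard_to_regex(pattern).finditer(text)]


sections = find_wildcard_anchors("See Section 12 and Section 7b for details.", "Section #")
clauses = find_wildcard_anchors("Refer to Clause A-14 of the contract.", "Clause @-#")
```

A single pattern such as `Section #` stands in for what would otherwise be an open-ended list of alternative labels ("Section 1", "Section 2", and so on).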

If the anchor we want to use follows a pattern that is too complex to model with wildcards, we can use built-in pattern recognition facts, which we call entities. Entities are typically things like dates, people’s names, organisations’ names, geographic locations, URLs, and so on. Again, these are useful when enumerating every possible value would be prohibitive. The downside of this approach is that it can be less accurate, and have poorer recall, than putting the effort into building a taxonomy of concepts or wildcard patterns in the first place. A full list of the entities supported out of the box can be found in Entities.

See Anchors for information regarding Anchor Properties.
