The Semaphore Fact Extraction Framework (FACTS)

Types of facts

  • Last Updated: May 13, 2026

The fact elements of the framework are where the textual matching finally happens: we find the contextual text that we want to treat as our fact. Ultimately, this is all about matching text. The only fact type that simply lifts all the text in the context, with no matching, is the “Captured Fact”; all the others use some form of text-matching rule.

There are currently seven types of facts:

  1. Concept Fact
  2. Taxonomy Fact
  3. Wildcard Fact
  4. Entity Fact
  5. Captured Fact
  6. Logical Concept Fact
  7. Logical Taxonomy Fact

In normal Semaphore classification, textual matching works by generating textual rules from the labels of a concept. This is still true in the FACTS framework. A concept can be represented as a single element in the extractor’s sequence, or you can link out from an element to a taxonomy of concepts that you wish to use. In that case, all the concepts in that taxonomy, from the concept you point at downwards, are looked for as facts in that element’s position in the extractor’s sequence. This is a very important way to extract facts. Both individual and taxonomic concepts can have alternative labels.
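The idea of matching a concept and everything beneath it can be sketched in Python. This is only an illustration, not Semaphore's rule syntax or API: the `TAXONOMY` data, the concept IDs, and the functions `labels_from` and `find_concepts` are all hypothetical, and plain case-insensitive matching stands in for the rules Semaphore generates from labels.

```python
import re

# Hypothetical taxonomy: concept ID -> preferred/alternative labels and child IDs.
TAXONOMY = {
    "C1": {"labels": ["analgesic", "painkiller"], "children": ["C2", "C3"]},
    "C2": {"labels": ["ibuprofen"], "children": []},
    "C3": {"labels": ["paracetamol", "acetaminophen"], "children": []},
}


def labels_from(concept_id, taxonomy):
    """Map every label of the concept and of its descendants to a concept ID."""
    label_map = {}
    stack = [concept_id]
    while stack:
        current = stack.pop()
        node = taxonomy[current]
        for label in node["labels"]:
            label_map[label.lower()] = current
        stack.extend(node["children"])
    return label_map


def find_concepts(text, label_map):
    """Return (concept ID, matched text) pairs in order of appearance."""
    # Longest labels first, so a longer label wins over a shorter prefix.
    alternation = "|".join(
        re.escape(label) for label in sorted(label_map, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return [(label_map[m.group(0).lower()], m.group(0)) for m in pattern.finditer(text)]


label_map = labels_from("C1", TAXONOMY)
print(find_concepts("Take Acetaminophen or ibuprofen.", label_map))
```

Because alternative labels map to the same ID, "paracetamol" and "acetaminophen" resolve to one concept in this sketch.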

If the fact we are looking for exhibits a pattern, and coming up with an extensive list of its possible values would be prohibitive (for example, facts that contain numbers), we can use wildcard (regex) facts. These allow us to use simple wildcard variables to match text, including letters and numbers.
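As a rough analogy, the following Python sketch shows why a pattern is preferable to a list when the values contain numbers. It uses Python's standard `re` module rather than Semaphore's own wildcard syntax, which is not shown here; the part-number format and the name `find_part_numbers` are assumptions made for the example.

```python
import re

# Hypothetical pattern: two capital letters, a hyphen, then three to six digits,
# for example "AB-1234". Listing every such value by hand would be impractical.
PART_NUMBER = re.compile(r"\b[A-Z]{2}-\d{3,6}\b")


def find_part_numbers(text):
    """Return every substring of the text that fits the part-number pattern."""
    return PART_NUMBER.findall(text)


print(find_part_numbers("Order AB-1234 and XY-987, not A-1."))
```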

If the fact we are looking for exhibits a pattern, but one that is too complex to model using wildcards, we can use built-in pattern recognition facts, which we call entities. These entities are typically things like dates, people’s names, organisation names, geographic locations, URLs, and so on. Again, they can be useful when enumerating every possible value would be prohibitive. The downside to this approach is that it can be less accurate, and have poorer recall, than if the effort had been put into building a taxonomy of concepts or wildcard patterns in the first place. A full list of the entities supported out-of-the-box can be found in Entities.

Finally, we can extract the text that is the range of the extractor’s sequence – that is, the text that lies between two points, or lies within a grammatical unit. We do this using a captured fact.

There are many considerations when choosing one type of fact over another. The best choice, for several reasons, tends to be taxonomic. Taxonomic facts have an ID and can therefore be unified easily in later processing. They get to use the full natural language processing stack that Semaphore supports, such as stemming, parts of speech, and casing. They can also be made to work very accurately.

Entities can be useful for well-understood types such as dates. However, the accuracy of others, such as ORGANIZATION, is sometimes not good enough for them to be used in place of a taxonomic approach.

Typically, however, we will use a mixture of all of these. For example, extracting the dosing for a drug will involve a taxonomy of drug names, wildcard expressions to capture the dose amount as a number, and taxonomies for the units.
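The dosing example can be sketched in Python as a three-part sequence: a drug label, a numeric amount, and a unit label. This is a simplified illustration under stated assumptions, not how a Semaphore extractor is configured: the `DRUGS` and `UNITS` label maps, their IDs, the fixed drug-amount-unit order, and the function `extract_doses` are all hypothetical.

```python
import re

# Hypothetical label -> concept ID maps standing in for two taxonomies.
DRUGS = {"ibuprofen": "DRUG-001", "paracetamol": "DRUG-002", "acetaminophen": "DRUG-002"}
UNITS = {"mg": "UNIT-MG", "milligrams": "UNIT-MG", "ml": "UNIT-ML"}


def alternation(labels):
    """Build a regex alternation from labels, longest label first."""
    return "|".join(re.escape(label) for label in sorted(labels, key=len, reverse=True))


# Assumed sequence: drug label, then a number (the wildcard part), then a unit label.
DOSE = re.compile(
    rf"\b(?P<drug>{alternation(DRUGS)})\s+"
    rf"(?P<amount>\d+(?:\.\d+)?)\s*"
    rf"(?P<unit>{alternation(UNITS)})\b",
    re.IGNORECASE,
)


def extract_doses(text):
    """Return one record per dose, with taxonomy IDs for the drug and the unit."""
    return [
        {
            "drug_id": DRUGS[m.group("drug").lower()],
            "amount": float(m.group("amount")),
            "unit_id": UNITS[m.group("unit").lower()],
        }
        for m in DOSE.finditer(text)
    ]


print(extract_doses("Give Ibuprofen 200 mg, then acetaminophen 2.5ml."))
```

The drug and unit come back as IDs, which is what makes them easy to unify in later processing, while the amount is captured by pattern because its values cannot sensibly be listed.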
