What is a fact?
- Last Updated: May 29, 2026
In the world of unstructured content, a fact is any information that can be identified or inferred from the text contained in a single document.
A fact usually has a specific meaning within a larger context, such as a consuming system, a business, or an information ecosystem. In practice, any information that can be routinely identified from a single document can be treated as a fact.
Facts come in many different forms. For example, a document may contain multiple dates, but you may only be interested in one, such as the “Date of approval.” That is the fact you would model in the framework for extraction.
Often, a fact is a simple, contextualized data type or entity, such as a date, person, organization, location, amount, unit, or measure. These are typically single facts. Sometimes, these entities might also be modeled as concepts in a taxonomy; which representation is preferable is discussed later.
However, we can generalize the concept of a fact and create custom facts using a technique called grouped facts. Grouped facts allow you to combine smaller facts into larger, user-defined facts, which may be composed of any number of more basic, atomic facts arranged hierarchically. For example, an address can be considered a grouped fact called "Address fact," which is itself made up of smaller facts, some of which might also be grouped facts, such as "Street fact."
If we had the address text below:
Alan Flett
Flat 25,
1 Prince of Wales Road,
London NW5 3LW
Then our fact might look like:
ADDRESS:
    PERSON: Alan Flett
    STREET ADDRESS:
        FLAT NUMBER: 25
        STREET:
            STREET NUMBER: 1
            STREET NAME: Prince of Wales Road
    CITY: London
    POSTCODE: NW5 3LW
Here, several facts are present, some of which group others. The grouping facts are ADDRESS, STREET ADDRESS, and STREET. The atomic facts are PERSON, FLAT NUMBER, STREET NUMBER, STREET NAME, CITY, and POSTCODE.
In this way, we can build contextualized, or grouped, facts.
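To make the grouping concrete, here is a minimal sketch of how a grouped fact could be held in memory as a tree. The `Fact` class and its `walk` method are hypothetical names chosen for illustration, not part of any Semaphore API; the values come from the address text above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Fact:
    """Hypothetical model: a fact is atomic (has a value) or grouped (has children)."""
    name: str
    value: Optional[str] = None
    children: List["Fact"] = field(default_factory=list)

    @property
    def is_grouped(self) -> bool:
        return bool(self.children)

    def walk(self) -> Iterator["Fact"]:
        """Yield this fact and every nested fact, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


address = Fact("ADDRESS", children=[
    Fact("PERSON", "Alan Flett"),
    Fact("STREET ADDRESS", children=[
        Fact("FLAT NUMBER", "25"),
        Fact("STREET", children=[
            Fact("STREET NUMBER", "1"),
            Fact("STREET NAME", "Prince of Wales Road"),
        ]),
    ]),
    Fact("CITY", "London"),
    Fact("POSTCODE", "NW5 3LW"),
])

grouping = [f.name for f in address.walk() if f.is_grouped]
atomic = [f.name for f in address.walk() if not f.is_grouped]
print(grouping)  # ['ADDRESS', 'STREET ADDRESS', 'STREET']
print(atomic)
```

A tree like this keeps each atomic value attached to the group that gives it meaning, so "1" is not just a number but the STREET NUMBER of a particular STREET.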
There are many different types of facts, but typically, you are looking for text that is either:
- Literally represented in the model: This is usually matched using a concept (on its own or from a taxonomy) and its associated label evidence, as with any other classification strategy. For example, the name of a city, such as “London,” and its alternative labels, such as “The Smoke.”
- Represented in the model as a matching wildcard pattern: This is usually matched using a regex or wildcard pattern, where variables bind to actual text values from the documents. For example, a postcode might be represented as a pattern like “^^# #^^”.
- Represented in the model simply as an entity type: This is usually matched using a trained entity zoner, which has been trained to look for text likely to be of that entity type (e.g., a date, organization, person, etc.).
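The first two kinds of match can be sketched in a few lines of Python. This is an illustration only, not how the product implements matching: the function names are hypothetical, and the reading of the wildcard syntax ("^" as one letter, "#" as one digit) is an assumption inferred from the postcode example. The third kind, entity-type matching, depends on a trained model and is not shown.

```python
import re

# Literal match: a concept with its preferred and alternative labels.
CITY_LABELS = {"London": ["London", "The Smoke"]}


def match_labels(text, labels_by_concept):
    """Return the concepts that have at least one label present in the text."""
    found = []
    for concept, labels in labels_by_concept.items():
        for label in labels:
            if re.search(r"\b" + re.escape(label) + r"\b", text, re.IGNORECASE):
                found.append(concept)
                break
    return found


# Pattern match. ASSUMPTION: "^" stands for one letter and "#" for one digit.
WILDCARDS = {"^": "[A-Za-z]", "#": "[0-9]"}


def wildcard_to_regex(pattern):
    """Translate the assumed wildcard syntax into a compiled regular expression."""
    parts = [WILDCARDS.get(ch, re.escape(ch)) for ch in pattern]
    return re.compile(r"\b" + "".join(parts) + r"\b")


postcode_re = wildcard_to_regex("^^# #^^")

text = "Alan Flett, Flat 25, 1 Prince of Wales Road, London NW5 3LW"
print(match_labels(text, CITY_LABELS))  # ['London']
print(postcode_re.findall(text))        # ['NW5 3LW']
```

Note that the label match returns the concept ("London") regardless of which label appeared in the text, while the pattern match returns the text that the wildcards bound to.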
The idea of grouped facts leads naturally to the idea of grouped contexts: each fact inside a grouped fact needs its own context, so that it is extracted from the right place within the group. This allows you to build hierarchies of contexts and facts, enabling very precise and robust fact extraction using strategies that will be discussed later.
By using hierarchies of contexts and facts, you can achieve accurate, robust, and flexible extraction.
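As a rough illustration of why nested contexts help, the sketch below first narrows the document to one address block and only then looks for a postcode inside it. Everything here is hypothetical: the sample document (including the Leeds billing address), the helper names, and the simplified postcode regex are invented for the example and do not reflect the product's context mechanism.

```python
import re

# Invented sample document containing two addresses, each with a postcode.
DOCUMENT = """Billing address:
Flat 2, 10 High Street, Leeds LS1 4AB

Delivery address:
Flat 25, 1 Prince of Wales Road, London NW5 3LW
"""

# Simplified postcode pattern, good enough for this sample only.
POSTCODE = re.compile(r"\b[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}\b")


def context_after(heading, text):
    """Outer context: the block following a heading, up to the next blank line."""
    match = re.search(re.escape(heading) + r"\n(.*?)(?:\n\s*\n|\Z)", text, re.DOTALL)
    return match.group(1) if match else ""


def extract_in_context(heading, pattern, text):
    """Inner fact: the first pattern match found inside the outer context."""
    scope = context_after(heading, text)
    found = pattern.search(scope)
    return found.group(0) if found else None


print(POSTCODE.findall(DOCUMENT))                                    # ['LS1 4AB', 'NW5 3LW']
print(extract_in_context("Delivery address:", POSTCODE, DOCUMENT))  # NW5 3LW
```

Without the outer context, both postcodes match and there is no way to tell which one belongs to the delivery address; scoping the search to the enclosing context removes the ambiguity.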