Last Updated: July 8, 2026

A Methodology

As noted earlier, there are many different types of context extractors and facts to address various fact extraction challenges. We will look at contexts in more detail later. However, there are some general points to consider about low-level, or atomic (not grouped), facts.

The advice that follows depends largely on your particular use case.

Typically, if you want to query facts and unify the same facts across your document corpus, it is best to extract facts using a taxonomy of concepts. This provides a key advantage: identity. Identity is essential if you want to run semantic queries across the facts that describe your content. Concepts also allow precise matching on their textual labels (preferred and alternative): a concept acts like a zoner, and its ability to find text is dictated by the labels you assign. Because this improves the accuracy of what is considered a fact, you can use looser contexts that are more robust and flexible to content variations.
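
To make the idea of identity concrete, here is a minimal Python sketch. It is not Semaphore code or its API: the `Concept` class, the `find_concepts` function, the identifier `c-001`, and the sample labels are all hypothetical. It only illustrates how different labels found in different documents can resolve to the same concept.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    concept_id: str                 # hypothetical identifier: the "identity"
    preferred_label: str
    alternative_labels: tuple = ()

    def labels(self):
        return (self.preferred_label, *self.alternative_labels)


def find_concepts(text, taxonomy):
    """Return (concept_id, matched_text) for every label found in text."""
    hits = []
    for concept in taxonomy:
        for label in concept.labels():
            # Whole-word, case-insensitive match on the label text.
            pattern = r"\b" + re.escape(label) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((concept.concept_id, m.group(0)))
    return hits


# Illustrative taxonomy with one concept and two alternative labels.
taxonomy = [Concept("c-001", "Acetaminophen", ("Paracetamol", "APAP"))]

hits_a = find_concepts("The patient was given paracetamol.", taxonomy)
hits_b = find_concepts("Acetaminophen was prescribed.", taxonomy)
```

The matched text differs between the two documents, but both hits carry the same concept identifier, which is what lets a query treat them as the same fact.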

Entities discovered using trained approaches can be useful, especially for facts such as dates. However, other entities might not have the accuracy or recall you require for your fact extraction project. In such cases, a taxonomy is preferable; extracted entities without identity are simply text (of a type). Depending on your requirements, a lack of identity might be a deal-breaker. Extracting raw text is the least beneficial for semantic querying or integration, but may be suitable for use cases where only the text is required.

Wildcarded facts can be very useful for extracting numerical facts, as numbers typically have a structure and may include a unit. Those units could come from a taxonomy, in which case the wildcard fact and the taxonomy fact together form a grouped fact.
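
As a rough illustration of pairing a structured number with a unit drawn from a taxonomy, the following Python sketch uses a regular expression in place of a wildcard fact. The `UNIT_CONCEPTS` mapping, the concept identifiers, and `extract_quantities` are assumptions made for this example and do not reflect how FACTS is configured.

```python
import re

# Hypothetical unit taxonomy: label -> concept identifier.
UNIT_CONCEPTS = {
    "mg": "unit-milligram",
    "milligrams": "unit-milligram",
    "kg": "unit-kilogram",
}

# The number plays the role of the wildcard part: digits, optional decimals.
NUMBER = r"\d+(?:\.\d+)?"
# Longest labels first so a longer label is preferred over a shorter one.
UNIT = "|".join(re.escape(u) for u in sorted(UNIT_CONCEPTS, key=len, reverse=True))
QUANTITY = re.compile(rf"(?P<value>{NUMBER})\s*(?P<unit>{UNIT})\b", re.IGNORECASE)


def extract_quantities(text):
    """Return one grouped result per number-plus-unit match."""
    results = []
    for m in QUANTITY.finditer(text):
        results.append({
            "value": float(m.group("value")),
            "unit_text": m.group("unit"),
            "unit_concept": UNIT_CONCEPTS[m.group("unit").lower()],
        })
    return results


quantities = extract_quantities(
    "The shipment weighed 2.5 kg and contained 500 mg of reagent."
)
```

Each result groups the raw number with the unit's concept identifier, so the unit has identity even though the number itself is only text.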

You can mix these facts in context hierarchies, using the right one for the right job at the right time. The best approach for your fact extraction project will depend on your requirements for the extracted facts.

Some First Steps

The methodology can be organized into these four steps:

Identify the facts you wish to extract from your content: represent those facts as abstract schemas.
Identify the content that contains those facts: represent that content as distinct document types.
Create fingerprints that identify those document types: represent those fingerprints using identifying contexts.
Create extractors that extract the facts: represent those extractors using the many specialized context and fact types.

It is crucial to understand your content. Writing extractors becomes much easier once you know your content well.

Let’s take a closer look at each step:

Identify the facts you wish to extract from your content

This is fundamental to the entire process. Which facts are of interest? More importantly, do those facts have structure? Are they complex? Do they have “sub-facts”?

That structure can only be represented as a tree; there is no ordering relationship other than “containment” or “hierarchy,” just like any other taxonomy. Therefore, we have parent facts and child facts.

Once you identify all the parent and child facts you wish to extract, and how they are structured, you can represent them in that structure in FACTS.

We do this in the FACT NAMES concept scheme.

Each fact is created as a concept there, of concept class type “Fact Name.”

Child facts can then be created under their parents.

The hierarchy can be polyhierarchical, because these concepts only contribute their preferred label, which is used to name an extracted fact or to group extracted facts.
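
The parent and child structure described above can be pictured with a small Python sketch. This is only a mental model, not the FACTS data model: the `FactName` class, the `outline` helper, and the invoice-style fact names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FactName:
    preferred_label: str
    children: list = field(default_factory=list)

    def add_child(self, child):
        self.children.append(child)
        return child


def outline(fact, depth=0):
    """Return the hierarchy as indented lines, one per fact name."""
    lines = ["  " * depth + fact.preferred_label]
    for child in fact.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical schema: an Invoice fact with nested child facts.
invoice = FactName("Invoice")
total = invoice.add_child(FactName("Total"))
total.add_child(FactName("Amount"))
currency = total.add_child(FactName("Currency"))

# Polyhierarchy: the same fact name placed under a second parent.
line_item = invoice.add_child(FactName("Line Item"))
line_item.add_child(currency)

schema_lines = outline(invoice)
```

Here "Currency" appears under both "Total" and "Line Item", which is harmless in this model because a fact name contributes only its label.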

Identify the content that contains those facts

Once you have identified your facts and represented their abstract structure in FACTS, decide which content deserves to be distinguished.

Is there a subset in the content that could be described as different from the rest and shares commonalities with some other subset?

If so, then you have a document type. Such document types are represented in FACTS as concepts under the concept scheme “EXTRACTORS” of concept class “Document Type.”

You can have more than one document type in your content and as many as you wish in FACTS.

These document types do not have to be disjoint, but if they are, that probably best represents your content.

Only one document type is recognized for any document sent to FACTS.

Create fingerprints that identify those document types

Once you have determined how to split the content into recognizably different subsets, the next question is how to identify those subsets for each document sent to Classification Server (CS).

In FACTS, we create fingerprints for this purpose. Each fingerprint is a child concept of the Document Type, and its concept class is a subclass of Document Metadata: either a Document Fact or a Document Anchor.

Use a Document Anchor if you do not need the fingerprint returned from CS. It only fingerprints the document, and you will not see which text it matched when it fired.
Use a Document Fact if you do want the fingerprint returned from CS. In this case, you will see the text that was matched when it fired.
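
The difference between the two fingerprint kinds can be sketched in Python as follows. Everything here is hypothetical (`Fingerprint`, `DocumentType`, `recognize`, and the patterns), and two behaviors are assumptions made for the example rather than documented FACTS behavior: a single firing fingerprint is enough to recognize a type, and the first matching type in list order wins.

```python
import re
from dataclasses import dataclass


@dataclass
class Fingerprint:
    pattern: str
    returns_text: bool  # True ~ Document Fact, False ~ Document Anchor


@dataclass
class DocumentType:
    name: str
    fingerprints: list


def recognize(text, document_types):
    """Return (type_name, returned_texts) for the first type that fires, else None."""
    for doc_type in document_types:
        returned = []
        fired = False
        for fp in doc_type.fingerprints:
            m = re.search(fp.pattern, text, re.IGNORECASE)
            if m:
                fired = True
                if fp.returns_text:
                    # Only a fact-style fingerprint exposes the matched text.
                    returned.append(m.group(0))
        if fired:
            # Assumption: at most one document type is reported per document.
            return doc_type.name, returned
    return None


doc_types = [
    DocumentType("Invoice", [Fingerprint(r"invoice\s+no\.?\s*\d+", True)]),
    DocumentType("Contract", [Fingerprint(r"this agreement is made", False)]),
]

invoice_result = recognize("INVOICE No. 1042, issued in March", doc_types)
contract_result = recognize("This Agreement is made between two parties.", doc_types)
```

The invoice fingerprint behaves like a Document Fact and returns its matched text, while the contract fingerprint behaves like a Document Anchor and identifies the type without returning anything.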

Create extractors that extract the facts

This is where the real action is! We will not cover this in detail here; that will be discussed in later sections.

However, some general points are worth noting:

For each document fact, there needs to be at least one extractor.
If there is more than one, you will typically preclude them, that is, order them by preference. This prevents multiple facts from being returned if more than one extractor fires. Usually, you want to use your “best” extractor first.
The extractors instantiate the abstract structure, or schema, of the facts. Build each extractor around the schema of its fact, including its parent and child facts if it has them, so that the facts it extracts follow the original abstract definition.
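
The effect of ordering extractors by preference can be sketched as a first-match-wins loop. This Python example is an analogy only: `TOTAL_EXTRACTORS`, `extract_first`, and the regular expressions are hypothetical and are not how extractors or preclusion are defined in FACTS.

```python
import re

# Hypothetical extractors for one fact, ordered best first.
TOTAL_EXTRACTORS = [
    ("labelled total", re.compile(r"Total due:\s*(\d+(?:\.\d{2})?)")),
    ("any amount", re.compile(r"(\d+\.\d{2})")),
]


def extract_first(text, extractors):
    """Return (extractor_name, value) from the first extractor that fires."""
    for name, pattern in extractors:
        m = pattern.search(text)
        if m:
            # Later, less preferred extractors are skipped once one fires.
            return name, m.group(1)
    return None


preferred = extract_first("Subtotal 90.00 Total due: 99.50", TOTAL_EXTRACTORS)
fallback = extract_first("Paid 12.00 in cash", TOTAL_EXTRACTORS)
```

In the first call both patterns would match, but only the preferred extractor's result is returned, so a single fact comes back instead of two.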

Let’s now turn to more detail in the following sections!