Extracting facts/structuring unstructured content

Last Updated: July 8, 2026

In most enterprises, the majority of valuable information is locked away in unstructured formats such as emails, contracts, reports, and customer feedback. These documents contain critical facts, but those facts are difficult to access, analyze, or act on without manual review.

Semaphore addresses this with automated fact extraction that combines semantic models, rule-based logic, and natural language processing (NLP). The result is structured, machine-readable data that can power analytics, automation, and AI.

High-Level Overview: From Text to Facts

Semaphore's fact extraction pipeline is built on the same semantic foundation as its classification engine. It uses:

Auto-published extraction rules: Authored by domain experts and deployed via the Knowledge Model Management (KMM) module.
Taxonomies and ontologies: Provide the conceptual framework for identifying and interpreting facts.
Multilingual NLP: Supports tokenization, entity recognition, and pattern matching in multiple languages using language packs.

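This allows Semaphore to extract facts that follow patterns such as:

"<Organization> acquired <Organization> for <$X>"
"<Subject> experienced <Event> on <Date>"
"<Person> submitted a <Document> regarding <Topic>"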

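As a rough illustration of what a pattern-based extraction rule does, the sketch below uses a plain Python regular expression to pull an acquisition fact out of a sentence. It is not Semaphore's rule syntax or API. The pattern, the function name `extract_acquisition`, and the field names `buyer`, `target`, and `amount` are all assumptions made up for this example.

```python
import re

# Hypothetical pattern for illustration only; this is not Semaphore rule syntax.
# A "name" here is one or more capitalized words separated by single spaces.
NAME = r"[A-Z][\w&.]*(?: [A-Z][\w&.]*)*"

ACQUISITION = re.compile(
    rf"(?P<buyer>{NAME}) acquired (?P<target>{NAME}) for "
    r"(?P<amount>\$[\d,.]+(?: (?:million|billion))?)"
)

def extract_acquisition(text):
    """Return a dict with buyer, target, and amount for the first match, or None."""
    match = ACQUISITION.search(text)
    return match.groupdict() if match else None

fact = extract_acquisition(
    "Last year Acme Corp acquired Widget Works for $12 million in cash."
)
print(fact)
```

A real rule set would lean on the semantic model to decide what counts as an organization instead of relying on capitalization, which is the weakest part of this sketch.
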
Entity Recognition: Identifying the Who, What, and Where

Semaphore uses Named Entity Recognition (NER) to identify and tag key entities in text, including:

People: Names, roles, authors
Organizations: Companies, agencies, departments
Locations: Cities, countries, facilities
Dates and Times: Event dates, deadlines, timestamps
Monetary Values: Prices, costs, fines, settlements

These entities are linked to concepts in the semantic model, ensuring consistency across documents and languages.
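
To make the idea of linking entities to concepts more concrete, here is a minimal dictionary-based sketch in plain Python. It assumes a tiny hand-written label table. The concept IDs (`org:ibm`, `loc:munich`), the table `CONCEPT_LABELS`, and the function `tag_entities` are hypothetical and do not reflect how Semaphore stores or resolves concepts.

```python
import re

# Hypothetical mini-model: surface labels mapped to (concept ID, entity type).
# Several labels, including labels in different languages, share one concept ID.
CONCEPT_LABELS = {
    "international business machines": ("org:ibm", "Organization"),
    "ibm": ("org:ibm", "Organization"),
    "munich": ("loc:munich", "Location"),
    "münchen": ("loc:munich", "Location"),
}

def tag_entities(text):
    """Return (matched text, concept ID, entity type) tuples in text order."""
    found = []
    for label, (concept_id, entity_type) in CONCEPT_LABELS.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((m.start(), m.group(0), concept_id, entity_type))
    # Sort by position in the text, then drop the position from each tuple.
    return [item[1:] for item in sorted(found)]

print(tag_entities("IBM opened a lab in München."))
print(tag_entities("International Business Machines opened a lab in Munich."))
```

Both sentences resolve to the same two concept IDs even though the surface text differs, which is the consistency property described above.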

Contextual Relationships: Extracting Meaningful Facts

Beyond identifying entities, Semaphore extracts relationships between them using semantic patterns and rule-based logic. This allows it to capture:

Events: "Company A filed for bankruptcy on March 3rd."
Transactions: "Supplier X delivered 500 units to Warehouse Y."
Obligations: "The tenant must pay rent by the 5th of each month."

These facts are structured as triples (subject-predicate-object) or JSON-like objects, making them easy to store, query, and analyze.
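
The two representations can be sketched as follows, using the bankruptcy example from the list above. This is only an illustration of the shapes involved: the `Triple` class, the predicate names (`filedFor`, `filingDate`), and the JSON field names are assumptions, not Semaphore's output schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    object: str

# "Company A filed for bankruptcy on March 3rd." as triples.
# Predicate names are illustrative assumptions.
triples = [
    Triple("Company A", "filedFor", "bankruptcy"),
    Triple("Company A", "filingDate", "March 3rd"),
]

# The same fact as a single JSON-like object (field names are assumptions).
event = {
    "type": "Event",
    "subject": "Company A",
    "action": "filed for bankruptcy",
    "date": "March 3rd",
}

def subjects_with(predicate, facts):
    """Return the subject of every triple that uses the given predicate."""
    return [t.subject for t in facts if t.predicate == predicate]

triples_json = json.dumps([asdict(t) for t in triples], indent=2)
print(triples_json)
print(json.dumps(event, indent=2))
print(subjects_with("filedFor", triples))
```

Triples suit graph stores and queries across many facts, while a single object keeps all the details of one event together. Which form is preferable depends on the downstream system.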

Multilingual NLP and Language Packs

Semaphore supports multilingual fact extraction using language-specific tokenizers, grammars, and dictionaries. Language packs enable:

Accurate tokenization and sentence segmentation
Locale-specific entity recognition (e.g., currency formats, date styles)
Pattern matching in multiple languages

This is essential for global organizations that operate across regions and need consistent fact extraction from content in English, French, German, Spanish, Japanese, and more.
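
To show why locale-specific handling matters for something as simple as a monetary value, the sketch below normalizes the same amount written in English, German, and French conventions. It is a simplified stand-in for what a language pack might configure. The `SEPARATORS` table and the `parse_amount` function are hypothetical and are not part of Semaphore.

```python
from decimal import Decimal

# Hypothetical per-language settings: (thousands separator, decimal separator).
SEPARATORS = {
    "en": (",", "."),
    "de": (".", ","),
    "fr": (" ", ","),
}

def parse_amount(raw, lang):
    """Convert a locale-formatted number such as '1.234,56' to a Decimal."""
    thousands, decimal_sep = SEPARATORS[lang]
    # Treat non-breaking and narrow non-breaking spaces as ordinary spaces.
    digits = raw.replace("\u00a0", " ").replace("\u202f", " ")
    # Drop the grouping separator first, then standardize the decimal mark.
    digits = digits.replace(thousands, "").replace(decimal_sep, ".")
    return Decimal(digits)

print(parse_amount("1,234.56", "en"))
print(parse_amount("1.234,56", "de"))
print(parse_amount("1 234,56", "fr"))
```

All three calls yield the same value, whereas parsing the German string with English rules would either fail or produce a different number.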

Use Cases Across Industries

Pharmaceuticals

Extract adverse events, trial outcomes, and dosage information from clinical literature.
Automate literature surveillance for pharmacovigilance.

Legal

Identify parties, clauses, and obligations in contracts.
Extract key dates and terms for contract lifecycle management.

Finance

Detect risk indicators in analyst reports and regulatory filings.
Extract transaction details from invoices and audit documents.

Customer Service

Capture complaint types, product references, and escalation triggers from support tickets and emails.

Business Impact

By structuring unstructured content, Semaphore enables:

Faster decision-making with real-time access to critical facts
Improved compliance through automated monitoring and reporting
Enhanced analytics with richer, more accurate data
AI readiness by feeding structured facts into models and dashboards
