Linguistic capabilities
- Last Updated: May 13, 2026
Key to table:
- Y = supported
- - = not supported
| Languages | Tokenization | Sentence Boundary | Token Normalization | Lemma Lookup | Part-of-Speech Tagging | Disambiguation | Lemma User Dictionary | Segmentation User Dictionary | Disambiguation | Disambiguation | Script Conversion | Disambiguation | Semitic Root |
| Arabic | Y | Y | Y | Y | Y | Y | - | Y | - | - | - | Y | Y |
| Catalan | Y | Y | - | Y | - | - | - | Y | - | - | - | - | - |
| Chinese (simplified and traditional) | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | - | - |
| Czech | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - | - |
| Danish | Y | Y | Y | Y | - | - | Y | Y | Y | - | - | - | - |
| Dutch | Y | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - |
| English | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - | - |
| Estonian | Y | Y | - | Y | - | - | - | Y | - | - | - | - | - |
| Finnish | Y | Y | - | - | - | - | - | Y | - | - | - | Y | - |
| French | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - | - |
| German | Y | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - |
| Greek | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - | - |
| Hebrew | Y | Y | - | Y | Y | Y | - | Y | - | - | - | - | - |
| Hungarian | Y | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - |
| Italian | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - | - |
| Japanese | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - |
| Korean | Y | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - |
| Norwegian | Y | Y | Y | Y | - | - | Y | Y | Y | - | - | - | - |
| Persian | Y | Y | Y | Y | Y | - | - | Y | - | - | - | Y | - |
| Polish | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - | - |
| Portuguese | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - | - |
| Pashto | Y | Y | - | - | - | - | - | Y | - | - | - | - | - |
| Romanian | Y | Y | Y | Y | - | - | - | Y | - | - | - | - | - |
| Russian | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - | - |
| Serbian | Y | Y | - | Y | - | - | - | Y | - | - | - | - | - |
| Slovak | Y | Y | - | Y | - | - | - | Y | - | - | - | - | - |
| Spanish | Y | Y | Y | Y | Y | Y | Y | Y | - | - | - | - | - |
| Swedish | Y | Y | Y | Y | - | - | Y | Y | Y | - | - | - | - |
| Thai | Y | Y | Y | Y | - | - | Y | Y | - | - | - | - | - |
| Turkish | Y | Y | Y | Y | - | - | - | Y | - | - | - | - | - |
| Urdu | Y | Y | Y | - | Y | - | - | Y | - | - | - | Y | - |
Extracted Entities
Entity Extraction is the process of discovering and presenting specific entities and facts that occur in unstructured text. Entities denote the names of people, places, things, dates, values, and so forth, that can be extracted from text. An entity is defined as a pairing of a standard form and its type. For example, Winston Churchill/PERSON is an entity in which Winston Churchill is the standard form and PERSON is the type.
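The pairing of a standard form and a type can be pictured as a small record. The sketch below is illustrative only: the `Entity` class and its field names are hypothetical and are not part of Semaphore's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """An entity as a (standard form, type) pair. Hypothetical, for illustration."""

    standard_form: str
    entity_type: str

    def __str__(self) -> str:
        # Render in the "standard form/TYPE" notation used in the text.
        return f"{self.standard_form}/{self.entity_type}"


churchill = Entity("Winston Churchill", "PERSON")
print(churchill)
```

Because the record is frozen and compared by value, two mentions that resolve to the same standard form and type compare as equal.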
The language modules included with the software contain system dictionaries and provide an extensive set of predefined entity types. The extraction process can extract entities using statistical techniques, pattern matching (regex), and dictionaries (gazetteers). Extraction classifies each extracted entity by entity type and presents this metadata in a standardized format.
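To make the difference between dictionary and pattern-based extraction concrete, here is a minimal sketch of the two techniques. It is not how Semaphore implements them: the gazetteer contents, the email pattern, and the `extract` function are all assumptions made for illustration, and the statistical technique is not shown.

```python
import re

# Hypothetical gazetteer: exact strings mapped to an entity type.
GAZETTEER = {
    "Winston Churchill": "PERSON",
    "London": "LOCATION",
}

# Deliberately simple illustrative pattern; real email matching is more involved.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract(text):
    """Return (standard_form, entity_type, start_offset) tuples sorted by offset."""
    found = []
    # Gazetteer pass: exact, case-sensitive matches of known names.
    for name, entity_type in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((name, entity_type, match.start()))
    # Regex pass: anything that fits the pattern, known in advance or not.
    for match in EMAIL_PATTERN.finditer(text):
        found.append((match.group(), "IDENTIFIER:EMAIL", match.start()))
    return sorted(found, key=lambda item: item[2])


sample = "Winston Churchill lived in London. Contact: archive@example.org."
for entity in extract(sample):
    print(entity)
```

The gazetteer can only find names it already lists, while the regex finds any string of the right shape.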
Key to tables:
- S = statistical processor
- G = exact matching processor (gazetteer)
- R = pattern matching processor (regex)
- D = (BETA) deep neural network processor
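Cells in the entity tables combine these codes with slashes, for example S/G/D. A small helper such as the following could expand a cell into processor names when the tables are processed programmatically. The function name and the wording of the labels are assumptions.

```python
# Processor codes from the key above; the label wording is illustrative.
PROCESSORS = {
    "S": "statistical",
    "G": "gazetteer (exact matching)",
    "R": "regex (pattern matching)",
    "D": "deep neural network (BETA)",
}


def decode_cell(cell):
    """Expand a table cell such as 'S/G/D' into a list of processor names.

    An empty cell yields an empty list, meaning no processor is listed.
    """
    codes = [code.strip() for code in cell.split("/") if code.strip()]
    return [PROCESSORS[code] for code in codes]


print(decode_cell("S/G/D"))
```

An unknown code raises `KeyError`, which helps catch typos in a transcribed table.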
Standard Entities
These entities are enabled by default in Semaphore.
- LOC = LOCATION: A city, state, country, region, or other location that contains both a population and a government. A geographic place such as a body of water, mountain, park, or address. A structure such as a building or monument.
- ORG = ORGANIZATION: A corporation, institution, government agency, or other group of people defined by an established organizational structure.
- PER = PERSON: A human identified by name, nickname or alias.
- PROD = PRODUCT
- TTL = TITLE: Appellation associated with a person by virtue of occupation, office, or birth, or as an honorific.
- NAT = NATIONALITY: Reference to a country or region of origin, such as American or Swiss.
- REL = RELIGION: Reference to an organized religion or theology, as well as its followers.
- CC# = IDENTIFIER:CREDIT_CARD_NUM
- EM = IDENTIFIER:EMAIL
- MONEY = IDENTIFIER:MONEY
- PERS_ID = IDENTIFIER:PERSONAL_ID_NUM
- TEL# = IDENTIFIER:PHONE_NUMBER
- URL = IDENTIFIER:URL
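Several of the type names above take the form `IDENTIFIER:EMAIL`, with a colon between two parts. Assuming the part before the colon is a parent type and the part after it is a subtype (the list suggests this but does not state it), a name can be split as in this sketch. The helper `split_type` is hypothetical.

```python
def split_type(type_name):
    """Split 'PARENT:SUBTYPE' into (parent, subtype); subtype is None if absent."""
    parent, _, subtype = type_name.partition(":")
    return parent, (subtype or None)


for name in ("PERSON", "IDENTIFIER:EMAIL", "IDENTIFIER:PHONE_NUMBER"):
    print(name, "->", split_type(name))
```

Grouping by the parent part would, for example, gather all `IDENTIFIER` entities regardless of subtype.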
| Language | ISO code | LOC | ORG | PER | PROD | TTL | NAT | REL | CC# | EM | MONEY | PERS_ID | TEL# | URL |
| Arabic | ara | S/G/D | S/G/D | S/D | S | G | G | R | R | R | R | R | R | |
| Chinese (script-intensive, simplified and traditional) | zho, zhs, zht | S/G | S/G | S | S | G | G | R | R | R | R | R | R | |
| Dutch | nld | S | S/G | S | G | R | R | R | R | R | R | |||
| English | eng | S/G | S/R/G/D | S/D | S | S | G | G | R | R | R | R | R | R |
| French | fra | S | S/G | S | S | R | R | R | R | R | R | |||
| German | deu | S | S/G | S | S | R | R | R | R | R | R | |||
| Hebrew | heb | S | S/G | S | R | R | R | R | R | R | ||||
| Hungarian | hun | S/G | S/G | S/G | S | S | R | R | R | R | R | R | ||
| Indonesian | ind | S/G | S/G | S | R | R | R | R | R | R | ||||
| Italian | ita | S | S/G | S | S | R | R | R | R | R | R | |||
| Japanese | jpn | S | S/G | S | S | G | G | R | R | R | R | R | R | |
| Korean | kor | S/D | S/G/D | S/D | S | G | G | S | S | S | G | G | ||
| Malay, Malay Standard | msa, zsm | S/G | S/G | S | R | R | R | R | R | R | ||||
| Persian (Western Farsi, Dari) | fas | S | S/G | S | G | G | G | R | R | R | R | R | R | |
| Portuguese | por | S | S/G | S | S | R | R | R | R | R | R | |||
| Pashto | pus | S | S/G | S | S | R | R | R | R | R | R | |||
| Russian | rus | S | S/G | S | S | G | G | R | R | R | R | R | R | |
| Spanish | spa | S | S/G | S | S | R | R | R | R | R | R | |||
| Urdu | urd | S | S/G | S | G | R | R | R | R | R | R | |||
| Vietnamese | vie | S | S | S | G | G | G | R | R | R | R | R | R |
Additional Entities
An additional set of entities can be enabled for most languages. Most of them are extracted by pattern matching (regex) processors.
- DIST = IDENTIFIER:DISTANCE
- LATLNG = IDENTIFIER:LATITUDE_LONGITUDE
- UTM = IDENTIFIER:UTM: Geographical coordinates, expressed with the Universal Transverse Mercator System
- DATE = TEMPORAL:DATE
- TIME = TEMPORAL:TIME
| Language | ISO code | DIST | LATLNG | UTM | DATE | TIME |
| Arabic | ara | R | R | R | R | R |
| Chinese (script-intensive, simplified and traditional) | zho, zhs, zht | R | R | R | R | R |
| Dutch | nld | R | R | R | R | R |
| English | eng | R | R | R | R | R |
| French | fra | R | R | R | R | R |
| German | deu | R | R | R | R | R |
| Hebrew | heb | R | R | R | R | R |
| Hungarian | hun | R | R | R | R | R |
| Indonesian | ind | R | R | R | R | R |
| Italian | ita | R | R | R | R | R |
| Japanese | jpn | R | R | R | R | R |
| Korean | kor | S | ||||
| Malay, Malay Standard | msa, zsm | R | R | R | R | |
| Persian (Western Farsi, Dari) | fas | R | R | R | R | R |
| Portuguese | por | R | R | R | R | R |
| Pashto | pus | R | R | R | R | R |
| Russian | rus | R | R | R | R | R |
| Spanish | spa | R | R | R | R | R |
| Urdu | urd | R | ||||
| Vietnamese | vie | R | R | R |
Comments and glossary
See the terminology page (./introduction/terminology.md) for help.
Chinese
There are two standard forms of written Chinese: Simplified Chinese (SC) and Traditional Chinese (TC). SC is used in the People’s Republic of China (PRC), normally employing the GB2312-80 or GBK character set. TC is used in Taiwan, Hong Kong, and Macau, normally employing the Big Five character set. Semaphore supports both forms of Chinese.
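As an illustration of the character sets mentioned above, the following sketch encodes the word for "Chinese language" in its Simplified form with GBK and in its Traditional form with Big Five, using the codec names from Python's standard library. It says nothing about how Semaphore itself reads or converts input; the sample strings are chosen only for the example.

```python
# The same word in the two written forms.
simplified = "汉语"   # Simplified Chinese (SC)
traditional = "漢語"  # Traditional Chinese (TC)

# Encode each form with the legacy character set typically used for it.
gbk_bytes = simplified.encode("gbk")
big5_bytes = traditional.encode("big5")

# Decoding with the matching codec restores the original text.
assert gbk_bytes.decode("gbk") == simplified
assert big5_bytes.decode("big5") == traditional

print(gbk_bytes.hex(), big5_bytes.hex())
```

Decoding bytes with the wrong codec either fails or yields unrelated characters, so the source encoding needs to be known before text is processed.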
Reading
Japanese support includes the rendering of Furigana transcriptions into Hiragana.