Linguistic capabilities

Last Updated: May 13, 2026

Key to table:

Y = supported
- = not supported


Languages	Tokenization	Sentence Boundary	Token Normalization	Lemma Lookup	Part-of-Speech Tagging	Disambiguation	Lemma User Dictionary	Segmentation User Dictionary	Disambiguation	Disambiguation	Script Conversion	Disambiguation	Semitic Root
Arabic	Y	Y	Y	Y	Y	Y	-	Y	-	-	-	Y	Y
Catalan	Y	Y	-	Y	-	-	-	Y	-	-	-	-	-
Chinese (simplified and traditional)	Y	Y	Y	Y	Y	Y	Y	Y	Y	Y	Y	-	-
Czech	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-	-
Danish	Y	Y	Y	Y	-	-	Y	Y	Y	-	-	-	-
Dutch	Y	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-
English	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-	-
Estonian	Y	Y	-	Y	-	-	-	Y	-	-	-	-	-
Finnish	Y	Y	-	-	-	-	-	Y	-	-	-	Y	-
French	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-	-
German	Y	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-
Greek	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-	-
Hebrew	Y	Y	-	Y	Y	Y	-	Y	-	-	-	-	-
Hungarian	Y	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-
Italian	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-	-
Japanese	Y	Y	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-
Korean	Y	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-
Norwegian	Y	Y	Y	Y	-	-	Y	Y	Y	-	-	-	-
Persian	Y	Y	Y	Y	Y	-	-	Y	-	-	-	Y	-
Polish	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-	-
Portuguese	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-	-
Pashto	Y	Y	-	-	-	-	-	Y	-	-	-	-	-
Romanian	Y	Y	Y	Y	-	-	-	Y	-	-	-	-	-
Russian	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-	-
Serbian	Y	Y	-	Y	-	-	-	Y	-	-	-	-	-
Slovak	Y	Y	-	Y	-	-	-	Y	-	-	-	-	-
Spanish	Y	Y	Y	Y	Y	Y	Y	Y	-	-	-	-	-
Swedish	Y	Y	Y	Y	-	-	Y	Y	Y	-	-	-	-
Thai	Y	Y	Y	Y	-	-	Y	Y	-	-	-	-	-
Turkish	Y	Y	Y	Y	-	-	-	Y	-	-	-	-	-
Urdu	Y	Y	Y	-	Y	-	-	Y	-	-	-	Y	-

Extracted Entities

Entity Extraction is the process of discovering and presenting specific entities and facts that occur in unstructured text. Entities denote the names of people, places, things, dates, values, and so forth, that can be extracted from text. An entity is defined as a pairing of a standard form and its type. For example, Winston Churchill/PERSON is an entity in which Winston Churchill is the standard form and PERSON is the type.
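The pairing of standard form and type can be pictured as a small record. The sketch below is illustrative only: the `Entity` class and its field names are hypothetical and are not part of the Semaphore API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical record: a standard form paired with an entity type."""
    standard_form: str
    entity_type: str

    def __str__(self) -> str:
        # Render in the "standard form/TYPE" notation used in the text.
        return f"{self.standard_form}/{self.entity_type}"


churchill = Entity("Winston Churchill", "PERSON")
print(churchill)  # Winston Churchill/PERSON
```

Making the record immutable (`frozen=True`) lets two mentions that resolve to the same standard form and type compare equal, which is one reasonable way to de-duplicate extracted entities.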

The language modules included with the software contain system dictionaries and provide an extensive set of predefined entity types. Entities can be extracted using statistical techniques, pattern matching (regex), and dictionaries (gazetteers). Extraction classifies each extracted entity by entity type and presents this metadata in a standardized format.

Key to tables:

S = statistical processor
G = exact matching processor (gazetteer)
R = pattern matching processor (regex)
D = (BETA) deep neural network processor
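To make the difference between the G and R processors concrete, here is a minimal sketch of exact matching against a dictionary and pattern matching with a regular expression. It is not Semaphore's implementation: the `GAZETTEER` contents, the simplified e-mail pattern, and the `extract` function are all assumptions made for illustration.

```python
import re

# Hypothetical gazetteer: surface form -> (standard form, entity type).
GAZETTEER = {
    "Swiss": ("Swiss", "NATIONALITY"),
    "American": ("American", "NATIONALITY"),
}

# Deliberately simplified e-mail pattern; production patterns are stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract(text):
    """Return (standard form, type, processor code) tuples found in text."""
    found = []
    # G: exact matching against the gazetteer, on word boundaries.
    for surface, (standard, etype) in GAZETTEER.items():
        if re.search(rf"\b{re.escape(surface)}\b", text):
            found.append((standard, etype, "G"))
    # R: pattern matching with a regular expression.
    for match in EMAIL_RE.finditer(text):
        found.append((match.group(), "IDENTIFIER:EMAIL", "R"))
    return found


print(extract("A Swiss banker wrote to info@example.com yesterday."))
```

A statistical (S) or neural (D) processor would instead rely on a trained model, which is beyond the scope of a short sketch.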

Standard Entities

These entities are enabled by default in Semaphore.

LOC = LOCATION: A city, state, country, region, or other location that contains both a population and a government. A geographic place such as a body of water, mountain, park, or address. A structure such as a building or monument.
ORG = ORGANIZATION: A corporation, institution, government agency, or other group of people defined by an established organizational structure.
PER = PERSON: A human identified by name, nickname or alias.
PROD = PRODUCT
TTL = TITLE: Appellation associated with a person by virtue of occupation, office, birth, or as an honorific.
NAT = NATIONALITY: Reference to a country or region of origin, such as American or Swiss.
REL = RELIGION: Reference to an organized religion or theology, as well as its followers.
CC# = IDENTIFIER:CREDIT_CARD_NUM
EM = IDENTIFIER:EMAIL
MONEY = IDENTIFIER:MONEY
PERS_ID = IDENTIFIER:PERSONAL_ID_NUM
TEL# = IDENTIFIER:PHONE_NUMBER
URL = IDENTIFIER:URL


Language	ISO code	LOC	ORG	PER	PROD	TTL	NAT	REL	CC#	EM	MONEY	PERS_ID	TEL#	URL
Arabic	ara	S/G/D	S/G/D	S/D		S	G	G	R	R	R	R	R	R
Chinese (script-intensive, simplified and traditional)	zho, zhs, zht	S/G	S/G	S		S	G	G	R	R	R	R	R	R
Dutch	nld	S	S/G	S		G			R	R	R	R	R	R
English	eng	S/G	S/R/G/D	S/D	S	S	G	G	R	R	R	R	R	R
French	fra	S	S/G	S		S			R	R	R	R	R	R
German	deu	S	S/G	S		S			R	R	R	R	R	R
Hebrew	heb	S	S/G	S					R	R	R	R	R	R
Hungarian	hun	S/G	S/G	S/G	S	S			R	R	R	R	R	R
Indonesian	ind	S/G	S/G	S					R	R	R	R	R	R
Italian	ita	S	S/G	S		S			R	R	R	R	R	R
Japanese	jpn	S	S/G	S		S	G	G	R	R	R	R	R	R
Korean	kor	S/D	S/G/D	S/D		S	G	G	S	S	S	G	G
Malay, Malay Standard	msa, zsm	S/G	S/G	S					R	R	R	R	R	R
Persian (Western Farsi, Dari)	fas	S	S/G	S		G	G	G	R	R	R	R	R	R
Portuguese	por	S	S/G	S		S			R	R	R	R	R	R
Pashto	pus	S	S/G	S		S			R	R	R	R	R	R
Russian	rus	S	S/G	S		S	G	G	R	R	R	R	R	R
Spanish	spa	S	S/G	S		S			R	R	R	R	R	R
Urdu	urd	S	S/G	S		G			R	R	R	R	R	R
Vietnamese	vie	S	S	S		G	G	G	R	R	R	R	R	R
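Cells in the table combine processor codes with slashes, for example S/R/G/D. A short sketch of how such a cell could be expanded when the table is processed programmatically follows; the `decode_cell` helper and the `PROCESSORS` mapping are hypothetical names, and the sample row is abridged to its first five columns.

```python
# Processor codes, taken from the key to the tables.
PROCESSORS = {
    "S": "statistical",
    "G": "gazetteer (exact matching)",
    "R": "regex (pattern matching)",
    "D": "deep neural network (BETA)",
}


def decode_cell(cell):
    """Expand a table cell such as 'S/G/D' into processor names.

    An empty cell means no processor is listed for that entity type.
    """
    cell = cell.strip()
    if not cell:
        return []
    return [PROCESSORS[code] for code in cell.split("/")]


# First five columns of the English row.
header = ["Language", "ISO code", "LOC", "ORG", "PER"]
english = ["English", "eng", "S/G", "S/R/G/D", "S/D"]
row = dict(zip(header, english))

print(decode_cell(row["ORG"]))
```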

Additional Entities

An additional set of entities can be enabled for most languages. Most of them are extracted by the regex processor.

DIST = IDENTIFIER:DISTANCE
LATLNG = IDENTIFIER:LATITUDE_LONGITUDE
UTM = IDENTIFIER:UTM: Geographical coordinates, expressed with the Universal Transverse Mercator System
DATE = TEMPORAL:DATE
TIME = TEMPORAL:TIME


Language	ISO code	DIST	LATLNG	UTM	DATE	TIME
Arabic	ara	R	R	R	R	R
Chinese (script-intensive, simplified and traditional)	zho, zhs, zht	R	R	R	R	R
Dutch	nld	R	R	R	R	R
English	eng	R	R	R	R	R
French	fra	R	R	R	R	R
German	deu	R	R	R	R	R
Hebrew	heb	R	R	R	R	R
Hungarian	hun	R	R	R	R	R
Indonesian	ind	R	R	R	R	R
Italian	ita	R	R	R	R	R
Japanese	jpn	R	R	R	R	R
Korean	kor	S
Malay, Malay Standard	msa, zsm	R		R	R	R
Persian (Western Farsi, Dari)	fas	R	R	R	R	R
Portuguese	por	R	R	R	R	R
Pashto	pus	R	R	R	R	R
Russian	rus	R	R	R	R	R
Spanish	spa	R	R	R	R	R
Urdu	urd			R
Vietnamese	vie	R			R	R

Comments and glossary

See the terminology page (./introduction/terminology.md) for help.

Chinese

There are two standard forms of written Chinese: Simplified Chinese (SC) and Traditional Chinese (TC). SC is used in the People’s Republic of China (PRC), normally employing the GB2312-80 or GBK character set. TC is used in Taiwan, Hong Kong, and Macau, normally employing the Big Five character set. Semaphore supports both forms of Chinese.
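When text arrives in one of these legacy character sets, it typically has to be decoded to Unicode before further processing. The sketch below uses Python's standard `gbk` and `big5` codecs (GBK is a superset of GB2312); the `to_unicode` helper and the SC/TC lookup are hypothetical and do not describe how Semaphore handles encodings internally.

```python
# Python's standard codec names, keyed by the SC/TC labels used above.
CODECS = {"SC": "gbk", "TC": "big5"}


def to_unicode(raw: bytes, form: str) -> str:
    """Decode legacy-encoded Chinese bytes, given the written form (SC or TC)."""
    return raw.decode(CODECS[form])


simplified = "简体中文"   # sample Simplified Chinese text
traditional = "繁體中文"  # sample Traditional Chinese text

# Simulate legacy-encoded input by encoding the samples first.
sc_bytes = simplified.encode("gbk")
tc_bytes = traditional.encode("big5")

print(to_unicode(sc_bytes, "SC"))
print(to_unicode(tc_bytes, "TC"))
```

Decoding with the wrong codec either raises `UnicodeDecodeError` or silently produces unrelated characters, so the encoding should be known or detected before decoding.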

Reading

Japanese support includes the rendering of Furigana transcriptions into Hiragana.
