Appendix - Additional Information

Last Updated: July 8, 2026

Document Text Extraction

When processing a submitted document, Classification Server first attempts to extract text and then uses the extracted text in the classification process. Without meaningful textual content there is nothing to classify. For example, an image file cannot be classified because the only text it generally contains is meta information encoded within the file structure (Classification Server does not perform OCR). For such files it is perfectly valid for the Classification Server request to generate an error, or for the process to return no classifications at all. Indeed, a correctly configured production installation commonly generates errors for certain documents.

If Classification Server is unable to extract text this could be due to any of the following situations:

Classification Server does not handle the specific file format/version that is being used - Semaphore uses a third-party library for content extraction that is very good, but some formats are simply not handled. The general list of file formats that this library handles can be found below, but note that support for specific versions of a file format (generated by different versions of applications) may differ. This results in errors such as “Unspecified error reading” and “Type identified correctly but no available filter to handle content”.
Classification Server cannot extract text from secured documents - For security and technical reasons, Semaphore often cannot extract text from documents that have been secured with a password or encrypted in some way. This results in errors such as “Password Protected or encrypted file FILENAME” and “Could not open the FILENAME file FILENAME reason: This document requires authentication”.
Classification Server cannot parse the file provided - Even if the file can be opened in the application that generated it, the third-party library may not support the specific information included in it. Often this is because the library does not handle all available variations of the format, or because the published specification it uses does not cover them (that is, the application is generating information in a format the library does not know about). This results in errors such as “Unknown exception caught!”, “Can not open the FILETYPE file FILENAME : The file is damaged and could not be repaired”, “Unspecified error reading”, “Failed to obtain content handle for FILENAME”, “Adobe PDF parse Failed on page X”, “File is invalid xml”, “XML error” and “Unspecified error reading embedded object”.
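
When reviewing logs, it can help to group extraction errors by the situations listed above. The following is a minimal sketch that only matches fragments of the messages quoted in this section. The cause labels and the function name are hypothetical, and the exact wording of real messages may vary between versions.

```python
# Fragments of the error messages quoted above, paired with the situations they
# are listed under. The cause labels are hypothetical names for this sketch.
# More specific fragments come first so they win over shorter ones.
ERROR_FRAGMENTS = [
    ("Password Protected or encrypted file", ("secured",)),
    ("This document requires authentication", ("secured",)),
    ("Type identified correctly but no available filter", ("unsupported_format",)),
    ("Unspecified error reading embedded object", ("unparseable",)),
    # Quoted under two situations, so both are reported.
    ("Unspecified error reading", ("unsupported_format", "unparseable")),
    ("Unknown exception caught!", ("unparseable",)),
    ("The file is damaged and could not be repaired", ("unparseable",)),
    ("Failed to obtain content handle for", ("unparseable",)),
    ("Adobe PDF parse Failed on page", ("unparseable",)),
    ("File is invalid xml", ("unparseable",)),
    ("XML error", ("unparseable",)),
]


def possible_causes(message):
    """Return the cause labels for the first known fragment found in message."""
    for fragment, causes in ERROR_FRAGMENTS:
        if fragment in message:
            return causes
    return ()
```

An empty result means the message is not one of those quoted here, not that the document was processed successfully.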

Specific File Format Content Extraction Notes

The following are some comments regarding what content is extracted from specific file types:

“SWF” is generally an Adobe Flash file from which Semaphore can only extract a very small amount of textual information, if any at all. There is generally no textual content that can meaningfully be used for classification.
“MP3” is generally an audio format from which Semaphore can only extract a small amount of textual information, if any at all, mostly based on the ID3 tags. There is generally no textual content that can meaningfully be used for classification.
“MPG” is generally a video format from which Semaphore can only extract a small amount of textual information, if any at all, mostly based on the file ID only. There is generally no textual content that can meaningfully be used for classification.
“OVG” is generally an “image overlay” file format that Semaphore does not support. With such files, even if Semaphore did support them, there is generally no textual content it can meaningfully use for classification.
“MP4” is generally an audio or video format from which Semaphore can only extract a small amount of textual information, if any at all, mostly based on the metadata. There is generally no textual content that can meaningfully be used for classification.
“TIF” is generally an image format from which Semaphore can only extract a small amount of textual information, if any at all, mostly based on the metadata. There is generally no textual content that can meaningfully be used for classification.
“MSG” is generally an email format that may contain elements Semaphore cannot process, such as (some) embedded objects.
“CHM” is generally a “Microsoft Compiled Help” file, which is an archive-based format. In archive formats, such as ZIP, Semaphore can only process the file names within the archive and not the contained files themselves, so classification of such content is generally not advisable because the results are not useful (even if they are generated).
“DGN” is a CAD file format which Semaphore does not support. From such formats (much like any image file) there is generally no textual content that can meaningfully be used for classification.
“RAR” and “ZIP” are generally archive-based formats. In archive formats Semaphore can only process the file names within the archive and not the contained files themselves, so classification of such content is generally not advisable because the results are not useful (even if they are generated).
“PUB” is generally a Microsoft Publisher file from which Semaphore can only extract a small amount of textual information, if any at all, mostly based on the file ID only. There is generally no textual content that can meaningfully be used for classification.

For those file types Semaphore cannot handle or from which little meaningful data can be extracted it is recommended that you exclude them from classification.
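
One way to act on this recommendation is to build a list of such files before a bulk run. The sketch below collects files by extension and writes them to a list file. The extension set is taken from the notes above (MSG is left out because it usually does contain text), the function names are hypothetical, and the one-path-per-line layout is an assumption, as the expected layout of an exclude list is not described here.

```python
from pathlib import Path

# Extensions from the notes above that rarely yield usable text.
LOW_VALUE_EXTENSIONS = {
    ".swf", ".mp3", ".mpg", ".ovg", ".mp4", ".tif",
    ".chm", ".dgn", ".rar", ".zip", ".pub",
}


def find_low_value_files(root):
    """Return sorted paths under root whose extension is in LOW_VALUE_EXTENSIONS."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in LOW_VALUE_EXTENSIONS
    )


def write_exclude_list(root, list_path):
    """Write the matching paths to list_path, one per line (assumed layout)."""
    paths = find_low_value_files(root)
    text = "\n".join(paths) + ("\n" if paths else "")
    Path(list_path).write_text(text, encoding="utf-8")
    return paths
```

Adjust the extension set to match your own content; some installations may still want file names or metadata from these formats.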

Supported document formats

The following is a sample of the many file formats supported by Classification Server. Contact Progress if a format you require is not listed (as it very well may be supported).

Note: Although many formats are supported, the information Semaphore uses is limited to the text-only content present in the format (including metadata). For example, labels on images may well not be extracted as text content.

Archive	Version
7z (BZIP2 and split archives not supported)
7z Self Extracting exe (BZIP2 and split archives not supported)
LZA Self Extracting Compress
LZH Compress
Microsoft Office Binder	95-97
Microsoft Cabinet (CAB)
RAR	1.5, 2.0, 2.9
Self-extracting .exe
UNIX Compress
UNIX GZip
UNIX tar
Uuencode
Zip	PKZip
Zip	WinZip
Zip	Zip64

Database	Version
DataEase	4.x
DBase	III, IV, V
First Choice DB	Through 3.0
Framework DB	3.0
Microsoft Access	1.0, 2.0, 95-2013
Microsoft Access Report Snapshot (File ID only)	2000 - 2003
Microsoft Works DB for DOS	2.0
Microsoft Works DB for Macintosh	2.0
Microsoft Works DB for Windows	3.0, 4.0
Microsoft Works DB for DOS	1.0
Paradox for DOS	2.0 - 4.0
Paradox for Windows	1.0
Q&A Database	Through 2.0
R:Base	R:Base 5000
R:Base	R:Base System V
Reflex	2.0
SmartWare II DB	1.02

Email	Version
Apple Mail Message (EMLX)	2.0
Encoded mail messages	MHT
Encoded mail messages	Multi Part Alternative
Encoded mail messages	Multi Part Digest
Encoded mail messages	Multi Part Mixed
Encoded mail messages	Multi Part News Group
Encoded mail messages	Multi Part Signed
Encoded mail messages	TNEF
EML with Digital Signature	SMIME
IBM Lotus Notes Domino XML Language DXL	8.5
IBM Lotus Notes NSF (File ID)	7.x, 8.x
IBM Lotus Notes NSF (Windows, Linux x86-32 and Oracle Solaris 32-bit only with Notes Client or Domino Server)	8.x
MBOX Mailbox	RFC 822
Microsoft Outlook (MSG)	97 - 2013
Microsoft Outlook Express (EML)
Microsoft Outlook Forms Template (OFT)	97 - 2013
Microsoft Outlook OST	97 - 2013
Microsoft Outlook PST	97 - 2013
Microsoft Outlook PST(Mac)	2001
MSG with Digital Signature	SMIME

Multimedia	Version
AVI (Metadata extraction only)
Flash (text extraction only)	6.x, 7.x, Lite
Flash (File ID only)	9, 10
Real Media (File ID only)
MP3 (ID3 metadata only)
MPEG-1 Audio layer 3 V ID3 v1 (File ID only)
MPEG-1 Audio layer 3 V ID3 v2 (File ID only)
MPEG-1 Video V 2 (File ID only)
MPEG-1 Video V 3 (File ID only)
MPEG-2 Audio (File ID only)
MPEG-4 (Metadata extraction only)
MPEG-7 (Metadata extraction only)
QuickTime (Metadata extraction only)
Windows Media ASF (Metadata extraction only)
Windows Media DVR-MS (Metadata extraction only)
Windows Media Audio WMA (Metadata extraction only)
Windows Media Playlist (File ID only)
Windows Media Video WMV (Metadata extraction only)
WAV (Metadata extraction only)

Other Formats	Version
AOL Messenger (File ID only)	7.3
Microsoft InfoPath (File ID only)	2007
Microsoft Live Messenger (via XML filter)	10.0
Microsoft Office Theme files (File ID only)	2007-2013
Microsoft OneNote (File ID only)	2007, 2010, 2013
Microsoft Project (table view only)	98 - 2003
Microsoft Project (table view only)	2007, 2010
Microsoft Windows Compiled Help (File ID only)	.chm
Microsoft Windows DLL
Microsoft Windows Executable
Microsoft Windows Explorer Command (File ID only)	.scf
Microsoft Windows Help (File ID only)	.hlp
Microsoft Windows Shortcut (File ID only)	.lnk
Trillian Text Log File (via text filter)	4.2
Trillian XML Log File (File ID only)	4.2
TrueType Font (File ID only)	ttf, ttc
vCalendar	2.1
vCard	2.1
Yahoo! Messenger	6.x - 8

Presentation	Version
Apple iWork Keynote (text and PDF preview)	09
Harvard Graphics Presentation DOS	3.0
IBM Lotus Symphony Presentations	1.x
Kingsoft WPS Presentation	2010
LibreOffice Impress	4.x
Lotus Freelance	1.0-Millennium 9.8
Lotus Freelance for OS/2	2
Lotus Freelance for Windows	95, 97, SmartSuite 9.8
Microsoft PowerPoint for Macintosh	4.0 - 2011
Microsoft PowerPoint for Windows	3.0 - 2013
Microsoft PowerPoint for Windows Slideshow	2007 - 2013
Microsoft PowerPoint for Windows Template	2007 - 2013
Novell Presentations	3.0, 7.0
OpenOffice Impress	1.1, 3.0
Oracle Open Office Impress	3.x
StarOffice Impress	5.2 - 9.0
WordPerfect Presentations	5.1 - X

Raster Image	Version
Adobe Photoshop	4.0
Adobe Photoshop (File ID only)
Adobe Photoshop	CS1-6
CALS Raster (GP4)	Type I
CALS Raster (GP4)	Type II
Computer Graphics Metafile	ANSI
Computer Graphics Metafile	CALS
Computer Graphics Metafile	NIST
Encapsulated PostScript (EPS)	TIFF header Only
GEM Image (Bitmap)
Graphics Interchange Format (GIF)
IBM Graphics Data Format (GDF)	1.0
IBM Picture Interchange Format	1.0
JBIG2	Graphic Embeddings in PDF
JFIF (JPEG not in TIFF format)
JPEG
JPEG 2000	JP2
Kodak Flash Pix
Kodak Photo CD	1.0
Lotus PIC
Lotus Snapshot
Macintosh PICT	BMP only
Macintosh PICT2	BMP only
MacPaint
Microsoft Windows Bitmap
Microsoft Windows Cursor
Microsoft Windows Icon
OS/2 Bitmap
OS/2 Warp Bitmap
Paint Shop Pro (Win32 only)	5.0, 6.0
PC Paintbrush (PCX)
PC Paintbrush DCX (multi-page PCX)
Portable Bitmap (PBM)
Portable Graymap PGM
Portable Network Graphics (PNG)
Portable Pixmap (PPM)
Progressive JPEG
StarOffice Draw	6.x - 9.0
Sun Raster
TIFF	Group 5 & 6
TIFF CCITT	Group 3 & 4
TruVision TGA (Targa)	2.0
Word Perfect Graphics	1.0
WBMP wireless graphics format
X-Windows Bitmap	x10 compatible
X-Windows Dump	x10 compatible
X-Windows Pixmap	x10 compatible
WordPerfect Graphics	2.0 - 10.0

Spreadsheet	Version
Apple iWork Numbers (text and PDF preview)	09
Enable Spreadsheet	3.0 - 4.5
First Choice SS	Through 3.0
Framework SS	3.0
IBM Lotus Symphony Spreadsheets	1.x
Kingsoft WPS Spreadsheets	2010
Lotus 1-2-3	Through Millennium 9.8
Lotus 1-2-3 Charts (DOS and Windows)	Through 5.0
Lotus 1-2-3 for OS/2	2.0
Microsoft Excel Charts	2.x - 2007
Microsoft Excel for Macintosh	98 - 2011
Microsoft Excel for Windows	3.0 - 2013
Microsoft Excel for Windows (text only)	2003 XML
Microsoft Excel for Windows (.xlsb)	2007 - 2013 (Binary)
Microsoft Works SS for DOS	2.0
Microsoft Works SS for Macintosh	2.0
Microsoft Works SS for Windows	3.0, 4.0
Multiplan	4.0
Novell PerfectWorks Spreadsheet	2.0
OpenOffice Calc	1.1 - 3.0
Oracle Open Office Calc	3.x
PFS: Plan	1.0
QuattroPro for DOS	Through 5.0
QuattroPro for Windows	Through X6
SmartWare Spreadsheet
SmartWare II SS	1.02
StarOffice Calc	5.2 - 9.0
SuperCalc	5.0
Symphony	Through 2.0
VP-Planner	1.0

Text & Markup	Version
ANSI Text	7 & 8 bit
ASCII Text	7 & 8 bit
DOS character set
EBCDIC
HTML (CSS rendering not supported)	1.0 - 5.0
IBM DCA/RFT
Macintosh character set
Rich Text Format (RTF)
Unicode Text	3.0, 4.0
UTF-8
Wireless Markup Language
XML (text only)
XHTML (file ID only)	1.0

Vector Image	Version
Adobe Illustrator	4.0 - 7.0
Adobe Illustrator (PDF Preview only)	9.0, CS1-6
Adobe Illustrator XMP	CS1-6
Adobe InDesign XMP	CS1-6
Adobe InDesign Interchange (XMP only)
Adobe PDF	1.0 - 1.7 (Acrobat 1 - 10)
Adobe PDF Package	1.7 (Acrobat 8 - 10)
Adobe PDF Portfolio	1.7 (Acrobat 8 - 10)
Ami Draw	SDW
AutoCAD Drawing	2.5, 2.6
AutoCAD Drawing	9.0 - 14.0
AutoCAD Drawing	2000i - 2013
AutoShade Rendering	2
Corel Draw	2.0 - 9.0
Corel Draw Clipart	5.0, 7.0
Enhanced Metafile (EMF)
Escher graphics
FrameMaker Graphics (FMV)	3.0 - 5.0
Gem File (Vector)
Harvard Graphics Chart DOS	2.0 - 3.0
Harvard Graphics for Windows
HP Graphics Language	2.0
IGES Drawing	5.1 - 5.3
Micrografx Designer	through 3.1
Micrografx Designer	6.0
Micrografx Draw	through 4.0
Microsoft XPS (Text only)
Novell PerfectWorks Draw	2
OpenOffice Draw	1.1 - 3.0
Oracle Open Office Draw	3.x
Visio (Page Preview mode WMF/EMF)	4.0
Visio	5.0 - 2010
Visio (text only)	2013
Visio XML VSX (File ID only)	2007
Windows Metafile

Word Processing	Version
Adobe FrameMaker (MIF only)	3.0 - 6.0
Adobe Illustrator Postscript	Level 2
Ami
Ami Pro for OS2
Ami Pro for Windows	2.0, 3.0
Apple iWork Pages (text and PDF preview)	09
DEC DX	Through 4.0
DEC DX Plus	4.0, 4.1
Enable Word Processor	3.0 - 4.5
First Choice WP	1.0, 3.0
Framework WP	3.0
Hangul	97 - 2007
IBM DCA/FFT
IBM DisplayWrite	2.0 - 5.0
IBM Writing Assistant	1.01
Ichitaro	5.0, 6.0, 8.0 - 13.0, 2004, 2010, 2013
JustWrite	Through 3.0
Kingsoft WPS Writer	2010
Legacy	1.1
Lotus Manuscript	Through 2.0
Lotus WordPro (text only)	9.7, 96 - Millennium 9.8
MacWrite II	1.1
Mass 11	through 8.0
Microsoft Publisher (File ID only)	2003 - 2007
Microsoft Word for DOS	4.0 - 6.0
Microsoft Word for Macintosh	4.0 - 6.0, 98 - 2011
Microsoft Word for Windows	1.0 - 2013
Microsoft Word for Windows (text only)	2003 XML
Microsoft Word for Windows	98-J
Microsoft WordPad
Microsoft Works WP for DOS	2.0
Microsoft Works WP for Macintosh	2.0
Microsoft Works WP for Windows	3.0, 4.0
Microsoft Write for Windows	1.0 - 3.0
MultiMate	Through 4.0
MultiMate Advantage	2.0
Navy DIF
Nota Bene	3.0
Novell PerfectWorks Word Processor	2.0
OfficeWriter	4.0 - 6.0
OpenOffice Writer	1.1 - 3.0
Oracle Open Office Writer	3.x
PC File Doc	5.0
PFS: Write	A, B
Professional Write for DOS	1.0, 2.0
Professional Write Plus for Windows	1.0
Q&A Write	2.0, 3.0
Samna Word IV	1.0 - 3.0
Samna Word IV+
Samsung JungUm Global (File ID only)
Signature	1.0
SmartWare II WP	1.02
Sprint	1.0
StarOffice Writer	5.2 - 9.0
Total Word	1.2
Wang IWP	Through 2.6
WordMarc Composer
WordMarc Composer+
WordMarc Word Processor
WordPerfect for DOS	4.2
WordPerfect for Macintosh	1.02 - 3.1
WordPerfect for Windows	5.1 - X5
Wordstar 2000 for DOS	1.0 - 3.0
Wordstar 2000 for DOS	2.0, 3.0
Wordstar for DOS	3.0 - 7.0
Wordstar for Windows	1.0
XyWrite	Through III+

“classify.py/classify.exe” full command reference

The “options” are as follows:

Option	Description
-h, --help	Show a help message and exit
-s SERVER, --server=SERVER	Server to use (defaults to local machine). As of Semaphore 4.0.48 you can specify the full Classification Server URL, if required, in this parameter (for example, if using Semaphore Cloud you can use the CS URL from the Basic API).
-t, --transfer	Transfer file data with request - default is to transfer file data if server is specified else use file:{full_path} in request
-n, --notransfer	Specify no transfer of file data - use full path to file in request
-p PORT, --port=PORT	Port to use (default is 5058 but default may be changed by setting CS_PORT environment variable)
-r, --recurse	Recurse into subdirectories
--limit=LIMIT	Specifies a limit on the number of files to classify (default is no limit)
--start=START	Specifies the start for files to classify (that is, skips the specified number of files) (default is 0)
--repeat=REPEAT	Classify the same documents REPEAT times (default is 1)
-o OUTPUTFILE, --outputfile=OUTPUTFILE	Specifies a file to write output to (default is to write to the console)
-f FORMAT, --format=FORMAT	Specifies the formatting applied to output (default is “XML”). Values allowed for “FORMAT” are “XML”, “CSV” or “HTML”. Use “--formatfile” to specify an XSLT transform to apply to the output when using “XSL”.
--formatfile=FORMATFILE	Specifies the XSLT transform that is applied to each classification response - output is simply the concatenation of these outputs - <?xml version="1.0"?><start/> and <?xml version="1.0"?><finish/> are passed at the start and finish of bulk (or single) classifications to allow the XSLT to specify a header and footer for the output.
--exclude=EXCLUDEFILELIST	Excludes any files listed in EXCLUDEFILELIST file
--filelist=INCLUDEFILELIST	Includes files listed in INCLUDEFILELIST file
--record=RECORDFILE	Appends a list of those files classified to the RECORDFILE file
--nothreads	Don’t use separate threads for the classifications
--use_csreq_files	Use “csreq” files for request options if they exist
--only_csreq_files	Only classify files which have a “csreq” file
--make_csreq_files	Creates “csreq” files on disk rather than sending request to Classification Server.
--cf=CFSERVERS	Compare results between two servers and only output differences. CFSERVERS format is “server:port” (e.g. `localhost:5058`) or (as of Semaphore 4.0.48) the full URL `http://localhost:5058` and multiple instances of this parameter can be specified. Note that if a “port” is not specified the port number on the server is assumed to be 5058.
--cfi=CFISERVERS	Interactive mode version of the “--cf” parameter which displays differences in a GUI (“winmerge” in Windows, “gvimdiff” in Linux - if installed).
-a cloud_api_key, --cloud-api-key=cloud_api_key	(Semaphore 4.0.48 and later) API key required to authenticate in Smartlogic Cloud
-c url, --cloud-api-token-generation-url=url	(Semaphore 4.0.48 and later) URL used to generate the API token for Smartlogic Cloud (defaults to token)
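
The CFSERVERS value accepts either a “server:port” pair or a full URL, with 5058 assumed when no port is given. As an illustration of that rule only, the sketch below normalises such a value into a host and port. The function name is hypothetical, and applying the 5058 default to a full URL without a port is an assumption; the table above does not say how the tool treats that case.

```python
from urllib.parse import urlsplit

DEFAULT_CS_PORT = 5058  # default port stated in the option table


def parse_cf_server(value):
    """Return (host, port) for a 'server', 'server:port' or full-URL value."""
    # A value without a scheme is prefixed with '//' so that urlsplit
    # treats it as a network location rather than a path.
    parts = urlsplit(value if "://" in value else "//" + value)
    return parts.hostname, parts.port or DEFAULT_CS_PORT
```

For example, `parse_cf_server("localhost")` falls back to the default port, while `parse_cf_server("http://localhost:5059")` keeps the port given in the URL.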

Classification Server request options:

Option	Description
--test	Set request op=“TEST” (default op=“CLASSIFY”). See "Test" Request for details.
--version	Return the version of Classification Server (return the result of a “versions” request).
--statistics	Return usage statistics (return the result of a “statistics” request). See "Stats" (Statistics) Request for details.
--singlearticle	Set request <singlearticle/> option (default is to use Classification Server default)
--multiarticle	Set request <multiarticle/> option (default is to use Classification Server default)
--stylesheet	Set request <stylesheet/> option (default is that no stylesheet is applied to output)
--language=LANGUAGE	Set request <language> option to language given (default is to use Classification Server default)
--threshold=THRESHOLD	Set request <threshold> option (default is to use Classification Server default, normally configured as 48)
--title=TITLE	Set request <Title> data (default none, that is, use title from document)
--body=BODY	Set request <body> data (default none, use the body from the document)
--feedback	Set request <feedback/> option (default false)
--feedbacktextonly	Set request <feedback>TEXTONLY</feedback> option (does not mark up returned text)
--min_average_article_pagesize=MINAVERAGEARTICLEPAGESIZE	Set request <min_average_article_pagesize> option (default is to use Classification Server default, normally configured as 1.0)
--num_articles_in_singlepass=NUMARTICLESINSINGLEPASS	Set request <num_articles_in_singlepass> option (default is to use Classification Server default, normally configured as 25)
--char_count_cutoff=CHAR_COUNT_CUTOFF	Set request <char_count_cutoff> option (default is to use Classification Server default, normally configured as 0). If 0, then there is no cut-off.
--meta=META	Specifies metadata to include in request - use 'X=Y' which will translate to <META name='X' value='Y'/> in the request
--debug=DEBUG	Set request <debug> option (default none).
--use_generated_keys	Set request <use_generated_keys/> option (default none).
--legacy	Set request <legacy/> option (default none). Note CSV and HTML format outputs will fail with legacy format responses.
--classify_large_files	Classifies files larger than 50 MB.
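
When scripting bulk runs, it may be convenient to assemble the command line programmatically. The sketch below builds an argument list from a handful of the options in the two tables above, written in their long form with two leading hyphens. The function name is hypothetical. Passing the files or directories as trailing arguments, and launching the script through the current Python interpreter, are assumptions that this reference does not confirm.

```python
import sys


def build_classify_command(paths, server=None, port=None, recurse=False,
                           output_format=None, output_file=None,
                           threshold=None, meta=None, script="classify.py"):
    """Build an argument list for classify.py from a few documented options."""
    command = [sys.executable, script]
    if server is not None:
        command.append(f"--server={server}")
    if port is not None:
        command.append(f"--port={port}")
    if recurse:
        command.append("--recurse")
    if output_format is not None:
        command.append(f"--format={output_format}")
    if output_file is not None:
        command.append(f"--outputfile={output_file}")
    if threshold is not None:
        command.append(f"--threshold={threshold}")
    for name, value in (meta or {}).items():
        # Each pair becomes one --meta=X=Y option.
        command.append(f"--meta={name}={value}")
    # Assumption: files and directories are passed as trailing arguments.
    command.extend(str(path) for path in paths)
    return command
```

The returned list can be handed to `subprocess.run(command, check=True)`. Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.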

XML File Submission

Rules written for XML files submitted for classification (directly in rules or via the rulebase templates) can specifically target XML elements present in the file. This is not to be confused with an XML request submitted to Classification Server; here the request references, or includes, a separate XML file for classification. For example, if the following XML file is referenced in a classification request:

<?xml version="1.0" encoding="UTF-8"?>
<book>
  <author>A B Smith</author>
  <bio_info url="http://www.example.com/absmith">
    <born>1/1/1970</born>
    <facebook>absmith</facebook>
  </bio_info>
  <text>Some text here</text>
  <more_text>Some title text</more_text>
</book>

Classification Server supports the limited use of “XPath”-style references to the fields (see field for further details) present in the XML document. In this case, we could reference “book/text” in the “field” element of the rulebase to target the “Some text here” text. For example:

<phrase field="book/text" data="Some text here" />

To allow normal rulebases, which almost always use “body” restricted rules, to work without modification, the actual XML document elements are placed within the “body” section of the request. So, in our example, we could access the same “Some text here” information using “body/book/text”.

For convenient usage of any style of XML document the attribute values are also addressable using the XPath convention of prefixing “@” to refer to an attribute rather than an element name. From the above example:

<text field="bio_info/@url" data="http://www.example.com/absmith" />
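
To make the addressing concrete, the sketch below resolves the field paths used so far against the sample document, using Python's standard ElementTree module. It only illustrates the addressing rules described here and is not how Classification Server implements them. The function name is hypothetical, and treating a path as matching the named element anywhere in the document is an assumption based on the “bio_info/@url” example.

```python
import xml.etree.ElementTree as ET

# The sample document from this section (XML declaration omitted).
SAMPLE = """<book>
  <author>A B Smith</author>
  <bio_info url="http://www.example.com/absmith">
    <born>1/1/1970</born>
    <facebook>absmith</facebook>
  </bio_info>
  <text>Some text here</text>
  <more_text>Some title text</more_text>
</book>"""


def resolve_field(root, field):
    """Return the values addressed by a path such as 'book/text' or 'bio_info/@url'."""
    steps = [step for step in field.split("/") if step]
    if steps and steps[0] == "body":
        steps = steps[1:]  # the document sits inside the request "body"
    attribute = None
    if steps and steps[-1].startswith("@"):
        attribute = steps.pop()[1:]  # '@' addresses an attribute, not an element
    if not steps:
        return []  # this sketch needs at least one element name
    matches = []
    for start in root.iter(steps[0]):  # the root or any descendant with that name
        nodes = [start]
        for step in steps[1:]:
            nodes = [child for node in nodes for child in node if child.tag == step]
        matches.extend(nodes)
    if attribute is not None:
        return [node.get(attribute) for node in matches if node.get(attribute) is not None]
    return [(node.text or "").strip() for node in matches]
```

With `root = ET.fromstring(SAMPLE)`, the paths “book/text” and “body/book/text” resolve to the same element text, and “bio_info/@url” resolves to the attribute value.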

For the scoping of the search, attribute values are treated exactly as if they were child elements and so the values of attributes are selected for the scope of a search as well as any child elements. For example:

<phrase field="person" data="Fred Bloggs" />

will find a match in both of the following:

<person name="Fred Bloggs" />

<person>
  <name>Fred Bloggs</name>
</person>

In the first case, however, the field in which the name is found is actually “@name” rather than “name”, which allows you to distinguish between the two cases if required.
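
The scoping rule can be illustrated with a small sketch that treats attribute values as if they were child elements and reports the field in which a phrase was found. This is an illustration under stated assumptions rather than the server's matching logic. The function names are hypothetical, and a plain substring test stands in for real phrase matching.

```python
import xml.etree.ElementTree as ET


def scope_values(element):
    """Yield (field_name, text) pairs: attributes as '@name', children by tag."""
    for name, value in element.attrib.items():
        yield "@" + name, value
    for child in element:
        yield child.tag, child.text or ""


def phrase_fields(xml_text, field, phrase):
    """Return the field names under 'field' whose value contains phrase."""
    root = ET.fromstring(xml_text)
    return [
        name
        for element in root.iter(field)  # the root or any descendant named 'field'
        for name, value in scope_values(element)
        if phrase in value
    ]
```

Applied to the two `person` examples above, the attribute form reports the field “@name” and the child-element form reports “name”.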

Note: Because of this functionality, an XML file that itself contains a “body” element could cause confusion: “body” specified as a “field” value could refer to either the entire XML document or the XML element (body) itself. In this case Classification Server references the entire file, not the specific “body” element within it.
