
Semaphore Classification and Language Service (CLS)

Appendix - Additional Information

  • Last Updated: May 13, 2026

Document Text Extraction

When processing a submitted document, Classification Server first attempts to extract text and then uses the extracted text in the classification process. Without meaningful textual content there is nothing to classify. For example, an image file cannot be classified because the only text it generally contains is metadata encoded within the file structure (Classification Server does not perform OCR). For files like this it is perfectly valid for the Classification Server request to generate an error, or for the process to return no classifications at all. Indeed, a correctly configured production installation will quite commonly generate errors for certain documents.

If Classification Server is unable to extract text this could be due to any of the following situations:

  1. Classification Server does not handle the specific file format or version being used - Semaphore uses a third-party library for content extraction that is very good, but some formats are simply not handled. The general list of file formats that this library handles can be found below, but note that support for specific versions of a file format (as generated by different versions of an application) may differ. This results in errors such as “Unspecified error reading” and “Type identified correctly but no available filter to handle content”.
  2. Classification Server cannot extract text from secured documents - For security and technical reasons, Semaphore often cannot extract text from documents that have been secured with a password or encrypted in some way. This results in errors such as “Password Protected or encrypted file FILENAME” and “Could not open the FILENAME file FILENAME reason: This document requires authentication”.
  3. Classification Server cannot parse the file provided - Even if the file can be opened in the application that generated it, the third-party library may not support the specific information included in it. Often this is because the library does not handle all available variations of the format, or because the published specification it relies on does not include them (that is, the application is generating information in a format the library does not know about). This results in errors such as “Unknown exception caught!”, “Can not open the FILETYPE file FILENAME : The file is damaged and could not be repaired”, “Unspecified error reading”, “Failed to obtain content handle for FILENAME”, “Adobe PDF parse Failed on page X”, “File is invalid xml”, “XML error” and “Unspecified error reading embedded object”.
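
When reviewing logs from a bulk run, it can help to group errors by the situations listed above. The following is a minimal sketch, assuming the error text is available as a plain string; the function name and category labels are hypothetical and are not part of Semaphore, and only message fragments quoted above are used.

```python
# Message fragments are taken from the error examples quoted above.
# The category labels are hypothetical names chosen for this sketch.
ERROR_CATEGORIES = [
    ("secured", (
        "Password Protected or encrypted file",
        "This document requires authentication",
    )),
    ("unsupported_format", (
        "no available filter to handle content",
    )),
    ("unparseable", (
        "Unknown exception caught",
        "The file is damaged and could not be repaired",
        "Failed to obtain content handle",
        "Adobe PDF parse Failed",
        "File is invalid xml",
        "XML error",
        "Unspecified error reading embedded object",
    )),
]


def categorize_extraction_error(message: str) -> str:
    """Return a coarse category for a text extraction error message."""
    lowered = message.lower()
    for category, fragments in ERROR_CATEGORIES:
        if any(fragment.lower() in lowered for fragment in fragments):
            return category
    # "Unspecified error reading" is listed under more than one situation,
    # so it is reported as ambiguous rather than forced into one category.
    if "unspecified error reading" in lowered:
        return "ambiguous"
    return "unknown"


if __name__ == "__main__":
    print(categorize_extraction_error(
        "Password Protected or encrypted file report.pdf"))
```

Matching is case-insensitive and by substring, so messages that embed a file name or page number are still grouped correctly.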

Specific File Format Content Extraction Notes

The following are some comments regarding what content is extracted from specific file types:

  1. “SWF” is generally an Adobe Flash file from which Semaphore can only extract a very small amount of textual information, if any at all. There is generally no textual content that can meaningfully be used for classification.
  2. “MP3” is generally an audio format from which Semaphore can only extract a small amount of textual information, if any at all, mostly based on the ID3 tags. There is generally no textual content that can meaningfully be used for classification.
  3. “MPG” is generally a video format from which Semaphore can only extract a small amount of textual information, if any at all, mostly based on file ID only. There is generally no textual content that can meaningfully be used for classification.
  4. “OVG” is generally an “image overlay” file format that Semaphore does not support. Even if Semaphore did support such files, there is generally no textual content it could meaningfully use for classification.
  5. “MP4” is generally an audio or video format from which Semaphore can only extract a small amount of textual information, if any at all, mostly based on the metadata. There is generally no textual content that can meaningfully be used for classification.
  6. “TIF” is generally an image format from which Semaphore can only extract a small amount of textual information, if any at all, mostly based on the metadata. There is generally no textual content that can meaningfully be used for classification.
  7. “MSG” is generally an email format that may contain elements Semaphore cannot process, such as (some) embedded objects.
  8. “CHM” is generally a “Microsoft Compiled Help” file, which is an archive-based format. In archive formats, such as ZIP, Semaphore can only process the file names within the archive and not the contained files themselves, so classification of such content is generally not advisable as the results are not useful (even if they are generated).
  9. “DGN” is a CAD file format that Semaphore does not support. In such formats (much like any image file) there is generally no textual content that can meaningfully be used for classification.
  10. “RAR” and “ZIP” are generally archive-based formats. In archive formats Semaphore can only process the file names within the archive and not the contained files themselves, so classification of such content is generally not advisable as the results are not useful (even if they are generated).
  11. “PUB” is generally a Microsoft Publisher file from which Semaphore can only extract a small amount of textual information, if any at all, mostly based on file ID only. There is generally no textual content that can meaningfully be used for classification.

It is recommended that you exclude from classification those file types that Semaphore cannot handle or from which little meaningful data can be extracted.
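
One way to apply this recommendation is to filter files by extension before they are submitted. The following is a minimal sketch, assuming the extensions from the notes above are the ones to skip; the function name and the extension set are illustrative and should be adjusted to your own content. MSG is deliberately left out of the set because email text is usually still extractable.

```python
from pathlib import Path

# Extensions taken from the notes above; extend or trim for your own content.
LOW_VALUE_EXTENSIONS = {
    ".swf", ".mp3", ".mpg", ".ovg", ".mp4", ".tif",
    ".chm", ".dgn", ".rar", ".zip", ".pub",
}


def split_by_classifiability(paths):
    """Split paths into (to_classify, to_exclude) based on file extension."""
    to_classify, to_exclude = [], []
    for path in map(Path, paths):
        # Compare in lower case so "SONG.MP3" and "song.mp3" are treated alike.
        if path.suffix.lower() in LOW_VALUE_EXTENSIONS:
            to_exclude.append(path)
        else:
            to_classify.append(path)
    return to_classify, to_exclude


if __name__ == "__main__":
    keep, skip = split_by_classifiability(["report.docx", "song.MP3"])
    print("classify:", [p.name for p in keep])
    print("exclude:", [p.name for p in skip])
```

The excluded paths could also be written to a file for use with the exclude option of classify.py; the expected layout of that file is not described in this appendix, so check it before relying on it.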

Supported document formats

The following is a sample of the many file formats supported by Classification Server. Contact Progress if a format you require is not listed, as it may well be supported.

Note: Although many formats are supported, the information that Semaphore uses is limited to the text-only content present in the format (including metadata). For example, labels on images may well not be extracted as text content.

Archive Version
7z (BZIP2 and split archives not supported)
7z Self Extracting exe (BZIP2 and split archives not supported)
LZA Self Extracting Compress
LZH Compress
Microsoft Office Binder 95-97
Microsoft Cabinet (CAB)
RAR 1.5, 2.0, 2.9
Self-extracting .exe
UNIX Compress
UNIX GZip
UNIX tar
Uuencode
Zip PKZip
Zip WinZip
Zip Zip64

Database Version
DataEase 4.x
DBase III, IV, V
First Choice DB Through 3.0
Framework DB 3.0
Microsoft Access 1.0, 2.0, 95-2013
Microsoft Access Report Snapshot (File ID only) 2000 - 2003
Microsoft Works DB for DOS 2.0
Microsoft Works DB for Macintosh 2.0
Microsoft Works DB for Windows 3.0, 4.0
Microsoft Works DB for DOS 1.0
Paradox for DOS 2.0 - 4.0
Paradox for Windows 1.0
Q&A Database Through 2.0
R:Base R:Base 5000
R:Base R:Base System V
Reflex 2.0
SmartWare II DB 1.02

Email Version
Apple Mail Message (EMLX) 2.0
Encoded mail messages MHT
Encoded mail messages Multi Part Alternative
Encoded mail messages Multi Part Digest
Encoded mail messages Multi Part Mixed
Encoded mail messages Multi Part News Group
Encoded mail messages Multi Part Signed
Encoded mail messages TNEF
EML with Digital Signature SMIME
IBM Lotus Notes Domino XML Language DXL 8.5
IBM Lotus Notes NSF (File ID) 7.x, 8.x
IBM Lotus Notes NSF (Windows, Linux x86-32 and Oracle Solaris 32-bit only with Notes Client or Domino Server) 8.x
MBOX Mailbox RFC 822
Microsoft Outlook (MSG) 97 - 2013
Microsoft Outlook Express (EML)
Microsoft Outlook Forms Template (OFT) 97 - 2013
Microsoft Outlook OST 97 - 2013
Microsoft Outlook PST 97 - 2013
Microsoft Outlook PST(Mac) 2001
MSG with Digital Signature SMIME

Multimedia Version
AVI (Metadata extraction only)
Flash (text extraction only) 6.x, 7.x, Lite
Flash (File ID only) 9, 10
Real Media (File ID only)
MP3 (ID3 metadata only)
MPEG-1 Audio layer 3 V ID3 v1 (File ID only)
MPEG-1 Audio layer 3 V ID3 v2 (File ID only)
MPEG-1 Video V 2 (File ID only)
MPEG-1 Video V 3 (File ID only)
MPEG-2 Audio (File ID only)
MPEG-4 (Metadata extraction only)
MPEG-7 (Metadata extraction only)
QuickTime (Metadata extraction only)
Windows Media ASF (Metadata extraction only)
Windows Media DVR-MS (Metadata extraction only)
Windows Media Audio WMA (Metadata extraction only)
Windows Media Playlist (File ID only)
Windows Media Video WMV (Metadata extraction only)
WAV (Metadata extraction only)

Other Formats Version
AOL Messenger (File ID only) 7.3
Microsoft InfoPath (File ID only) 2007
Microsoft Live Messenger (via XML filter) 10.0
Microsoft Office Theme files (File ID only) 2007-2013
Microsoft OneNote (File ID only) 2007, 2010, 2013
Microsoft Project (table view only) 98 - 2003
Microsoft Project (table view only) 2007, 2010
Microsoft Windows Compiled Help (File ID only) .chm
Microsoft Windows DLL
Microsoft Windows Executable
Microsoft Windows Explorer Command (File ID only) .scf
Microsoft Windows Help (File ID only) .hlp
Microsoft Windows Shortcut (File ID only) .lnk
Trillian Text Log File (via text filter) 4.2
Trillian XML Log File (File ID only) 4.2
TrueType Font (File ID only) ttf, ttc
vCalendar 2.1
vCard 2.1
Yahoo! Messenger 6.x - 8

Presentation Version
Apple iWork Keynote (text and PDF preview) 09
Harvard Graphics Presentation DOS 3.0
IBM Lotus Symphony Presentations 1.x
Kingsoft WPS Presentation 2010
LibreOffice Impress 4.x
Lotus Freelance 1.0-Millennium 9.8
Lotus Freelance for OS/2 2
Lotus Freelance for Windows 95, 97, SmartSuite 9.8
Microsoft PowerPoint for Macintosh 4.0 - 2011
Microsoft PowerPoint for Windows 3.0 - 2013
Microsoft PowerPoint for Windows Slideshow 2007 - 2013
Microsoft PowerPoint for Windows Template 2007 - 2013
Novell Presentations 3.0, 7.0
OpenOffice Impress 1.1, 3.0
Oracle Open Office Impress 3.x
StarOffice Impress 5.2 - 9.0
WordPerfect Presentations 5.1 - X

Raster Image Version
Adobe Photoshop 4.0
Adobe Photoshop (File ID only)
Adobe Photoshop CS1-6
CALS Raster (GP4) Type I
CALS Raster (GP4) Type II
Computer Graphics Metafile ANSI
Computer Graphics Metafile CALS
Computer Graphics Metafile NIST
Encapsulated PostScript (EPS) TIFF header Only
GEM Image (Bitmap)
Graphics Interchange Format (GIF)
IBM Graphics Data Format (GDF) 1.0
IBM Picture Interchange Format 1.0
JBIG2 Graphic Embeddings in PDF
JFIF (JPEG not in TIFF format)
JPEG
JPEG 2000 JP2
Kodak Flash Pix
Kodak Photo CD 1.0
Lotus PIC
Lotus Snapshot
Macintosh PICT BMP only
Macintosh PICT2 BMP only
MacPaint
Microsoft Windows Bitmap
Microsoft Windows Cursor
Microsoft Windows Icon
OS/2 Bitmap
OS/2 Warp Bitmap
Paint Shop Pro (Win32 only) 5.0, 6.0
PC Paintbrush (PCX)
PC Paintbrush DCX (multi-page PCX)
Portable Bitmap (PBM)
Portable Graymap PGM
Portable Network Graphics (PNG)
Portable Pixmap (PPM)
Progressive JPEG
StarOffice Draw 6.x - 9.0
Sun Raster
TIFF Group 5 & 6
TIFF CCITT Group 3 & 4
TruVision TGA (Targa) 2.0
Word Perfect Graphics 1.0
WBMP wireless graphics format
X-Windows Bitmap x10 compatible
X-Windows Dump x10 compatible
X-Windows Pixmap x10 compatible
WordPerfect Graphics 2.0 - 10.0

Spreadsheet Version
Apple iWork Numbers (text and PDF preview) 09
Enable Spreadsheet 3.0 - 4.5
First Choice SS Through 3.0
Framework SS 3.0
IBM Lotus Symphony Spreadsheets 1.x
Kingsoft WPS Spreadsheets 2010
Lotus 1-2-3 Through Millennium 9.8
Lotus 1-2-3 Charts (DOS and Windows) Through 5.0
Lotus 1-2-3 for OS/2 2.0
Microsoft Excel Charts 2.x - 2007
Microsoft Excel for Macintosh 98 - 2011
Microsoft Excel for Windows 3.0 - 2013
Microsoft Excel for Windows (text only) 2003 XML
Microsoft Excel for Windows (.xlsb) 2007 - 2013 (Binary)
Microsoft Works SS for DOS 2.0
Microsoft Works SS for Macintosh 2.0
Microsoft Works SS for Windows 3.0, 4.0
Multiplan 4.0
Novell PerfectWorks Spreadsheet 2.0
OpenOffice Calc 1.1 - 3.0
Oracle Open Office Calc 3.x
PFS: Plan 1.0
QuattroPro for DOS Through 5.0
QuattroPro for Windows Through X6
SmartWare Spreadsheet
SmartWare II SS 1.02
StarOffice Calc 5.2 - 9.0
SuperCalc 5.0
Symphony Through 2.0
VP-Planner 1.0

Text & Markup Version
ANSI Text 7 & 8 bit
ASCII Text 7 & 8 bit
DOS character set
EBCDIC
HTML (CSS rendering not supported) 1.0 - 5.0
IBM DCA/RFT
Macintosh character set
Rich Text Format (RTF)
Unicode Text 3.0, 4.0
UTF-8
Wireless Markup Language
XML (text only)
XHTML (file ID only) 1.0

Vector Image Version
Adobe Illustrator 4.0 - 7.0
Adobe Illustrator (PDF Preview only) 9.0, CS1-6
Adobe Illustrator XMP CS1-6
Adobe InDesign XMP CS1-6
Adobe InDesign Interchange (XMP only)
Adobe PDF 1.0 - 1.7 (Acrobat 1 - 10)
Adobe PDF Package 1.7 (Acrobat 8 - 10)
Adobe PDF Portfolio 1.7 (Acrobat 8 - 10)
Ami Draw SDW
AutoCAD Drawing 2.5, 2.6
AutoCAD Drawing 9.0 - 14.0
AutoCAD Drawing 2000i - 2013
AutoShade Rendering 2
Corel Draw 2.0 - 9.0
Corel Draw Clipart 5.0, 7.0
Enhanced Metafile (EMF)
Escher graphics
FrameMaker Graphics (FMV) 3.0 - 5.0
Gem File (Vector)
Harvard Graphics Chart DOS 2.0 - 3.0
Harvard Graphics for Windows
HP Graphics Language 2.0
IGES Drawing 5.1 - 5.3
Micrografx Designer through 3.1
Micrografx Designer 6.0
Micrografx Draw through 4.0
Microsoft XPS (Text only)
Novell PerfectWorks Draw 2
OpenOffice Draw 1.1 - 3.0
Oracle Open Office Draw 3.x
Visio (Page Preview mode WMF/EMF) 4.0
Visio 5.0 - 2010
Visio (text only) 2013
Visio XML VSX (File ID only) 2007
Windows Metafile

Word Processing Version
Adobe FrameMaker (MIF only) 3.0 - 6.0
Adobe Illustrator Postscript Level 2
Ami
Ami Pro for OS2
Ami Pro for Windows 2.0, 3.0
Apple iWork Pages (text and PDF preview) 09
DEC DX Through 4.0
DEC DX Plus 4.0, 4.1
Enable Word Processor 3.0 - 4.5
First Choice WP 1.0, 3.0
Framework WP 3.0
Hangul 97 - 2007
IBM DCA/FFT
IBM DisplayWrite 2.0 - 5.0
IBM Writing Assistant 1.01
Ichitaro 5.0, 6.0, 8.0 - 13.0, 2004, 2010, 2013
JustWrite Through 3.0
Kingsoft WPS Writer 2010
Legacy 1.1
Lotus Manuscript Through 2.0
Lotus WordPro (text only) 9.7, 96 - Millennium 9.8
MacWrite II 1.1
Mass 11 through 8.0
Microsoft Publisher (File ID only) 2003 - 2007
Microsoft Word for DOS 4.0 - 6.0
Microsoft Word for Macintosh 4.0 - 6.0, 98 - 2011
Microsoft Word for Windows 1.0 - 2013
Microsoft Word for Windows (text only) 2003 XML
Microsoft Word for Windows 98-J
Microsoft WordPad
Microsoft Works WP for DOS 2.0
Microsoft Works WP for Macintosh 2.0
Microsoft Works WP for Windows 3.0, 4.0
Microsoft Write for Windows 1.0 - 3.0
MultiMate Through 4.0
MultiMate Advantage 2.0
Navy DIF
Nota Bene 3.0
Novell PerfectWorks Word Processor 2.0
OfficeWriter 4.0 - 6.0
OpenOffice Writer 1.1 - 3.0
Oracle Open Office Writer 3.x
PC File Doc 5.0
PFS: Write A, B
Professional Write for DOS 1.0, 2.0
Professional Write Plus for Windows 1.0
Q&A Write 2.0, 3.0
Samna Word IV 1.0 - 3.0
Samna Word IV+
Samsung JungUm Global (File ID only)
Signature 1.0
SmartWare II WP 1.02
Sprint 1.0
StarOffice Writer 5.2 - 9.0
Total Word 1.2
Wang IWP Through 2.6
WordMarc Composer
WordMarc Composer+
WordMarc Word Processor
WordPerfect for DOS 4.2
WordPerfect for Macintosh 1.02 - 3.1
WordPerfect for Windows 5.1 - X5
Wordstar 2000 for DOS 1.0 - 3.0
Wordstar 2000 for DOS 2.0, 3.0
Wordstar for DOS 3.0 - 7.0
Wordstar for Windows 1.0
XyWrite Through III+

“classify.py/classify.exe” full command reference

The “options” are as follows:

Option Description
-h, --help Show a help message and exit
-s SERVER, --server=SERVER Server to use (defaults to the local machine). As of Semaphore 4.0.48 you can specify the full Classification Server URL in this parameter if required (for example, if using Semaphore Cloud you can use the CS URL from the Basic API).
-t, --transfer Transfer file data with the request. The default is to transfer file data if a server is specified; otherwise file:{full_path} is used in the request.
-n, --notransfer Do not transfer file data; use the full path to the file in the request
-p PORT, --port=PORT Port to use (default is 5058, but the default may be changed by setting the CS_PORT environment variable)
-r, --recurse Recurse into subdirectories
--limit=LIMIT Specifies a limit on the number of files to classify (default is no limit)
--start=START Specifies the start position for files to classify, that is, skips the specified number of files (default is 0)
--repeat=REPEAT Classify the same documents REPEAT times (default is 1)
-o OUTPUTFILE, --outputfile=OUTPUTFILE Specifies a file to write output to (default is to write to the console)
-f FORMAT, --format=FORMAT Specifies the formatting applied to output (default is “XML”). Values allowed for “FORMAT” are “XML”, “CSV” or “HTML”. Use “--formatfile” to specify an XSLT transform to apply to the output when using “XSL”.
--formatfile=FORMATFILE Specifies the XSLT transform that is applied to each classification response. The output is simply the concatenation of these outputs. <?xml version="1.0"?><start/> and <?xml version="1.0"?><finish/> are passed at the start and finish of bulk (or single) classifications to allow the XSLT to specify a header and footer for the output.
--exclude=EXCLUDEFILELIST Excludes any files listed in the EXCLUDEFILELIST file
--filelist=INCLUDEFILELIST Includes files listed in the INCLUDEFILELIST file
--record=RECORDFILE Appends a list of the files classified to the RECORDFILE file
--nothreads Do not use separate threads for the classifications
--use_csreq_files Use “csreq” files for request options if they exist
--only_csreq_files Only classify files that have a “csreq” file
--make_csreq_files Creates “csreq” files on disk rather than sending the request to Classification Server
--cf=CFSERVERS Compare results between two servers and only output differences. The CFSERVERS format is “server:port” (for example, localhost:5058) or, as of Semaphore 4.0.48, the full URL (for example, http://localhost:5058). Multiple instances of this parameter can be specified. If a “port” is not specified, the port number on the server is assumed to be 5058.
--cfi=CFISERVERS Interactive version of the “--cf” parameter, which displays differences in a GUI (“winmerge” on Windows, gvimdiff on Linux, if installed)
-a cloud_api_key, --cloud-api-key=cloud_api_key (Semaphore 4.0.48 and later) API key required to authenticate in Smartlogic Cloud
-c url, --cloud-api-token-generation-url=url (Semaphore 4.0.48 and later) URL used to generate the API token for Smartlogic Cloud (defaults to token)
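
When the tool is driven from a script, the options above can be assembled programmatically. The following is a minimal sketch that only builds the argument list and does not run anything. The function name and the example host name are hypothetical. Long options are written with two hyphens, and passing the file or directory to classify as a trailing positional argument is an assumption, since the usage line is not shown in this appendix.

```python
def build_classify_args(target, server=None, port=None, recurse=False,
                        limit=None, output_file=None, output_format=None):
    """Assemble classify.py arguments from a subset of the options above."""
    args = []
    if server is not None:
        args.append(f"--server={server}")
    if port is not None:
        args.append(f"--port={port}")
    if recurse:
        args.append("--recurse")
    if limit is not None:
        args.append(f"--limit={limit}")
    if output_file is not None:
        args.append(f"--outputfile={output_file}")
    if output_format is not None:
        args.append(f"--format={output_format}")
    # Assumption: the file or directory to classify is given last.
    args.append(str(target))
    return args


if __name__ == "__main__":
    # "cs.example.com" is a placeholder host name.
    argv = ["classify.py"] + build_classify_args(
        "docs", server="cs.example.com", port=5058,
        recurse=True, output_format="CSV")
    print(" ".join(argv))
```

Building a list rather than a single string keeps each option as a separate argument, which avoids quoting problems if the list is later handed to a process launcher.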

Classification Server request options:

Option Description
--test Set request op="TEST" (default op="CLASSIFY"). See "Test" Request for details.
--version Return the version of Classification Server (returns the result of a “versions” request).
--statistics Return usage statistics (returns the result of a “statistics” request). See "Stats" (Statistics) Request for details.
--singlearticle Set request <singlearticle/> option (default is to use the Classification Server default)
--multiarticle Set request <multiarticle/> option (default is to use the Classification Server default)
--stylesheet Set request <stylesheet/> option (default is that no stylesheet is applied to the output)
--language=LANGUAGE Set request <language> option to the language given (default is to use the Classification Server default)
--threshold=THRESHOLD Set request <threshold> option (default is to use the Classification Server default, normally configured as 48)
--title=TITLE Set request <Title> data (default none, that is, use the title from the document)
--body=BODY Set request <body> data (default none, that is, use the body from the document)
--feedback Set request <feedback/> option (default false)
--feedbacktextonly Set request <feedback>TEXTONLY</feedback> option (does not mark up the returned text)
--min_average_article_pagesize=MINAVERAGEARTICLEPAGESIZE Set request <min_average_article_pagesize> option (default is to use the Classification Server default, normally configured as 1.0)
--num_articles_in_singlepass=NUMARTICLESINSINGLEPASS Set request <num_articles_in_singlepass> option (default is to use the Classification Server default, normally configured as 25)
--char_count_cutoff=CHAR_COUNT_CUTOFF Set request <char_count_cutoff> option (default is to use the Classification Server default, normally configured as 0). If 0, there is no cut-off.
--meta=META Specifies metadata to include in the request. Use 'X=Y', which translates to <META name='X' value='Y'/> in the request.
--debug=DEBUG Set request <debug> option (default none).
--use_generated_keys Set request <use_generated_keys/> option (default none).
--legacy Set request <legacy/> option (default none). Note that CSV and HTML format outputs will fail with legacy format responses.
--classify_large_files Classifies files more than 50 MB in size.
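
The translation described for the meta option can be illustrated as follows. This is a sketch of the mapping only, not the code used by classify.py: the function names are hypothetical, splitting on the first “=” is an assumption, and the attribute quoting produced by the standard library helper (double quotes) may differ from what the tool actually emits.

```python
from xml.sax.saxutils import quoteattr


def parse_meta(pair):
    """Split an 'X=Y' string into (name, value) at the first '='."""
    name, separator, value = pair.partition("=")
    if not separator or not name:
        raise ValueError(f"expected NAME=VALUE, got {pair!r}")
    return name, value


def meta_element(pair):
    """Render the request element that an 'X=Y' pair is described as becoming."""
    name, value = parse_meta(pair)
    # quoteattr adds the surrounding quotes and escapes XML special characters.
    return f"<META name={quoteattr(name)} value={quoteattr(value)}/>"


if __name__ == "__main__":
    print(meta_element("X=Y"))
```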

XML File Submission

Rules written for XML files submitted for classification (directly in rules or via the rulebase templates) can specifically target XML elements present in the file. This is not to be confused with an XML request submitted to Classification Server; here the request references, or includes, a separate XML file for classification. For example, if the following XML file is referenced in a classification request:

<?xml version="1.0" encoding="UTF-8"?>
<book>
  <author>A B Smith</author>
  <bio_info url="http://www.example.com/absmith">
    <born>1/1/1970</born>
    <facebook>absmith</facebook>
  </bio_info>
  <text>Some text here</text>
  <more_text>Some title text</more_text>
</book> 


Classification Server supports limited use of “XPath”-style references to the fields (see field for further details) present in the XML document. In this case, we could reference “book/text” in the “field” element of the rulebase to target the “Some text here” text. For example:

<phrase field="book/text" data="Some text here" />

To allow normal rulebases, which almost always use “body” restricted rules, to work without modification, the actual XML document elements are placed within the “body” section of the request. So, in our example, we could access the same “Some text here” information using “body/book/text”.

So that any style of XML document can be used conveniently, attribute values are also addressable, using the XPath convention of prefixing “@” to refer to an attribute rather than an element name. From the above example:

<text field="bio_info/@url" data="http://www.example.com/absmith" />
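
To make the addressing concrete, the following sketch resolves field references against the sample document using the Python standard library. It is an illustration only and not Classification Server's implementation: the function name is hypothetical, and matching a field as a suffix of the element path is an assumption inferred from the “book/text” and “bio_info/@url” examples above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<book>
  <author>A B Smith</author>
  <bio_info url="http://www.example.com/absmith">
    <born>1/1/1970</born>
    <facebook>absmith</facebook>
  </bio_info>
  <text>Some text here</text>
  <more_text>Some title text</more_text>
</book>"""


def field_values(xml_text, field):
    """Return the values addressed by a field such as "book/text"."""
    parts = field.split("/")
    if parts[0] == "body":
        # "body/book/text" addresses the same content as "book/text".
        parts = parts[1:]
    matches = []

    def walk(element, parent_path):
        path = parent_path + [element.tag]
        text = (element.text or "").strip()
        # Assumption: a field matches when it is a suffix of the element path.
        if text and path[-len(parts):] == parts:
            matches.append(text)
        # Attributes are addressed as "@name" beneath their element.
        for name, value in element.attrib.items():
            if (path + ["@" + name])[-len(parts):] == parts:
                matches.append(value)
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return matches


if __name__ == "__main__":
    for field in ("book/text", "body/book/text", "bio_info/@url"):
        print(field, "->", field_values(SAMPLE, field))
```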


For the scoping of a search, attribute values are treated exactly as if they were child elements, so the scope of a search includes the values of attributes as well as any child elements. For example:

<phrase field="person" data="Fred Bloggs" />


will find a match in

<person name="Fred Bloggs" /> 

or

<person>
  <name>Fred Bloggs</name>
</person>

In the first case, though, the field in which the name is found is actually “@name” rather than “name”, which allows you to distinguish between the two cases if required.
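
The scoping behavior can be sketched in a few lines. This is an illustration of the rule as described, not Classification Server's matching logic; the function names are hypothetical and a plain substring test stands in for phrase matching.

```python
import xml.etree.ElementTree as ET


def scope_text(element):
    """Collect the text in an element's scope: its attribute values, its
    own text, and everything within its child elements."""
    pieces = list(element.attrib.values())
    if element.text and element.text.strip():
        pieces.append(element.text.strip())
    for child in element:
        pieces.extend(scope_text(child))
        # Text that follows a child still belongs to the enclosing element.
        if child.tail and child.tail.strip():
            pieces.append(child.tail.strip())
    return pieces


def phrase_in_field(xml_text, field, phrase):
    """Return True if the phrase occurs within the scope of any element
    whose name equals the field."""
    root = ET.fromstring(xml_text)
    return any(phrase in " ".join(scope_text(element))
               for element in root.iter(field))


if __name__ == "__main__":
    print(phrase_in_field('<person name="Fred Bloggs" />',
                          "person", "Fred Bloggs"))
    print(phrase_in_field('<person><name>Fred Bloggs</name></person>',
                          "person", "Fred Bloggs"))
```

Both calls find the phrase, because the attribute value in the first document is collected in the same way as the child element text in the second.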

Note: Because of this functionality, an XML file that itself contains a “body” element could cause confusion, as “body” specified as a “field” value could refer to either the entire XML document or the XML element (body) itself. In this case Classification Server references the entire file, not the specific “body” element within it.
