Appendix - Additional Information
- Last Updated: May 13, 2026
Document Text Extraction
When processing a submitted document, Classification Server first attempts to extract text and then uses the extracted text in the classification process. Without meaningful textual content there is nothing to classify. For example, an image file cannot be classified because the only text it generally contains is meta information encoded within the file structure (Classification Server does not perform OCR). For files like this it is perfectly valid for the Classification Server request to generate an error or to return no classifications at all. Indeed, it is quite common for a correctly configured production installation to regularly report errors for certain documents.
If Classification Server is unable to extract text this could be due to any of the following situations:
- Classification Server does not handle the specific file format/version that is being used - Semaphore uses a third party library for content extraction that is very good but there are some formats that are simply not handled. The general list of file formats which this third party handles can be found below but note that support for things like specific versions of the file format (that are generated by different versions of applications) may differ. This results in errors such as “Unspecified error reading”, and “Type identified correctly but no available filter to handle content”.
- Classification Server cannot extract text from secured documents - Often for security and technical reasons Semaphore cannot extract text from documents that have been secured with a password or encrypted in some way. This results in errors such as “Password Protected or encrypted file FILENAME” and “Could not open the FILENAME file FILENAME reason: This document requires authentication”.
- Classification Server cannot parse the file provided - Even if the file can be opened in the application that generated it, the third party libraries may not be able to support the specific information included in it. Often this is because the third party does not handle all available variations of the format, or because the published specification it uses does not include them (that is, the application is generating information in a format the third party does not know about). This results in errors such as “Unknown exception caught!”, “Can not open the FILETYPE file FILENAME : The file is damaged and could not be repaired”, “Unspecified error reading”, “Failed to obtain content handle for FILENAME”, “Adobe PDF parse Failed on page X”, “File is invalid xml”, “XML error” and “Unspecified error reading embedded object”.
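When triaging extraction failures in bulk, it can help to map an error message back to the situations listed above. The following Python sketch does this with simple substring matching. The `CAUSES` table and the `likely_causes` helper are hypothetical names introduced for illustration (they are not part of Semaphore), and the fragments are only the example messages quoted above, so the mapping is not exhaustive. Note that “Unspecified error reading” is quoted under two situations and is therefore reported as ambiguous.

```python
# Message fragments quoted in the list above, grouped by likely cause.
# Assumption: matching is done on case-insensitive substrings; the real
# messages may carry file names or page numbers around these fragments.
CAUSES = {
    "unsupported_format": [
        "Unspecified error reading",
        "Type identified correctly but no available filter to handle content",
    ],
    "secured_document": [
        "Password Protected or encrypted file",
        "This document requires authentication",
    ],
    "parse_failure": [
        "Unknown exception caught!",
        "The file is damaged and could not be repaired",
        "Unspecified error reading",
        "Failed to obtain content handle for",
        "Adobe PDF parse Failed on page",
        "File is invalid xml",
        "XML error",
    ],
}


def likely_causes(message):
    """Return the sorted list of causes whose fragments occur in message."""
    lowered = message.lower()
    return sorted(
        cause
        for cause, fragments in CAUSES.items()
        if any(fragment.lower() in lowered for fragment in fragments)
    )


if __name__ == "__main__":
    print(likely_causes("Password Protected or encrypted file report.pdf"))
    print(likely_causes("Unspecified error reading"))
```

An empty result means the message does not match any of the quoted examples and needs manual inspection.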
Specific File Format Content Extraction Notes
The following are some comments regarding what content is extracted from specific file types:
- “SWF” is generally an Adobe Flash file from which Semaphore can extract only a very small amount of textual information, if any at all. There is generally no textual content that can meaningfully be used for classification.
- “MP3” is generally an audio format from which Semaphore can extract only a small amount of textual information, if any at all, mostly based on the ID3 tags. There is generally no textual content that can meaningfully be used for classification.
- “MPG” is generally a video format from which Semaphore can extract only a small amount of textual information, if any at all, mostly based on file ID only. There is generally no textual content that can meaningfully be used for classification.
- “OVG” is generally an “image overlay” file format that Semaphore does not support. With such files, even if Semaphore did support them, there is generally no textual content it can meaningfully use for classification.
- “MP4” is generally an audio or video format that Semaphore can only extract a small amount of textual information from, if any at all, mostly based on the metadata. There is generally no textual content that can meaningfully be used for classification.
- “TIF” is generally an image format from which Semaphore can extract only a small amount of textual information, if any at all, mostly based on the metadata. There is generally no textual content that can meaningfully be used for classification.
- “MSG” is generally an email format that may contain elements Semaphore cannot process such as (some) embedded objects.
- “CHM” is generally a “Microsoft Compiled Help” file, which is an archive-based format. In archive formats, such as ZIP, Semaphore can only process the file names within the archive and not the contained files themselves, so classification of such content is generally not advisable as the results are not useful (even if they are generated).
- “DGN” is a CAD file format which Semaphore does not support. From such formats (much like any image file) there is generally no textual content that can meaningfully be used for classification.
- “RAR” and “ZIP” are generally archive-based formats. In archive formats Semaphore can only process the file names within the archive and not the contained files themselves, so classification of such content is generally not advisable as the results are not useful (even if they are generated).
- “PUB” is generally a Microsoft Publisher file from which Semaphore can extract only a small amount of textual information, if any at all, mostly based on file ID only. There is generally no textual content that can meaningfully be used for classification.
For those file types Semaphore cannot handle or from which little meaningful data can be extracted it is recommended that you exclude them from classification.
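One way to act on this recommendation is to filter candidate files by extension before submitting them. The Python sketch below is illustrative only: `LOW_VALUE_EXTENSIONS` and `split_by_extractability` are hypothetical names, and the extension set is taken from the notes above (“MSG” is left out because email text is generally extracted). The excluded paths could be used to prepare a file for the “exclude” option of classify.py, but the expected layout of that file is not described in this section, so the sketch stops at producing the two lists.

```python
from pathlib import Path

# Extensions that the notes above describe as unsupported or as yielding
# little or no text that is useful for classification.
LOW_VALUE_EXTENSIONS = {
    ".swf", ".mp3", ".mpg", ".ovg", ".mp4", ".tif",
    ".chm", ".dgn", ".rar", ".zip", ".pub",
}


def split_by_extractability(paths):
    """Split paths into (keep, exclude) based on the file extension."""
    keep, exclude = [], []
    for path in paths:
        suffix = Path(path).suffix.lower()  # compare case-insensitively
        if suffix in LOW_VALUE_EXTENSIONS:
            exclude.append(str(path))
        else:
            keep.append(str(path))
    return keep, exclude


if __name__ == "__main__":
    keep, exclude = split_by_extractability(
        ["report.docx", "jingle.MP3", "bundle.zip", "mail.msg"]
    )
    print("classify:", keep)
    print("exclude:", exclude)
```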
Supported document formats
The following is a sample of the many file formats supported by Classification Server. Contact Progress if a format you require is not listed (as it very well may be supported).
Note: Although many formats are supported please be aware that the information present that is used by Semaphore is limited to the text-only content present in the format (including meta data). For example, labels on images may very well not be extracted as text content.
| Archive | Version |
|---|---|
| 7z (BZIP2 and split archives not supported) | |
| 7z Self Extracting exe (BZIP2 and split archives not supported) | |
| LZA Self Extracting Compress | |
| LZH Compress | |
| Microsoft Office Binder | 95-97 |
| Microsoft Cabinet (CAB) | |
| RAR | 1.5, 2.0, 2.9 |
| Self-extracting .exe | |
| UNIX Compress | |
| UNIX GZip | |
| UNIX tar | |
| Uuencode | |
| Zip | PKZip |
| Zip | WinZip |
| Zip | Zip64 |
| Database | Version |
|---|---|
| DataEase | 4.x |
| DBase | III, IV, V |
| First Choice DB | Through 3.0 |
| Framework DB | 3.0 |
| Microsoft Access | 1.0, 2.0, 95-2013 |
| Microsoft Access Report Snapshot (File ID only) | 2000 - 2003 |
| Microsoft Works DB for DOS | 2.0 |
| Microsoft Works DB for Macintosh | 2.0 |
| Microsoft Works DB for Windows | 3.0, 4.0 |
| Microsoft Works DB for DOS | 1.0 |
| Paradox for DOS | 2.0 - 4.0 |
| Paradox for Windows | 1.0 |
| Q&A Database | Through 2.0 |
| R:Base | R:Base 5000 |
| R:Base | R:Base System V |
| Reflex | 2.0 |
| SmartWare II DB | 1.02 |
| Email | Version |
|---|---|
| Apple Mail Message (EMLX) | 2.0 |
| Encoded mail messages | MHT |
| Encoded mail messages | Multi Part Alternative |
| Encoded mail messages | Multi Part Digest |
| Encoded mail messages | Multi Part Mixed |
| Encoded mail messages | Multi Part News Group |
| Encoded mail messages | Multi Part Signed |
| Encoded mail messages | TNEF |
| EML with Digital Signature | SMIME |
| IBM Lotus Notes Domino XML Language DXL | 8.5 |
| IBM Lotus Notes NSF (File ID) | 7.x, 8.x |
| IBM Lotus Notes NSF (Windows, Linux x86-32 and Oracle Solaris 32-bit only with Notes Client or Domino Server) | 8.x |
| MBOX Mailbox | RFC 822 |
| Microsoft Outlook (MSG) | 97 - 2013 |
| Microsoft Outlook Express (EML) | |
| Microsoft Outlook Forms Template (OFT) | 97 - 2013 |
| Microsoft Outlook OST | 97 - 2013 |
| Microsoft Outlook PST | 97 - 2013 |
| Microsoft Outlook PST(Mac) | 2001 |
| MSG with Digital Signature | SMIME |
| Multimedia | Version |
|---|---|
| AVI (Metadata extraction only) | |
| Flash (text extraction only) | 6.x, 7.x, Lite |
| Flash (File ID only) | 9, 10 |
| Real Media (File ID only) | |
| MP3 (ID3 metadata only) | |
| MPEG-1 Audio layer 3 V ID3 v1 (File ID only) | |
| MPEG-1 Audio layer 3 V ID3 v2 (File ID only) | |
| MPEG-1 Video V 2 (File ID only) | |
| MPEG-1 Video V 3 (File ID only) | |
| MPEG-2 Audio (File ID only) | |
| MPEG-4 (Metadata extraction only) | |
| MPEG-7 (Metadata extraction only) | |
| QuickTime (Metadata extraction only) | |
| Windows Media ASF (Metadata extraction only) | |
| Windows Media DVR-MS (Metadata extraction only) | |
| Windows Media Audio WMA (Metadata extraction only) | |
| Windows Media Playlist (File ID only) | |
| Windows Media Video WMV (Metadata extraction only) | |
| WAV (Metadata extraction only) | |
| Other Formats | Version |
|---|---|
| AOL Messenger (File ID only) | 7.3 |
| Microsoft InfoPath (File ID only) | 2007 |
| Microsoft Live Messenger (via XML filter) | 10.0 |
| Microsoft Office Theme files (File ID only) | 2007-2013 |
| Microsoft OneNote (File ID only) | 2007, 2010, 2013 |
| Microsoft Project (table view only) | 98 - 2003 |
| Microsoft Project (table view only) | 2007, 2010 |
| Microsoft Windows Compiled Help (File ID only) | .chm |
| Microsoft Windows DLL | |
| Microsoft Windows Executable | |
| Microsoft Windows Explorer Command (File ID only) | .scf |
| Microsoft Windows Help (File ID only) | .hlp |
| Microsoft Windows Shortcut (File ID only) | .lnk |
| Trillian Text Log File (via text filter) | 4.2 |
| Trillian XML Log File (File ID only) | 4.2 |
| TrueType Font (File ID only) | ttf, ttc |
| vCalendar | 2.1 |
| vCard | 2.1 |
| Yahoo! Messenger | 6.x - 8 |
| Presentation | Version |
|---|---|
| Apple iWork Keynote (text and PDF preview) | 09 |
| Harvard Graphics Presentation DOS | 3.0 |
| IBM Lotus Symphony Presentations | 1.x |
| Kingsoft WPS Presentation | 2010 |
| LibreOffice Impress | 4.x |
| Lotus Freelance | 1.0-Millennium 9.8 |
| Lotus Freelance for OS/2 | 2 |
| Lotus Freelance for Windows | 95, 97, SmartSuite 9.8 |
| Microsoft PowerPoint for Macintosh | 4.0 - 2011 |
| Microsoft PowerPoint for Windows | 3.0 - 2013 |
| Microsoft PowerPoint for Windows Slideshow | 2007 - 2013 |
| Microsoft PowerPoint for Windows Template | 2007 - 2013 |
| Novell Presentations | 3.0, 7.0 |
| OpenOffice Impress | 1.1, 3.0 |
| Oracle Open Office Impress | 3.x |
| StarOffice Impress | 5.2 - 9.0 |
| WordPerfect Presentations | 5.1 - X |
| Raster Image | Version |
|---|---|
| Adobe Photoshop | 4.0 |
| Adobe Photoshop (File ID only) | |
| Adobe Photoshop | CS1-6 |
| CALS Raster (GP4) | Type I |
| CALS Raster (GP4) | Type II |
| Computer Graphics Metafile | ANSI |
| Computer Graphics Metafile | CALS |
| Computer Graphics Metafile | NIST |
| Encapsulated PostScript (EPS) | TIFF header Only |
| GEM Image (Bitmap) | |
| Graphics Interchange Format (GIF) | |
| IBM Graphics Data Format (GDF) | 1.0 |
| IBM Picture Interchange Format | 1.0 |
| JBIG2 | Graphic Embeddings in PDF |
| JFIF (JPEG not in TIFF format) | |
| JPEG | |
| JPEG 2000 | JP2 |
| Kodak Flash Pix | |
| Kodak Photo CD | 1.0 |
| Lotus PIC | |
| Lotus Snapshot | |
| Macintosh PICT | BMP only |
| Macintosh PICT2 | BMP only |
| MacPaint | |
| Microsoft Windows Bitmap | |
| Microsoft Windows Cursor | |
| Microsoft Windows Icon | |
| OS/2 Bitmap | |
| OS/2 Warp Bitmap | |
| Paint Shop Pro (Win32 only) | 5.0, 6.0 |
| PC Paintbrush (PCX) | |
| PC Paintbrush DCX (multi-page PCX) | |
| Portable Bitmap (PBM) | |
| Portable Graymap PGM | |
| Portable Network Graphics (PNG) | |
| Portable Pixmap (PPM) | |
| Progressive JPEG | |
| StarOffice Draw | 6.x - 9.0 |
| Sun Raster | |
| TIFF | Group 5 & 6 |
| TIFF CCITT | Group 3 & 4 |
| TruVision TGA (Targa) | 2.0 |
| Word Perfect Graphics | 1.0 |
| WBMP wireless graphics format | |
| X-Windows Bitmap | x10 compatible |
| X-Windows Dump | x10 compatible |
| X-Windows Pixmap | x10 compatible |
| WordPerfect Graphics | 2.0 - 10.0 |
| Spreadsheet | Version |
|---|---|
| Apple iWork Numbers (text and PDF preview) | 09 |
| Enable Spreadsheet | 3.0 - 4.5 |
| First Choice SS | Through 3.0 |
| Framework SS | 3.0 |
| IBM Lotus Symphony Spreadsheets | 1.x |
| Kingsoft WPS Spreadsheets | 2010 |
| Lotus 1-2-3 | Through Millennium 9.8 |
| Lotus 1-2-3 Charts (DOS and Windows) | Through 5.0 |
| Lotus 1-2-3 for OS/2 | 2.0 |
| Microsoft Excel Charts | 2.x - 2007 |
| Microsoft Excel for Macintosh | 98 - 2011 |
| Microsoft Excel for Windows | 3.0 - 2013 |
| Microsoft Excel for Windows (text only) | 2003 XML |
| Microsoft Excel for Windows (.xlsb) | 2007 - 2013 (Binary) |
| Microsoft Works SS for DOS | 2.0 |
| Microsoft Works SS for Macintosh | 2.0 |
| Microsoft Works SS for Windows | 3.0, 4.0 |
| Multiplan | 4.0 |
| Novell PerfectWorks Spreadsheet | 2.0 |
| OpenOffice Calc | 1.1 - 3.0 |
| Oracle Open Office Calc | 3.x |
| PFS: Plan | 1.0 |
| QuattroPro for DOS | Through 5.0 |
| QuattroPro for Windows | Through X6 |
| SmartWare Spreadsheet | |
| SmartWare II SS | 1.02 |
| StarOffice Calc | 5.2 - 9.0 |
| SuperCalc | 5.0 |
| Symphony | Through 2.0 |
| VP-Planner | 1.0 |
| Text & Markup | Version |
|---|---|
| ANSI Text | 7 & 8 bit |
| ASCII Text | 7 & 8 bit |
| DOS character set | |
| EBCDIC | |
| HTML (CSS rendering not supported) | 1.0 - 5.0 |
| IBM DCA/RFT | |
| Macintosh character set | |
| Rich Text Format (RTF) | |
| Unicode Text | 3.0 , 4.0 |
| UTF-8 | |
| Wireless Markup Language | |
| XML (text only) | |
| XHTML (file ID only) | 1.0 |
| Vector Image | Version |
|---|---|
| Adobe Illustrator | 4.0 - 7.0 |
| Adobe Illustrator (PDF Preview only) | 9.0, CS1-6 |
| Adobe Illustrator XMP | CS1-6 |
| Adobe InDesign XMP | CS1-6 |
| Adobe InDesign Interchange (XMP only) | |
| Adobe PDF | 1.0 - 1.7 (Acrobat 1 - 10) |
| Adobe PDF Package | 1.7 (Acrobat 8 - 10) |
| Adobe PDF Portfolio | 1.7 (Acrobat 8 - 10) |
| Ami Draw | SDW |
| AutoCAD Drawing | 2.5, 2.6 |
| AutoCAD Drawing | 9.0 - 14.0 |
| AutoCAD Drawing | 2000i - 2013 |
| AutoShade Rendering | 2 |
| Corel Draw | 2.0 - 9.0 |
| Corel Draw Clipart | 5.0, 7.0 |
| Enhanced Metafile (EMF) | |
| Escher graphics | |
| FrameMaker Graphics (FMV) | 3.0 - 5.0 |
| Gem File (Vector) | |
| Harvard Graphics Chart DOS | 2.0 - 3.0 |
| Harvard Graphics for Windows | |
| HP Graphics Language | 2.0 |
| IGES Drawing | 5.1 - 5.3 |
| Micrografx Designer | through 3.1 |
| Micrografx Designer | 6.0 |
| Micrografx Draw | through 4.0 |
| Microsoft XPS (Text only) | |
| Novell PerfectWorks Draw | 2 |
| OpenOffice Draw | 1.1 - 3.0 |
| Oracle Open Office Draw | 3.x |
| Visio (Page Preview mode WMF/EMF) | 4.0 |
| Visio | 5.0 - 2010 |
| Visio (text only) | 2013 |
| Visio XML VSX (File ID only) | 2007 |
| Windows Metafile | |
| Word Processing | Version |
|---|---|
| Adobe FrameMaker (MIF only) | 3.0 - 6.0 |
| Adobe Illustrator Postscript | Level 2 |
| Ami | |
| Ami Pro for OS2 | |
| Ami Pro for Windows | 2.0, 3.0 |
| Apple iWork Pages (text and PDF preview) | 09 |
| DEC DX | Through 4.0 |
| DEC DX Plus | 4.0, 4.1 |
| Enable Word Processor | 3.0 - 4.5 |
| First Choice WP | 1.0, 3.0 |
| Framework WP | 3.0 |
| Hangul | 97 - 2007 |
| IBM DCA/FFT | |
| IBM DisplayWrite | 2.0 - 5.0 |
| IBM Writing Assistant | 1.01 |
| Ichitaro | 5.0, 6.0, 8.0 - 13.0, 2004, 2010, 2013 |
| JustWrite | Through 3.0 |
| Kingsoft WPS Writer | 2010 |
| Legacy | 1.1 |
| Lotus Manuscript | Through 2.0 |
| Lotus WordPro (text only) | 9.7, 96 - Millennium 9.8 |
| MacWrite II | 1.1 |
| Mass 11 | through 8.0 |
| Microsoft Publisher (File ID only) | 2003 - 2007 |
| Microsoft Word for DOS | 4.0 - 6.0 |
| Microsoft Word for Macintosh | 4.0 - 6.0, 98 - 2011 |
| Microsoft Word for Windows | 1.0 - 2013 |
| Microsoft Word for Windows (text only) | 2003 XML |
| Microsoft Word for Windows | 98-J |
| Microsoft WordPad | |
| Microsoft Works WP for DOS | 2.0 |
| Microsoft Works WP for Macintosh | 2.0 |
| Microsoft Works WP for Windows | 3.0, 4.0 |
| Microsoft Write for Windows | 1.0 - 3.0 |
| MultiMate | Through 4.0 |
| MultiMate Advantage | 2.0 |
| Navy DIF | |
| Nota Bene | 3.0 |
| Novell PerfectWorks Word Processor | 2.0 |
| OfficeWriter | 4.0 - 6.0 |
| OpenOffice Writer | 1.1 - 3.0 |
| Oracle Open Office Writer | 3.x |
| PC File Doc | 5.0 |
| PFS: Write | A, B |
| Professional Write for DOS | 1.0, 2.0 |
| Professional Write Plus for Windows | 1.0 |
| Q&A Write | 2.0, 3.0 |
| Samna Word IV | 1.0 - 3.0 |
| Samna Word IV+ | |
| Samsung JungUm Global (File ID only) | |
| Signature | 1.0 |
| SmartWare II WP | 1.02 |
| Sprint | 1.0 |
| StarOffice Writer | 5.2 - 9.0 |
| Total Word | 1.2 |
| Wang IWP | Through 2.6 |
| WordMarc Composer | |
| WordMarc Composer+ | |
| WordMarc Word Processor | |
| WordPerfect for DOS | 4.2 |
| WordPerfect for Macintosh | 1.02 - 3.1 |
| WordPerfect for Windows | 5.1 - X5 |
| Wordstar 2000 for DOS | 1.0 - 3.0 |
| Wordstar 2000 for DOS | 2.0, 3.0 |
| Wordstar for DOS | 3.0 - 7.0 |
| Wordstar for Windows | 1.0 |
| XyWrite | Through III+ |
“classify.py/classify.exe” full command reference
The “options” are as follows:
| Option | Description |
|---|---|
| -h, --help | Show a help message and exit |
| -s SERVER, --server=SERVER | Server to use (defaults to local machine). As of Semaphore 4.0.48 you can specify the full Classification Server URL, if required, in this parameter (for example, if using Semaphore Cloud you can use the CS URL from the Basic API). |
| -t, --transfer | Transfer file data with request - default is to transfer file data if server is specified else use file:{full_path} in request |
| -n, --notransfer | Specify no transfer of file data - use full path to file in request |
| -p PORT, --port=PORT | Port to use (default is 5058 but default may be changed by setting CS_PORT environment variable) |
| -r, --recurse | Recurse into subdirectories |
| --limit=LIMIT | Specifies a limit on the number of files to classify (default is no limit) |
| --start=START | Specifies the start for files to classify (that is, skips the specified number of files) (default is 0) |
| --repeat=REPEAT | Classify the same documents REPEAT times (default is 1) |
| -o OUTPUTFILE, --outputfile=OUTPUTFILE | Specifies a file to write output to (default is to write to the console) |
| -f FORMAT, --format=FORMAT | Specifies the formatting applied to output (default is “XML”). Values allowed for “FORMAT” are “XML”, “CSV” or “HTML”. Use “--formatfile” to specify an XSLT transform to apply to the output when using “XSL”. |
| --formatfile=FORMATFILE | Specifies the XSLT Transform that is applied to each classification response - output is simply the concatenation of these outputs - <?xml version="1.0"?><start/> and <?xml version="1.0"?><finish/> are passed at the start and finish of bulk (or single) classifications to allow xslt to specify header and footer for output. |
| --exclude=EXCLUDEFILELIST | Excludes any files listed in EXCLUDEFILELIST file |
| --filelist=INCLUDEFILELIST | Includes files listed in INCLUDEFILELIST file |
| --record=RECORDFILE | Appends a list of those files classified to the RECORDFILE file |
| --nothreads | Don’t use separate threads for the classifications |
| --use_csreq_files | Use “csreq” files for request options if they exist |
| --only_csreq_files | Only classify files which have a “csreq” file |
| --make_csreq_files | Creates “csreq” files on disk rather than sending request to Classification Server. |
| --cf=CFSERVERS | Compare results between two servers and only output differences. CFSERVERS format is “server:port” (e.g. localhost:5058) or (as of Semaphore 4.0.48) the full URL http://localhost:5058 and multiple instances of this parameter can be specified. Note that if a “port” is not specified the port number on the server is assumed to be 5058. |
| --cfi=CFISERVERS | Interactive mode version of the “--cf” parameter which displays differences in a GUI (“winmerge” in Windows, gvimdiff in Linux - if installed). |
| -a cloud_api_key, --cloud-api-key=cloud_api_key | (Semaphore 4.0.48 and later) API Key required to authenticate in Smartlogic Cloud |
| -c url, --cloud-api-token-generation-url=url | (Semaphore 4.0.48 and later) URL used to generate API Token for Smartlogic Cloud (defaults to token) |
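
The server value accepted by the “cf” parameter can take the “server:port” form, a bare server name, or a full URL, with port 5058 assumed when none is given. The Python sketch below shows one way such a value could be normalised to a host and port. It is not taken from classify.py: `parse_server_spec` and `DEFAULT_PORT` are hypothetical names, and falling back to 5058 for a full URL that omits the port is an assumption of this sketch.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = 5058  # default Classification Server port named in the table


def parse_server_spec(spec):
    """Return (host, port) for 'server', 'server:port' or a full URL."""
    if "://" not in spec:
        # Prefix "//" so urlsplit treats the value as a network location
        # rather than as a path or a scheme.
        spec = "//" + spec
    parts = urlsplit(spec)
    # Assumption: a missing port always falls back to DEFAULT_PORT.
    return parts.hostname, parts.port or DEFAULT_PORT


if __name__ == "__main__":
    for value in ("localhost:5058", "localhost", "http://localhost:5058"):
        print(value, "->", parse_server_spec(value))
```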
Classification Server request options:
| Option | Description |
|---|---|
| --test | Set request op="TEST" (default op="CLASSIFY"). See "Test" Request for details. |
| --version | Return the version of Classification Server (return the result of a “versions” request). |
| --statistics | Return usage statistics (return the result of a “statistics” request). See "Stats" (Statistics) Request for details. |
| --singlearticle | Set request <singlearticle/> option (default is to use Classification Server default) |
| --multiarticle | Set request <multiarticle/> option (default is to use Classification Server default) |
| --stylesheet | Set request <stylesheet/> option (default is that no stylesheet applied to output) |
| --language=LANGUAGE | Set request <language> option to language given (default is to use Classification Server default) |
| --threshold=THRESHOLD | Set request <threshold> option (default is to use Classification Server default, normally configured as 48) |
| --title=TITLE | Set request <Title> data (default none, that is, use title from document) |
| --body=BODY | Set request <body> data (default none, use the body from the document) |
| --feedback | Set request <feedback/> option (default false) |
| --feedbacktextonly | Set request <feedback>TEXTONLY</feedback> option (does not mark up returned text) |
| --min_average_article_pagesize=MINAVERAGEARTICLEPAGESIZE | Set request <min_average_article_pagesize> option (default is to use Classification Server default, normally configured as 1.0) |
| --num_articles_in_singlepass=NUMARTICLESINSINGLEPASS | Set request <num_articles_in_singlepass> option (default is to use Classification Server default, normally configured as 25) |
| --char_count_cutoff=CHAR_COUNT_CUTOFF | Set request <char_count_cutoff> option (default is to use Classification Server default, normally configured as 0). If 0, then there is no cut-off. |
| --meta=META | Specifies meta data to include in request - use 'X=Y' which will translate to <META name='X' value='Y'/> in the request |
| --debug=DEBUG | Set request <debug> option (default none). |
| --use_generated_keys | Set request <use_generated_keys/> option (default none). |
| --legacy | Set request <legacy/> option (default none). Note CSV and HTML format outputs will fail with legacy format responses. |
| --classify_large_files | Classifies files more than 50 MB in size. |
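
The “meta” option translates an 'X=Y' pair into a META element in the request. As a rough illustration of that translation, the Python sketch below builds the element from such a pair. `meta_element` is a hypothetical helper, not code from classify.py; splitting at the first “=” and escaping the values with the standard library are assumptions of this sketch. The standard library emits double-quoted attributes, which are equivalent in XML to the single-quoted form shown in the table.

```python
from xml.sax.saxutils import quoteattr


def meta_element(pair):
    """Translate an 'X=Y' pair into a <META name=... value=.../> element."""
    # Assumption: the name ends at the first "=", so values may contain "=".
    name, separator, value = pair.partition("=")
    if not separator or not name:
        raise ValueError("expected NAME=VALUE, got %r" % pair)
    # quoteattr escapes XML special characters and adds the quotes.
    return "<META name=%s value=%s/>" % (quoteattr(name), quoteattr(value))


if __name__ == "__main__":
    print(meta_element("X=Y"))
    print(meta_element("department=R&D"))
```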
XML File Submission
Rules written for XML files submitted for classification (directly in rules or via the rulebase templates) can specifically target XML elements present in the file. This is not to be confused with an XML request submitted to Classification Server; here the request references, or includes, a separate XML file for classification. For example, if the following XML file is referenced in a classification request:
<?xml version="1.0" encoding="UTF-8"?>
<book>
<author>A B Smith</author>
<bio_info url="http://www.example.com/absmith">
<born>1/1/1970</born>
<facebook>absmith</facebook>
</bio_info>
<text>Some text here</text>
<more_text>Some title text</more_text>
</book>
Classification Server supports the limited use of “XPath”-style references to the fields (see field for further details) present in the XML document so, in this case, we could reference “book/text” in the “field” element of the rulebase to target the “Some text here” text. For example:
<phrase field="book/text" data="Some text here" />
To allow normal rulebases, which almost always use “body” restricted rules, to work without modification, the actual XML document elements are placed within the “body” section of the request. So, in our example, the same “Some text here” information can also be accessed using “body/book/text”.
For convenient usage of any style of XML document the attribute values are also addressable using the XPath convention of prefixing “@” to refer to an attribute rather than an element name. From the above example:
<text field="bio_info/@url" data="http://www.example.com/absmith" />
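The field references used so far (“book/text”, “body/book/text” and “bio_info/@url”) can be illustrated with a short Python sketch run against the example document. This is not how Classification Server resolves fields internally: `walk` and `resolve_field` are hypothetical helpers, and matching a reference against the end of an element's path is an assumption made so that all three references resolve.

```python
import xml.etree.ElementTree as ET

# The example document from this section.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<book>
  <author>A B Smith</author>
  <bio_info url="http://www.example.com/absmith">
    <born>1/1/1970</born>
    <facebook>absmith</facebook>
  </bio_info>
  <text>Some text here</text>
  <more_text>Some title text</more_text>
</book>"""


def walk(elem, path=()):
    """Yield (path, element) for elem and every descendant."""
    here = path + (elem.tag,)
    yield here, elem
    for child in elem:
        yield from walk(child, here)


def resolve_field(root, field):
    """Return the values addressed by a field reference such as 'book/text'."""
    parts = [p for p in field.split("/") if p]
    if parts and parts[0] == "body":
        parts = parts[1:]  # document elements sit inside the "body" section
    attr = None
    if parts and parts[-1].startswith("@"):
        attr = parts.pop()[1:]  # "@url" refers to the attribute "url"
    if not parts and attr is None:
        # A bare "body" reference addresses the whole document text.
        return [" ".join(t.strip() for t in root.itertext() if t.strip())]
    wanted = tuple(parts)
    values = []
    for path, elem in walk(root):
        # Assumption: a reference matches when it equals the tail of the path.
        if path[len(path) - len(wanted):] != wanted:
            continue
        if attr is None:
            values.append((elem.text or "").strip())
        elif attr in elem.attrib:
            values.append(elem.attrib[attr])
    return values


if __name__ == "__main__":
    root = ET.fromstring(DOC)
    for field in ("book/text", "body/book/text", "bio_info/@url"):
        print(field, "->", resolve_field(root, field))
```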
For the scoping of the search, attribute values are treated exactly as if they were child elements and so the values of attributes are selected for the scope of a search as well as any child elements. For example:
<phrase field="person" data="Fred Bloggs" />
will find a match in
<person name="Fred Bloggs" />
or
<person>
<name>Fred Bloggs</name>
</person>
In the first case, however, the field in which the name is found is actually “@name” rather than “name”, which allows you to distinguish between the two cases if required.
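The scoping behaviour for the “person” example can be sketched in Python as follows. The helpers `scoped_values` and `find_phrase` are hypothetical and only mimic the behaviour described here (attribute values searched alongside child elements, with attributes reported under an “@” prefix); plain substring matching stands in for the real phrase matching, which is an assumption.

```python
import xml.etree.ElementTree as ET


def scoped_values(elem, prefix=""):
    """Yield (sub_field, text) pairs that fall within the scope of elem."""
    text = (elem.text or "").strip()
    if text:
        yield prefix.rstrip("/"), text
    # Attribute values are treated like child elements, prefixed with "@".
    for name, value in elem.attrib.items():
        yield prefix + "@" + name, value
    for child in elem:
        yield from scoped_values(child, prefix + child.tag + "/")


def find_phrase(xml_text, field, phrase):
    """Return the sub-fields of each `field` element that contain phrase."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter(field):  # iter() includes root itself if it matches
        for sub_field, text in scoped_values(elem):
            if phrase in text:
                hits.append(sub_field)
    return hits


if __name__ == "__main__":
    as_attribute = '<person name="Fred Bloggs" />'
    as_element = "<person><name>Fred Bloggs</name></person>"
    print(find_phrase(as_attribute, "person", "Fred Bloggs"))
    print(find_phrase(as_element, "person", "Fred Bloggs"))
```

Both documents produce a match within the “person” scope, and the reported sub-field shows whether the value came from an attribute or from a child element.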
Note: Because of this functionality, an element named “body” in the XML file could cause confusion: “body” specified as a “field” value could refer either to the entire XML document or to the XML “body” element itself. In this case Classification Server references the entire file, not the specific “body” element within it.