Calling Classification and Language Service

Last Updated: May 13, 2026

Requests are submitted to Classification and Language Service either by:

Submitting an HTTP POST with a multipart/form-data formatted request.
Submitting an HTTP POST with an XML formatted request.

Each of these request types is discussed in the following sections. The first request type, HTTP POST multipart/form-data, is the simplest and most standard approach.

All requests are submitted via HTTP to the server on which Classification and Language Service is installed and via the port specified as part of the configuration (for example: “server-name:5058”).

HTTP POST with Multipart Form Data Formatted Requests

Requests can be submitted as a “multipart/form-data” POST HTTP request to Classification and Language Service by sending the content and options as HTTP POST fields including “title”, “threshold”, “body” or “UploadFile”. Use the “UploadFile” field to post any type of document, text or binary. Use “body” to post small amounts of text content. Most programming languages provide standard HTTP libraries to perform HTTP POST requests.

For example, using curl, a text document named mydocument.txt can be classified with a specified title as follows:

curl -v \
  -F "title=Dumped Cars" \
  -F "UploadFile=@mydocument.txt" \
  http://localhost:5058

The HTTP request sent to Classification and Language Service is as follows:

Content-Type: multipart/form-data; boundary=---------------------------184228830106
Content-Length: 1129
-----------------------------184228830106
Content-Disposition: form-data; name="title"

Dumped Cars
-----------------------------184228830106
Content-Disposition: form-data; name="UploadFile"; filename="mydocument.txt"
Content-Type: text/plain

Here are the cars that were dumped... 
-----------------------------184228830106--

In the following example the script sends the body directly to be classified with the threshold lowered to 20:

curl -v \
  -F "title=Dumped Cars" \
  -F "body=This is a document about dumped cars." \
  -F "threshold=20" \
  http://localhost:5058
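
The same request can also be built without curl. The following is a minimal Python sketch using only the standard library. The helper name encode_multipart and the random boundary are illustrative assumptions, not part of the service; the field names are the ones used in the examples above. The sketch only builds the request and does not send it.

```python
import uuid


def encode_multipart(fields, files=None):
    """Return (content_type, body) for a multipart/form-data POST.

    fields: dict mapping form field name -> text value
    files:  dict mapping form field name -> (filename, data bytes, MIME type)
    """
    boundary = uuid.uuid4().hex  # random boundary, unlikely to occur in the data
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode("utf-8")
            + str(value).encode("utf-8") + b"\r\n"
        )
    for name, (filename, data, mime_type) in (files or {}).items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f'Content-Type: {mime_type}\r\n\r\n'.encode("utf-8")
            + data + b"\r\n"
        )
    # The closing delimiter is the boundary followed by two hyphens.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)


content_type, body = encode_multipart(
    {"title": "Dumped Cars",
     "body": "This is a document about dumped cars.",
     "threshold": "20"}
)
print(content_type)
print(body.decode("utf-8"))
```

To send the request, the body and the Content-Type header can be passed to urllib.request.Request("http://localhost:5058", data=body, headers={"Content-Type": content_type}) and then to urllib.request.urlopen, assuming the service is listening on that host and port.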

The relevant names of all the possible form-data parameters are as follows:

Form Field Name	Notes
title	The title of the document specified explicitly
path	A URL to the document to be classified. Classification Server must be able to download the document with this URL
body	The text body of the document. Not used if path specified, or UploadFile used.
type	The type of content, TEXT, HTML, XML, etc.
clustering_type	When using multi-article mode, the algorithm used to calculate overall classifications to the overall set.
clustering_threshold	The threshold after algorithm calculation to promote a META element to the overall set.
threshold	The classification threshold, from 1 to 99
language	If not using a language pack then this should be a valid language code, e.g. “en”, “en1”, “fr”, “it”, “de”, “nl”, “pt”, “da”, “no”, or “sy”. If left blank then the default system language will be used. If using Classification and Language Service with a language pack then the code provided will need to be a valid ISO639-1, ISO639-2 (T or B) code, the ISO language name or native name (see List_of_ISO_639-1_codes for details). Further, if using a language pack you can leave this value blank and the system will attempt to guess the language of the document (as a best fit from the installed language packs).
debug
operation	Almost always “CLASSIFY”
singlearticle	True or not set
multiarticle	True or not set
feedback	May optionally contain “TEXTONLY” as value.
meta_<NAME>	Specifies a meta value note.
UploadFile	Used to post files of any kind, binary or text
XML_INPUT	XML document in Classification Server DTD
From Semaphore 5.10.0
publish_set_name_list	a pipe separated list of publish sets for which results should be returned (default all)
publish_set	the one publish set for which results should be returned (default all)
override_type	Where a file is being uploaded and the Classification Server is not identifying the type as expected, override_type can be used to force a particular type. The most common usage would be where “XML” documents are presented without the <?xml version="1.0" encoding="UTF-8"?> header. Technically these are not valid XML documents, but XML parsing can be forced by setting override_type="XML (1150)".
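
A client may want to check a few of these fields before posting. The sketch below is one possible approach and not part of the product: the helper name build_form_fields and the publish set names “Topics” and “Places” are hypothetical, while the 1 to 99 threshold range and the pipe-separated publish_set_name_list format come from the table above.

```python
def build_form_fields(body=None, title=None, threshold=None, publish_sets=None):
    """Assemble a dict of form fields, validating the documented constraints."""
    fields = {}
    if title is not None:
        fields["title"] = title
    if body is not None:
        fields["body"] = body
    if threshold is not None:
        threshold = int(threshold)
        # The table documents the classification threshold as 1 to 99.
        if not 1 <= threshold <= 99:
            raise ValueError("threshold must be between 1 and 99")
        fields["threshold"] = str(threshold)
    if publish_sets:
        # publish_set_name_list is a pipe-separated list of publish sets.
        fields["publish_set_name_list"] = "|".join(publish_sets)
    return fields


fields = build_form_fields(
    body="This is a document about dumped cars.",
    title="Dumped Cars",
    threshold=20,
    publish_sets=["Topics", "Places"],  # hypothetical publish set names
)
print(fields)
```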

XML Requests

The full syntax (DTD) for the XML submitted to Classification and Language Service is found in the “XML DTD” Appendix.

Note: If submitting as an HTTP GET request, the XML must be passed as the URI-encoded value of the “XML_INPUT” CGI parameter.

XML “Classify” Requests

To submit a document for classification, set the “op” attribute of the “request” element to “CLASSIFY” in the XML (XML_INPUT) request, for example:

<?xml version="1.0" ?>
<request op="CLASSIFY">
  <document>
    <title>Dumped Cars</title>
    <path>http://localhost/dumped_cars.txt</path>
  </document>
</request>

This request submits the document found at http://localhost/dumped_cars.txt for classification and overrides any title present in that document with “Dumped Cars”.
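
Requests of this shape can be generated programmatically rather than written by hand. The following Python sketch uses the standard library's xml.etree.ElementTree; the function name build_classify_request is an assumption, and only the elements and attributes that appear in this document's examples (title, path, body, the body type attribute and the op attribute) are produced.

```python
import xml.etree.ElementTree as ET


def build_classify_request(title=None, path=None, body=None, body_type=None):
    """Return a CLASSIFY request as an XML string."""
    request = ET.Element("request", {"op": "CLASSIFY"})
    document = ET.SubElement(request, "document")
    if title is not None:
        ET.SubElement(document, "title").text = title
    if path is not None:
        ET.SubElement(document, "path").text = path
    if body is not None:
        body_el = ET.SubElement(document, "body")
        if body_type is not None:
            body_el.set("type", body_type)
        body_el.text = body  # special characters are escaped on serialisation
    # The declaration is added by hand; encode the string as UTF-8 before sending.
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            + ET.tostring(request, encoding="unicode"))


xml_request = build_classify_request(
    title="Dumped Cars",
    path="http://localhost/dumped_cars.txt",
)
print(xml_request)
```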

The “PATH” value can specify any of the following protocols - FTP, FTPS, TFTP, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE, and LDAP with the following notes:

“FTPS” and “HTTPS” protocols are only available if an SSL implementation is available on the server.
Any “PATH” value (URL) specified must be accessible from the Classification and Language Service itself.
Classification and Language Service supports the classification of documents in many different formats (see Appendix D for a list of supported document formats).
The “FILE” protocol is, by default, disabled due to security considerations as it allows classification of any file on the Classification and Language Service file system when requested. If the FILE protocol is required then the service account Classification and Language Service executes under (by default “Local System”) should be restricted to deny access to any sensitive data stored on the server. At that point the protocol may be enabled by adjusting the configuration file to add <property name="AllowFileURLs" value="true"/> to the <bean id="acquisition" class="acquisition"> section.
If the document can be retrieved from the “PATH” specified then the “body” value is ignored. If the document cannot be retrieved then the “body” value is used instead.
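
A client can reject unsuitable “PATH” values before sending a request. This Python sketch is a client-side convenience only, not something the service requires: the function name check_path and the allow_file_urls flag are hypothetical, and the scheme list is copied from the protocols named above.

```python
from urllib.parse import urlsplit

# Protocols listed in this document as usable in a PATH value.
SUPPORTED_SCHEMES = {
    "ftp", "ftps", "tftp", "http", "https",
    "gopher", "telnet", "dict", "file", "ldap",
}


def check_path(url, allow_file_urls=False):
    """Return the lower-cased scheme of url, or raise ValueError."""
    scheme = urlsplit(url).scheme.lower()
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported protocol: {scheme!r}")
    # FILE is disabled by default on the server, so mirror that here.
    if scheme == "file" and not allow_file_urls:
        raise ValueError("FILE URLs are disabled unless explicitly allowed")
    return scheme


print(check_path("http://localhost/dumped_cars.txt"))
```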

Example 1 - Plain Text Content

The content is specified in the body tag.

Note: illegal XML characters (such as ‘&’) must be replaced by entity references (e.g. &amp;).

<?xml version="1.0" encoding="iso-8859-1" ?>
<request op="CLASSIFY">
<document>
    <title>Bonfires</title>
    <body>Bonfires have traditionally been used by gardeners to dispose of unwanted garden detritus. However, they can also cause health and safety problems, especially if situated near a public highway. The environmental impact is also a consideration. A garden's waste should where possible be composted - many local authorities provide free or subsidised composters.

Bonfires &amp; Fireworks

Bonfires on November 5th. Members of the public are strongly advised not to host their own bonfire parties, and instead attend officially organised bonfire and firework displays. The number of injuries from bonfires and fireworks each year is considerable and… etc.</body>
</document>
</request>
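
When the body text is assembled as a string, the entity replacement can be done with the Python standard library. This is a small sketch of that one step; the variable names are illustrative.

```python
from xml.sax.saxutils import escape

text = "Bonfires & Fireworks"

# escape() replaces &, < and > with their entity references.
body_xml = "<body>" + escape(text) + "</body>"
print(body_xml)  # <body>Bonfires &amp; Fireworks</body>
```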

Example 2 - HTML Content

The document type is HTML and the entire HTML document is given in the body tag. No title has been specified as this will be extracted from the title tag in the HTML document.

Again, illegal XML characters need to be avoided. Characters such as ampersand which appear as entities in HTML (&amp;) would therefore need to be double-encoded (&amp;amp;). Since this double encoding is rather painful to do manually, it is often easier (but not required) to use the XML CDATA mechanism. This means enclosing the HTML data in the <body> tag between <![CDATA[ and ]]>. This normally simplifies the task when creating a request manually:

<?xml version="1.0" encoding="iso-8859-1" ?>
<request op="CLASSIFY">
<document>
    <body type="HTML"><![CDATA[<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
   <title>Bonfires</title>
</head>
<body>
<p>Bonfires have traditionally been used by gardeners to dispose of unwanted garden detritus. However, they can also cause health and safety problems, especially if situated near a public highway. The environmental impact is also a consideration. A garden's waste should where possible be
composted - many local authorities provide free or subsidised composters.</p>

<img src='composter.gif' alt='Composter' height='30' width='20' />

<p><h2>Bonfires &amp; Fireworks</h2>

Bonfires on November 5th. Members of the public are strongly advised not to host their own bonfire parties, and instead attend officially organised bonfire and firework displays. The number of injuries from bonfires and fireworks each year is considerable and...

<p>Further information can be found at the <a href="http://en.wikipedia.org/wiki/bonfire">Wikipedia site</a>.

</p>
</body>
</html>]]></body>
</document>
</request>

Note: Passing HTML documents within the XML request adds complexity around character encoding, because two encodings are involved: the encoding used by the XML request (which defaults to UTF-8 if not specified) and the encoding used by the HTML document. If the HTML encoding is not specified, the default depends on the version of HTML. Unfortunately, before HTML5 the default was the 8-bit ASCII encoding configured by the host operating system. This default very rarely makes sense these days, so to avoid encoding issues either specify the encoding within the HTML document itself (preferably as UTF-8) or use HTML5, which has a sensible default for modern data.
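
The CDATA approach used in Example 2 can be automated. The Python sketch below wraps HTML in a CDATA section and handles the one sequence that cannot appear inside CDATA, the terminator ]]>, by splitting it across two sections. The helper name wrap_cdata is an assumption; the round-trip check uses the standard library XML parser, not the service.

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text):
    """Wrap text in CDATA, splitting any embedded ']]>' terminator."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


html = (
    "<html><head><title>Bonfires</title></head>"
    "<body><h2>Bonfires &amp; Fireworks</h2></body></html>"
)
body_xml = '<body type="HTML">' + wrap_cdata(html) + "</body>"

# Parsing the element returns the HTML unchanged, with no double encoding.
recovered = ET.fromstring(body_xml).text
print(recovered == html)
```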

Example 3 - HTML Content (Remote)

The content is specified with a URL:

<?xml version="1.0" ?>
<request op="CLASSIFY">
<document>
   <path>http://en.wikipedia.org/wiki/Bonfire</path>
</document>
</request>

Example 4 - Binary Content (Remote)

The binary (non-text) content is specified with a URL:

<?xml version="1.0" ?>
<request op="CLASSIFY">
<document>
   <path>http://localhost/dumped%20cars.doc</path>
</document>
</request>

Note: The URL must be correctly (%) encoded.
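
Percent-encoding can be applied with the Python standard library. The sketch below encodes only the path component of a URL; the function name encode_path_url is illustrative, and the input is assumed not to be encoded already (an existing % would itself be encoded as %25).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_path_url(url):
    """Percent-encode the path component of url, leaving the rest unchanged."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path)))


print(encode_path_url("http://localhost/dumped cars.doc"))
# http://localhost/dumped%20cars.doc
```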

Example 5 - Binary Content (Local)

The local binary (non-text) content is specified with a “file” URL:

<?xml version="1.0" ?>
<request op="CLASSIFY">
<document>
   <path>file://c:/temp/dumped%20cars.doc</path>
</document>
</request>

Note: The URL must be correctly (%) encoded.

Here is a table of equivalence between HTTP POST form field names and equivalent XML elements in the XML_INPUT request format:

Form Field Name	Equivalent XML Fragment	Notes
title	<title>VALUE</title>
path	<path>VALUE</path>
body	<body>VALUE</body>
type	<body type="VALUE">…
clustering_type	<clustering type="VALUE">
clustering_threshold	<clustering threshold="VALUE">
threshold	<threshold>VALUE</threshold>
language	<language>VALUE</language>	If not using a language pack then this should be a valid language code, e.g. “en”, “en1”, “fr”, “it”, “de”, “nl”, “pt”, “da”, “no”, or “sy”. If left blank then the default system language will be used. If using Classification and Language Service with a language pack then the code provided will need to be a valid ISO639-1, ISO639-2 (T or B) code, the ISO language name or native name (see List_of_ISO_639-1_codes for details). Further, if using a language pack you can leave this value blank and the system will attempt to guess the language of the document (as a best fit from the installed language packs).
debug	<debug>VALUE</debug>
operation	<request op="VALUE">
legacy	<legacy/>
singlearticle	<singlearticle/>
multiarticle	<multiarticle/>
feedback	<feedback/>	May optionally contain “TEXTONLY” as value.
meta_<NAME>	<META name="<NAME>" value="VALUE"…	Specifies a meta value note.
UploadFile		No equivalent in XML, allows actual file data to be transferred.
XML_INPUT		No equivalent in XML, “VALUE” should be a valid XML request. Note that if this is present then the other fields in form-based submission are not used, that is, this provides a simple mechanism to submit an XML request via an HTTP POST.

CLASSIFY Request Description

The Classification operation is the most common operation. Its purpose is to ask Classification and Language Service to classify the supplied document. Classification and Language Service will:

Acquire the document (document data may be included in the request or specified as a URI)
Determine the document format used and extract all the relevant text data
Optionally split the text into structurally determined articles.
Lexically analyse the text to determine valid sentences, paragraphs, etc. (and optionally what language is being used)
Search the tokenised text for any text matches specified in the rulenet (including stem-based matching when required)
Evaluate the rulenet. The rules form a tree (or rather a series of trees) with the text rules as leaf nodes; evaluation is the process of “bubbling” the scores up from the text leaf nodes until the top of the tree is reached. For example, a sentence rule will bubble up if all its child text rules have at least one occurrence in the same sentence
Send a response describing the results
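
The sentence-rule example in the evaluation step can be illustrated with a toy model. The Python sketch below is not the service's implementation: the function name, the naive sentence splitting and the exact-word matching (no stemming, no scores) are all assumptions made purely for illustration.

```python
import re


def sentence_rule_fires(text, terms):
    """Return True if every term occurs together in at least one sentence."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for sentence in sentences:
        words = set(re.findall(r"[a-z0-9']+", sentence.lower()))
        if all(term.lower() in words for term in terms):
            return True
    return False


text = "Cars were dumped near the road. The council removed them."
print(sentence_rule_fires(text, ["dumped", "cars"]))   # True: same sentence
print(sentence_rule_fires(text, ["cars", "council"]))  # False: different sentences
```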

Elements used in example

Element or Attribute	Values	Purpose	Notes
REQUEST		The parent node for the XML	Other than the operation (“OP”) this element may have a document child when applicable to the operation
REQUEST “OP”	CLASSIFY	Calculates the scores for the document using the current rulenet	This is the most common operation
DOCUMENT		All other elements are children of this	Currently, only one document node may be used per request but the original idea was to allow multiple documents within a single request
BODY		Optionally provides the body of the document	If PATH is not specified or the document defined by PATH cannot be fetched then the contents of this element are used to classify. If no BODY is specified and PATH cannot be fetched then an error is generated. This allows the requesting process to decide how Classification and Language Service should handle network-related errors

Simple Examples

A very simple form of this request is:

<?xml version="1.0" encoding="UTF-8" ?>
<request op="classify">
  <document>
   <body>Text for classification</body>
  </document>
</request>

This asks Classification and Language Service to classify a document consisting of only the text “Text for classification”.

<?xml version="1.0" encoding="UTF-8" ?>
<request op="classify">
  <document>
     <path>http://www.example.com/files/document1.doc</path>
  </document>
</request>

This asks Classification and Language Service to classify a document retrieved via HTTP from the specified server.

Using the OP attribute via the equivalent query parameter

When using query parameters, this attribute is named OPERATION rather than OP. If it is not specified, “CLASSIFY” is assumed as the default. For example:

http://localhost:5058?body=Text for classification

is equivalent to:

http://localhost:5058?operation=classify&body=Text for classification

Specifying requests via URL-encoded fields quickly becomes painful, since URI escaping is required for characters such as “, :, \ and space, so a tool such as curl is much easier to use. For example:

curl --data-urlencode "body=Text for classification" http://localhost:5058

Because curl URI-escapes the value of each --data-urlencode field automatically (the plain -d option sends the data exactly as given), this is much easier in practice.
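
The escaping can also be done in code. A minimal Python sketch, assuming the service is reachable at the localhost:5058 address used in the examples above; it only builds the URL and does not send the request.

```python
from urllib.parse import quote, urlencode

params = {"operation": "classify", "body": "Text for classification"}

# quote_via=quote encodes spaces as %20 rather than '+'.
url = "http://localhost:5058?" + urlencode(params, quote_via=quote)
print(url)
# http://localhost:5058?operation=classify&body=Text%20for%20classification
```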

Example Response

<?xml version="1.0" encoding="UTF-8"?>

<response>
 <STRUCTUREDDOCUMENT>
  <URL>../tmp/1283437589_16fc</URL>
  <META name="Type" value="TEXT"/>
  <SYSTEM name="Template" value="default"/>
  <ARTICLE>
  </ARTICLE>
 </STRUCTUREDDOCUMENT>
</response>

Since this is a success response, it contains a STRUCTUREDDOCUMENT element. The URL will differ for every classification, since it is simply where the body text was written on the Classification and Language Service server. The “META” “Type” has the value “TEXT” because Classification and Language Service determined that the body was plain text. The “SYSTEM” “Template” has the value “default”, which shows which splitting template was used to process the document. In this case, no classifications were found at either the article level or the document level.

Note that far more information may be provided in the response - the type (and level of detail) of the information returned may be altered by various request parameters - see the specific parameters for details.

Results returned from Classification and Language Service

On receiving a “classify” request Classification and Language Service returns something like the following (the full DTD for the XML returned is found in the Appendix).

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <STRUCTUREDDOCUMENT>
  <URL>../tmp/Accessing Public Transport.pdf</URL>
  <META name="Number Of Pages" value="4"/>
  <META name="Type" value="PDF"/>
  <SYSTEM name="WordFinder Version" value="4"/>
  <SYSTEM name="Secure PDF" value="No"/>
  <SYSTEM name="Template" value="default"/>
  <META name="IPSV" value="Buses" id="OMITERMO2830" score="0.98"/>
  <META name="IPSV" value="Disability" id="OMITERMO3471" score="0.85"/>
  <META name="IPSV" value="Employees" id="OMITERMO1001" score="0.65"/>
  <META name="IPSV" value="Information and communication technology" id="OMITERMO822" score="0.50"/>
  <META name="IPSV" value="Public transport" id="OMITERMO1374" score="0.90"/>
  <META name="IPSV" value="Rail transport" id="OMITERMO943" score="0.50"/>
  <META name="IPSV" value="Transport and infrastructure" id="OMITERMO521" score="0.53"/>
  <META name="IPSV" value="Vehicles" id="OMITERMO1406" score="0.79"/>
  <META name="IPSV" value="Wheelchairs" id="OMITERMO3551" score="0.60"/>
  <ARTICLE>
   <SYSTEM name="DeterminedLanguage" value="english"/>
   <SYSTEM name="LanguageGuessed" value="yes but not fully confident"/>
   <META name="IPSV" value="Buses" id="OMITERMO2830" score="0.98"/>
   <META name="IPSV" value="Disability" id="OMITERMO3471" score="0.85"/>
   <META name="IPSV" value="Employees" id="OMITERMO1001" score="0.65"/>
   <META name="IPSV" value="Information and communication technology" id="OMITERMO822" score="0.50"/>
   <META name="IPSV" value="Public transport" id="OMITERMO1374" score="0.90"/>
   <META name="IPSV" value="Rail transport" id="OMITERMO943" score="0.50"/>
   <META name="IPSV" value="Transport and infrastructure" id="OMITERMO521" score="0.53"/>
   <META name="IPSV" value="Vehicles" id="OMITERMO1406" score="0.79"/>
   <META name="IPSV" value="Wheelchairs" id="OMITERMO3551" score="0.60"/>
  </ARTICLE>
 </STRUCTUREDDOCUMENT>
</response>

Note that classification information is returned at both the “article” level and the document level. “Articles” are sections of the document specified in the request that are classified separately (how articles are determined is controlled by the “document splitting” functionality of Classification and Language Service, the discussion of which is beyond the scope of this document). The document-level classification information is a consolidated version of the classifications returned for all sections.

In this case, we are only interested in “IPSV” classifications (for example), so we would look specifically for the <META name="IPSV" …> elements returned from Classification and Language Service; the other classifications can be ignored.
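
Extracting the document-level classifications from a response can be sketched in Python as follows. The function name document_level_metas is an assumption, and the sample is a trimmed copy of the response shown above; only direct children of STRUCTUREDDOCUMENT are read, so the article-level entries are skipped.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example response shown above.
RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<response>
 <STRUCTUREDDOCUMENT>
  <META name="Type" value="PDF"/>
  <META name="IPSV" value="Public transport" id="OMITERMO1374" score="0.90"/>
  <META name="IPSV" value="Buses" id="OMITERMO2830" score="0.98"/>
  <ARTICLE>
   <META name="IPSV" value="Buses" id="OMITERMO2830" score="0.98"/>
  </ARTICLE>
 </STRUCTUREDDOCUMENT>
</response>"""


def document_level_metas(response_xml, name):
    """Return (value, score) pairs for document-level META elements, best first."""
    root = ET.fromstring(response_xml)
    document = root.find("STRUCTUREDDOCUMENT")
    if document is None:
        return []
    results = [
        (meta.get("value"), float(meta.get("score")))
        for meta in document.findall("META")  # direct children only
        if meta.get("name") == name and meta.get("score") is not None
    ]
    return sorted(results, key=lambda pair: pair[1], reverse=True)


for value, score in document_level_metas(RESPONSE, "IPSV"):
    print(value, score)
```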
