Configure the Classification and Language Service (CLS)

Last Updated: May 13, 2026

Introduction

This document briefly describes how to configure Classification Server.

The configuration settings are stored in the Classification Server conf/config.xml file. On Windows this is generally C:\Program Files\Smartlogic\Classification Server\conf\config.xml, and on Linux it is /etc/opt/semaphore/CS/config.xml. Updating this file alters the way Classification Server operates. After updating the file, restart Classification Server so that the changes take effect.
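
As a small illustration, a script that needs to find the configuration file could select the default location per platform, as in the Python sketch below. The function name default_config_path is hypothetical, the paths are the defaults quoted above, and an installation in a non-default directory will use a different path.

```python
import platform
from pathlib import PurePosixPath, PureWindowsPath


def default_config_path(system=None):
    """Return the default config.xml location quoted in this guide.

    system: "Windows" or "Linux"; defaults to the current platform.
    """
    system = system or platform.system()
    if system == "Windows":
        # "Generally" the Windows location; custom installs may differ.
        return PureWindowsPath(
            r"C:\Program Files\Smartlogic\Classification Server\conf\config.xml"
        )
    if system == "Linux":
        return PurePosixPath("/etc/opt/semaphore/CS/config.xml")
    raise ValueError(f"No default location documented for {system!r}")


print(default_config_path("Linux"))
```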

Audience

This guide is intended for system administrators who will be responsible for the configuration of Semaphore Classification Server.

Document parsing options

The acquisition bean defined in the config.xml file determines how the textual content of an incoming document is obtained. One of the stages of this acquisition is the parsing of binary document formats. The following parsing options are supported:

ORACLE_CONTENT_ACCESS (default): This parser is ideal for rapidly extracting data from many different document formats. However, it cannot extract table data, so rules that rely on table structure cannot be evaluated. Elements within tables are presented as simple text.
ORACLE_XML_EXPORT: This parser is capable of handling table data from any document format other than PDF. However, it offers slower performance than the content access method, and, therefore, it should only be used if table data is needed.

Note: If table data is required from PDFs, PDFAlchemist can be defined as an external parser. The PDFAlchemist executable is included with Classification Server and can be enabled in the configuration file. Using PDFAlchemist as an external parser can significantly increase the time taken to process a document.

HYLAND (available for Semaphore 5.10.2 and later): This parser processes documents in a similar way to ORACLE_XML_EXPORT and extracts many forms of table across documents. It provides out-of-the-box table extraction for most document formats, except HTML.

Note: Running the PDFAlchemist in conjunction with the Hyland parser is not supported.

Forking data to a MarkLogic database

Since Semaphore version 5.6, it is possible to tee (copy) the classification results to a MarkLogic database. Instructions on how to do this are in the file <CS Installation>/ml/read.me, which is available once CS has been installed. (The instructions are not reproduced here because they are liable to be version dependent.)

Classification Server logging

Setting	Details
Default Configuration File	<Installation Directory>\conf\config.xml
Default Log Location	<Installation Directory>\logs
Log Levels	DEBUG: most verbose logging; INFO: information logging messages; NOTICE: notification logging messages; WARN: warning messages (recommended setting for production use); ERROR: error messages; CRIT: critical (error) messages; ALERT: alerting messages (not used); FATAL: fatal (error) messages
Log File Name	crt.log: contains a log of all (successful) requests to CS; runtime.log: contains a log of all CS activity
Log File Format (crt.log)	The layout is configurable in the configuration file. The default (CSV format) is: 0,<Finish time in yyyy-MM-dd HH:mm:ssZ>,<Source IP Address>,<Operation>,<Time Taken>,<File Name>,<URL>,<Meta Original URL>,<Document Hash>,<Audit Tag>,<Scores> If an error occurs, the format is: <Error Number>,<Finish time in yyyy-MM-dd HH:mm:ssZ>,<Source IP Address>,<Operation>,<Time Taken>,<File Name>,<URL>,<Meta Original URL>,<Audit Tag>,<Error Component>,<Error Message>
Log File Format (runtime.log)	<Request date in dd Mmm YYYY HH:mm:ss.sss> [<thread>] [component] : <Message>
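
The default crt.log layout described in the table can be read with a standard CSV parser. The Python sketch below maps one line onto named fields. It assumes the default layout is in use, and that a leading 0 marks success while any other first value is an error number. The snake_case field names, the function name parse_crt_line, and the sample lines (including the operation name and score text) are illustrative and are not taken from a real log.

```python
import csv

# Column order follows the default crt.log layout described above.
SUCCESS_FIELDS = [
    "status", "finish_time", "source_ip", "operation", "time_taken",
    "file_name", "url", "meta_original_url", "document_hash",
    "audit_tag", "scores",
]
ERROR_FIELDS = [
    "error_number", "finish_time", "source_ip", "operation", "time_taken",
    "file_name", "url", "meta_original_url", "audit_tag",
    "error_component", "error_message",
]


def parse_crt_line(line):
    """Map one crt.log line onto named fields (default layout assumed)."""
    row = next(csv.reader([line]))
    names = SUCCESS_FIELDS if row[0] == "0" else ERROR_FIELDS
    if len(row) > len(names):
        # Assumption: unquoted commas can only occur in the final field,
        # so any surplus columns are folded back into it.
        last = len(names) - 1
        row = row[:last] + [",".join(row[last:])]
    return dict(zip(names, row))


# Hypothetical sample lines, for illustration only.
ok = parse_crt_line(
    '0,2026-05-13 10:15:02+0000,127.0.0.1,CLASSIFY,42,report.docx,,,abc123,batch-7,"Topic A=87"'
)
failed = parse_crt_line(
    "17,2026-05-13 10:15:03+0000,127.0.0.1,CLASSIFY,5,broken.pdf,,,batch-7,parser,Could not read, giving up"
)
print(ok["document_hash"], failed["error_component"])
```

If the layout has been customised in the configuration file, the two field lists need to be adjusted to match.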

Note: For any Classification Server failure, the “failures” directory (set in the “failure_dir” setting) contains a file with the request and, if possible, another file containing the original document. Similarly, you can configure Classification Server to save successful requests (for debugging purposes) by uncommenting the “success_dir” settings in the default configuration file.

Classification Server configuration settings

The default Classification Server configuration settings are set in the file:

<installation directory>\conf\config.xml

This XML file is read when the service starts. All settings can be modified in this file, and the file is fully commented to indicate the impact of each property (in the configuration file these appear as <property name="property-name">). Some of the key properties are listed here:

Property	Default	Description
port	5058	The port on which the classification server will listen. Set to a value suitable for your firewall and security considerations.
only_accept_ip	127.0.0.1	The IP addresses of the machines that can access and publish to the Classification Server instance. Uncomment this section and add addresses if required.
workers	2	The number of CPU cores allocated to the Classification Server instance. For example, on a quad-core server the configuration could be to allow 3 Classification Server “worker” threads, leaving one for Ontology Server on the same physical server.
threshold	48	The weighting threshold above which classifications are returned by the server. Typically this setting is not changed. Testing interfaces can override the value for the duration of the test.
singlearticle	false	The default behavior is to split content into multiple articles. If this is not required (for example, when all the content consists of single HTML pages), set this property to true.
clustering_type	RMS	“ALL”: all the categories defined for all articles are propagated at document level; “AVERAGE”: the average score (by article contribution) across all articles is recalculated for each category, and the category is propagated at document level if its standard average is above the clustering threshold; “COMMON”: only the categories common to all articles of the document are propagated at document level, with the average score over all articles (if that average is above the clustering threshold); “NONE”: disables clustering of article metadata; “RMS”: the average score (by article contribution) across all articles is recalculated for each category, and the category is propagated at document level if its root mean square average is above the clustering threshold.
FieldsToHash	.*	Sets the fields that CS uses for its hash calculation. The default is all textual fields (“.*”). To be more specific, supply a regular expression that matches the required fields. For example, given two properties, Author and ModDate, the pattern “Auth.*|Mod.*” means that CS uses only those two properties to calculate its hash.
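
To make the difference between the AVERAGE, COMMON and RMS clustering types concrete, the following Python sketch combines per-article scores into document-level scores. It is an illustration of the descriptions in the table, not the Classification Server implementation. It assumes that every article contributes equally, that an article lacking a category counts as 0 for it, and that a combined score must be strictly above the clustering threshold. ALL and NONE are left out because the table does not say how ALL scores a category. The function name, the sample scores and the threshold values are hypothetical.

```python
import math


def cluster(articles, clustering_type, clustering_threshold):
    """Combine per-article category scores into document-level scores.

    articles: list of dicts, each mapping category name -> score.
    Assumptions: equal article weights, missing category counts as 0,
    and only scores strictly above the threshold are kept.
    """
    mode = clustering_type.upper()
    if mode not in ("AVERAGE", "COMMON", "RMS"):
        raise ValueError(f"Clustering type not covered by this sketch: {mode}")
    if not articles:
        return {}
    n = len(articles)
    document_scores = {}
    for category in sorted(set().union(*articles)):
        # COMMON only considers categories present in every article.
        if mode == "COMMON" and not all(category in a for a in articles):
            continue
        scores = [a.get(category, 0.0) for a in articles]
        if mode == "RMS":
            combined = math.sqrt(sum(s * s for s in scores) / n)
        else:
            combined = sum(scores) / n
        if combined > clustering_threshold:
            document_scores[category] = combined
    return document_scores


# Hypothetical scores for a document split into three articles.
articles = [
    {"Finance": 90, "Sport": 20},
    {"Finance": 30},
    {"Finance": 30, "Sport": 10},
]
print(cluster(articles, "AVERAGE", 55))  # Finance averages 50, so nothing passes
print(cluster(articles, "RMS", 55))      # Finance RMS is about 57.4, so it passes
```

Under these assumptions the root mean square gives more weight to a single high-scoring article than the plain average does, which is why the same category can pass with RMS and fail with AVERAGE.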
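
The effect of a FieldsToHash pattern can be checked before editing the configuration. The Python sketch below uses Python's re module as a stand-in for whichever regular expression engine CS uses, and it assumes that the pattern must match the whole field name. The function name and the Title field are hypothetical. The pattern Auth.*|Mod.* is the two-property example from the table.

```python
import re


def fields_used_for_hash(field_names, pattern=".*"):
    """Return the field names that a FieldsToHash-style pattern selects.

    Assumption: the pattern has to match the complete field name.
    """
    regex = re.compile(pattern)
    return [name for name in field_names if regex.fullmatch(name)]


fields = ["Author", "ModDate", "Title"]
print(fields_used_for_hash(fields))                  # ['Author', 'ModDate', 'Title']
print(fields_used_for_hash(fields, "Auth.*|Mod.*"))  # ['Author', 'ModDate']
```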

CSTI (Classification Server Test Interface) “Fields” Setting

To configure the fields that are available on the CSTI UI, add or modify the <value> list in this bean:

<bean id="displayConfiguration" class="config_only">
  <property name="MetaFields">
    <!-- One <value> element per field shown on the CSTI UI -->
    <list>
      <value>Example</value>
    </list>
  </property>
</bean>
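
If the field list is maintained by a script rather than by hand, the bean can be edited with a standard XML library. The Python sketch below appends <value> entries to the MetaFields list of the bean shown above. It works on the bean fragment only, the function name add_meta_fields and the field name Author are hypothetical, and ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

# The bean from this section, used as input for the sketch.
BEAN = """<bean id="displayConfiguration" class="config_only">
  <property name="MetaFields">
    <list>
      <value>Example</value>
    </list>
  </property>
</bean>"""


def add_meta_fields(bean_xml, new_fields):
    """Return the bean XML with extra <value> entries in MetaFields."""
    bean = ET.fromstring(bean_xml)
    value_list = bean.find("./property[@name='MetaFields']/list")
    if value_list is None:
        raise ValueError("MetaFields list not found in bean")
    existing = {v.text for v in value_list.findall("value")}
    for field in new_fields:
        if field not in existing:  # skip fields that are already listed
            ET.SubElement(value_list, "value").text = field
            existing.add(field)
    ET.indent(bean, space="  ")  # re-indent for readability (Python 3.9+)
    return ET.tostring(bean, encoding="unicode")


updated = add_meta_fields(BEAN, ["Author", "Example"])
print(updated)
```

ElementTree's default parser discards XML comments, so running the complete, commented config.xml through this code would lose its comments. For the full file, editing by hand may be the safer option.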