Configure the Classification and Language Service (CLS)
- Last Updated: May 13, 2026
Introduction
This document briefly describes how to configure Classification Server.
The configuration settings are stored in the Classification Server conf/config.xml file (in Windows this is generally C:\Program Files\Smartlogic\Classification Server\conf\config.xml and in Linux this is /etc/opt/semaphore/CS/config.xml). Updating this file alters the way Classification Server operates. After updating it, restart Classification Server so that the changes take effect.
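As a small illustration, the two typical locations quoted above can be captured in a helper that returns the usual path for a given operating system. This is only a sketch: the function name `default_config_path` is hypothetical, and a real installation may place the file elsewhere.

```python
import platform
from pathlib import PurePosixPath, PureWindowsPath

# Typical locations quoted in this document; an actual installation may differ.
WINDOWS_CONFIG = PureWindowsPath(
    r"C:\Program Files\Smartlogic\Classification Server\conf\config.xml"
)
LINUX_CONFIG = PurePosixPath("/etc/opt/semaphore/CS/config.xml")


def default_config_path(system=None):
    """Return the usual config.xml location for the given OS name.

    `system` uses the names returned by platform.system(), for example
    "Windows" or "Linux". When omitted, the current OS is used.
    """
    system = system or platform.system()
    if system == "Windows":
        return WINDOWS_CONFIG
    if system == "Linux":
        return LINUX_CONFIG
    raise ValueError(f"No documented default location for {system!r}")


# Look up the usual Linux location explicitly.
print(default_config_path("Linux"))
```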
Audience
This guide is intended for system administrators who will be responsible for the configuration of Semaphore Classification Server.
Document parsing options
The acquisition bean defined in the config.xml file determines how the textual content of an incoming document is obtained. One of the stages of this acquisition is the parsing of binary document formats. The following parsing options are supported:
- ORACLE_CONTENT_ACCESS (default): This parser is ideal for rapidly extracting the data from many different document formats. However, it cannot extract table data, so rules that rely on table structure cannot be evaluated. Elements within tables are presented as simple text.
- ORACLE_XML_EXPORT: This parser can handle table data from any document format other than PDF. However, it is slower than the content access method and should therefore only be used if table data is needed.
  Note: If table data is required from PDFs, then PDFAlchemist can be defined as an external parser. The PDFAlchemist executable is included with the Classification Server, and it can be enabled using the configuration file. Using PDFAlchemist as an external parser can have a significant impact on the time taken to process a document.
- HYLAND (available for Semaphore 5.10.2 and later): This parser processes documents in a similar way to ORACLE_XML_EXPORT and extracts many table forms across documents. Using this method allows out-of-the-box table extraction for most document formats, except HTML.
  Note: Running PDFAlchemist in conjunction with the Hyland parser is not supported.
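The trade-offs between the three options can be summarized as a small planning aid. The sketch below is not part of Classification Server: the function `suggest_parser`, its format labels, and the preference for HYLAND over ORACLE_XML_EXPORT when both would work are all assumptions made for illustration. The parser itself is still selected in config.xml.

```python
def suggest_parser(needs_tables, doc_format, semaphore_version):
    """Suggest a parsing option based on the trade-offs described above.

    needs_tables: True if classification rules rely on table structure.
    doc_format: lowercase label such as "pdf", "html" or "docx"
        (hypothetical labels used only by this helper).
    semaphore_version: version tuple such as (5, 10, 2).
    """
    if not needs_tables:
        # Fastest option; table content is treated as simple text.
        return "ORACLE_CONTENT_ACCESS"

    # HYLAND is documented as available from Semaphore 5.10.2.
    hyland_available = semaphore_version >= (5, 10, 2)
    if hyland_available and doc_format != "html":
        # Assumption: prefer HYLAND whenever it is available and applicable.
        return "HYLAND"

    if doc_format != "pdf":
        return "ORACLE_XML_EXPORT"

    # PDF tables without HYLAND need PDFAlchemist as an external parser.
    return "ORACLE_XML_EXPORT with PDFAlchemist as external parser"


print(suggest_parser(True, "html", (5, 10, 2)))
```

Here an HTML corpus that needs table data falls back to ORACLE_XML_EXPORT, because the Hyland parser's table extraction does not cover HTML.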
Forking data to a MarkLogic database
Since Semaphore version 5.6, it is possible to tee the results from Classification to a MarkLogic database. Once CS has been installed, instructions on how to do this are available in the file <CS Installation>/ml/read.me. (Instructions are not given here because they are liable to be version dependent.)
Classification Server logging
| Setting | Details |
|---|---|
| Default Configuration File | <Installation Directory>\conf\config.xml |
| Default Log Location | <Installation Directory>\logs |
| Log Levels | DEBUG - Most verbose logging; INFO - Information logging messages; NOTICE - Notification logging messages; WARN - Warning messages (recommended setting for production use); ERROR - Error messages; CRIT - Critical (error) messages; ALERT - Alerting messages (not used); FATAL - Fatal (error) messages |
| Log File Name | crt.log - Contains a log of all (successful) requests to CS; runtime.log - Contains a log of all CS activity. |
| Log File Format (crt.log) | Layout is configurable in the configuration file. The default is as follows (CSV format): 0,<Finish time in yyyy-MM-dd HH:mm:ssZ>,<Source IP Address>,<Operation>,<Time Taken>,<File Name>,<URL>,<Meta Original URL>,<Document Hash>,<Audit Tag>,<Scores>. If there is an error, the format is as follows: <Error Number>,<Finish time in yyyy-MM-dd HH:mm:ssZ>,<Source IP Address>,<Operation>,<Time Taken>,<File Name>,<URL>,<Meta Original URL>,<Audit Tag>,<Error Component>,<Error Message> |
| Log File Format (runtime.log) | <Request date in dd Mmm YYYY HH:mm:ss.sss> [<thread>] [component] : <Message> |
Note: For any Classification Server failure, the “failures” directory (set in the “failure_dir” setting) contains a file with the request and, if possible, another file containing the original document. Similarly, you can configure Classification Server to save successful requests (for debugging purposes) by uncommenting the “success_dir” settings in the default configuration file.
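When analyzing crt.log, it can help to map each CSV line onto named fields. The following is a minimal sketch that assumes the default layout described in the table above; the field names, the function `parse_crt_line`, and the sample line (including the operation name and score format) are invented for illustration. Because the format of the scores and error message fields is not specified here, any extra commas are folded into the last field.

```python
import csv

# Field names are hypothetical labels for the default crt.log layout.
SUCCESS_FIELDS = [
    "status", "finish_time", "source_ip", "operation", "time_taken",
    "file_name", "url", "meta_original_url", "document_hash",
    "audit_tag", "scores",
]
ERROR_FIELDS = [
    "error_number", "finish_time", "source_ip", "operation", "time_taken",
    "file_name", "url", "meta_original_url", "audit_tag",
    "error_component", "error_message",
]


def parse_crt_line(line):
    """Parse one crt.log line (default layout) into a dict."""
    row = next(csv.reader([line]))
    # A leading 0 marks a successful request; anything else is an error number.
    is_success = row[0].strip() == "0"
    fields = SUCCESS_FIELDS if is_success else ERROR_FIELDS
    if len(row) < len(fields):
        raise ValueError(f"Expected at least {len(fields)} fields, got {len(row)}")
    # Fold any surplus columns back into the final field.
    head = row[: len(fields) - 1]
    tail = ",".join(row[len(fields) - 1:])
    record = dict(zip(fields, head + [tail]))
    record["ok"] = is_success
    return record


# Made-up sample line, not taken from a real log.
sample = "0,2026-05-13 10:15:00+0000,127.0.0.1,classify,120,report.docx,,,abc123,audit-1,Finance=87,Risk=52"
print(parse_crt_line(sample)["scores"])
```

If the layout has been changed in the configuration file, the two field lists need to be adjusted to match.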
Classification Server configuration settings
The (default) Classification Server configuration settings are set in the file:
<installation directory>\conf\config.xml
This XML file is read when the service starts. All settings can be modified in this file, which is fully commented to indicate the impact of each property (in the configuration file these appear as <property name="property-name">). Some of the key properties are listed here:
| Property | Default | Description |
|---|---|---|
| port | 5058 | The port on which the classification server will listen. Set to a value suitable for your firewall and security considerations. |
| only_accept_ip | 127.0.0.1 | The IP address of machines that can access and publish to the Classification Server instance. Uncomment this section and add addresses if required. |
| workers | 2 | The number of CPU cores allocated to the Classification Server instance. For example, on a quad-core server the configuration could be to allow 3 Classification Server “worker” threads, leaving one for Ontology Server on the same physical server. |
| threshold | 48 | The weighting threshold above which classifications are returned by the server. Typically this setting is not changed. Testing interfaces can override the value for the duration of the test. |
| singlearticle | false | The default behavior is to split content into multiple articles. If this is not required, for example when all the content consists of single HTML pages, set this property to true. |
| clustering_type | RMS | * “ALL” Indicates that all the categories defined for all articles will be propagated at document level. * “AVERAGE” Indicates that the average score (by article contribution) across all articles will be recalculated for each category and the category propagated at document level if its standard average is above the clustering threshold. * “COMMON” Indicates that only the categories in common between all articles of the document will be propagated at document level with average score over all articles (if that average is above the clustering threshold). * “NONE” Disables clustering of article metadata. * “RMS” Indicates that the average score (by article contribution) across all articles will be recalculated for each category and the category propagated at document level if its root mean square average is above the clustering threshold. |
| FieldsToHash | .* | This sets the fields that CS uses for its hash calculation. The default is all textual fields (“.*”). To be more specific, supply a regex that matches the required fields. For example, given two properties Author and ModDate, the regex “Auth.*\|Mod.*” means that CS uses only those two properties to calculate its hash. |
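The difference between the AVERAGE, COMMON and RMS clustering types is easiest to see with numbers. The sketch below is an interpretation of the descriptions in the table, not the server's actual implementation: it assumes that an article without a category contributes a score of 0, and it borrows the value 48 from the threshold default purely as an illustrative clustering threshold. The function name `cluster_scores` and the sample scores are hypothetical.

```python
import math


def cluster_scores(articles, clustering_type, threshold):
    """Aggregate per-article category scores to document level.

    articles: list of dicts mapping category name -> score for one article.
    Assumption: an article that lacks a category contributes a score of 0.
    Only AVERAGE, COMMON and RMS are sketched here.
    """
    if not articles:
        return {}
    n = len(articles)
    categories = set().union(*articles)
    document_scores = {}
    for category in sorted(categories):
        scores = [article.get(category, 0.0) for article in articles]
        if clustering_type == "AVERAGE":
            value = sum(scores) / n
        elif clustering_type == "RMS":
            value = math.sqrt(sum(s * s for s in scores) / n)
        elif clustering_type == "COMMON":
            # Keep only categories that every article carries.
            if not all(category in article for article in articles):
                continue
            value = sum(scores) / n
        else:
            raise ValueError(f"Unsupported clustering type: {clustering_type}")
        if value > threshold:
            document_scores[category] = value
    return document_scores


# Made-up scores for a two-article document.
articles = [
    {"Finance": 80.0, "Risk": 50.0},
    {"Finance": 10.0, "Risk": 50.0},
]
for mode in ("AVERAGE", "RMS"):
    print(mode, cluster_scores(articles, mode, threshold=48))
```

Under these assumptions, "Finance" has a standard average of 45 and is dropped by AVERAGE, while its root mean square of about 57 exceeds the threshold, so RMS propagates it. A root mean square gives more weight to a strong score in a single article.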
CSTI (Classification Server Test Interface) “Fields” Setting
To configure the fields that are available on the CSTI UI, add or modify the <value> list in this bean:
<bean id="displayConfiguration" class="config_only" >
  <property name="MetaFields" >
    <list>
      <value>Example</value>
    </list>
  </property>
</bean>
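To check which fields a bean like this exposes, the `<value>` entries can be read with a standard XML parser. The sketch below parses the fragment shown above as a standalone string; it assumes the bean has no XML namespace, which may not hold for a complete config.xml. The function name `csti_meta_fields` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The bean fragment from this section, parsed on its own.
BEAN_XML = """
<bean id="displayConfiguration" class="config_only">
  <property name="MetaFields">
    <list>
      <value>Example</value>
    </list>
  </property>
</bean>
"""


def csti_meta_fields(bean_xml):
    """Return the MetaFields values listed in a displayConfiguration bean."""
    bean = ET.fromstring(bean_xml.strip())
    values = bean.findall("./property[@name='MetaFields']/list/value")
    # Skip empty <value/> elements and trim surrounding whitespace.
    return [value.text.strip() for value in values if value.text]


print(csti_meta_fields(BEAN_XML))
```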