
Semaphore Classification and Language Service (CLS)

Appendix - XML DTD

  • Last Updated: May 13, 2026

Request XML DTD

The XML DTD of requests to Classification Server is:

<!-- Classification Server (Version 7.13) Request DTD -->
<!-- (c) 2011 Smartlogic Semaphore Ltd -->

<!-- High level element is normally "request" -->

<!ELEMENT request (document?)>
<!ATTLIST request
          op (CLASSIFY | PUBLISH | TEST | PUBLISH_ADDITION | STATS | LISTRULENETCLASSES) #REQUIRED>

<!-- Legacy requests have a high level element of "document" -->

<!ELEMENT document (
  title?, 
  path?, 
  body?, 
  feedback?,
  singlearticle?,
  multiarticle?,  
  min_average_article_pagesize?,
  num_articles_processed_in_singlepass?,
  char_count_cutoff?,
  stylesheet?,
  use_generated_keys?,
  language?,
  debug?,
  splitting_template?,
  operation_mode?,
  clustering?,
  document_score_limit?,
  empty_article_ignores_metadata?,
  threshold?,
  META* )>

<!ELEMENT title (#PCDATA)>
<!ELEMENT path (#PCDATA)>
<!ELEMENT body (#PCDATA)>
<!ATTLIST body
          type (TEXT | HTML) "TEXT" >
<!ELEMENT feedback (#PCDATA)>
<!ELEMENT singlearticle EMPTY>
<!ELEMENT multiarticle EMPTY>
<!ELEMENT min_average_article_pagesize (#PCDATA)>
<!ELEMENT num_articles_processed_in_singlepass (#PCDATA)>
<!ELEMENT char_count_cutoff (#PCDATA)>
<!ELEMENT stylesheet EMPTY>
<!ELEMENT use_generated_keys EMPTY>
<!ELEMENT language (#PCDATA)>
<!ELEMENT debug (#PCDATA)>
<!ELEMENT splitting_template (#PCDATA)>
<!ELEMENT operation_mode (#PCDATA)>
<!ELEMENT clustering EMPTY>
<!ATTLIST clustering
          type (ALL | AVERAGE | COMMON | NONE | RMS | AVERAGE_INCLUDING_EMPTY | COMMON_INCLUDING_EMPTY | RMS_INCLUDING_EMPTY) "RMS"
          threshold CDATA "48" >
<!ELEMENT document_score_limit (#PCDATA)>
<!ELEMENT empty_article_ignores_metadata (#PCDATA)>
<!ELEMENT threshold (#PCDATA)>
<!ELEMENT META EMPTY>
<!ATTLIST META
          name CDATA #REQUIRED
          value CDATA #REQUIRED >


Note: This DTD is accessible via URL cs_7_13_request.dtd.

The elements and attributes have the following meaning:

  • OPERATION (the op attribute of the request element) has the following values (case insensitive):
    • “CLASSIFY” - Classify a document
    • “PUBLISH” - Publish/republish a rulebase
    • “COLLECT” - Collect the classification statistics
    • “TEST” - Classify a document with diagnostics mode on
    • “STATS” - Return statistics regarding classification.
  • TITLE defines the title of the document. If a title is found in the document defined in PATH, the value of TITLE overrides it.
  • BODY is treated as the body of the document if PATH is not specified or if the document defined by PATH cannot be fetched. The nature of the BODY is defined by @TYPE.
  • BODY@TYPE indicates how the provided BODY should be treated:
    • “UNKNOWN” - Have Classification Server guess the format of the BODY (this is the default if no TYPE is specified).
    • “TEXT” - Treat the BODY as text data.
    • “HTML” - Treat the BODY as HTML data.
  • PATH is treated as the URI of the document to be classified. Supported protocols are FTP, FTPS, TFTP, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP. Note that FTPS & HTTPS are only available if an SSL implementation is available on the server.
  • If SINGLEARTICLE is present in the request, the classification will not attempt to split the document into articles, so the document is classified as a whole. Similarly, if MULTIARTICLE is present, the system will attempt to split the document into articles; this is the default behaviour. These two options are mutually exclusive.
  • If FEEDBACK is present in the request, the output will not only be populated with the classification results, but will also include the text from the document with auditing information.
  • If STYLESHEET is present in the request, the classification server output will include a stylesheet definition so the requesting browser can perform a client-side XSL transform if supported.
  • THRESHOLD defines the minimum score a category should reach before being included in the results. If this is not specified, the THRESHOLD defined in the configuration is used.
  • CLUSTERING defines how the classification of articles is propagated at the document level when multiple articles are found in a document and processed separately. If CLUSTERING is not present in the request the values defined in the Classification Server configuration file are used.
  • CLUSTERING@TYPE has the following values (case insensitive):
    • “ALL” - Indicates that all the categories defined for all articles will be propagated at document level.
    • “AVERAGE” - Indicates that the average score (by article contribution) across all non-empty articles will be recalculated for each category and the category propagated at document level if its standard average is above the clustering threshold.
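As an illustration of the request structure described above, the following is a minimal sketch that assembles a CLASSIFY request with Python's standard xml.etree.ElementTree module. The function name build_classify_request and its parameters are hypothetical, and the sketch only builds the XML string; how the request is sent to Classification Server is not covered here.

```python
import xml.etree.ElementTree as ET


def build_classify_request(title, body_text, single_article=False,
                           threshold=None, meta=None):
    """Build a CLASSIFY request as an XML string (hypothetical helper)."""
    request = ET.Element("request", {"op": "CLASSIFY"})
    document = ET.SubElement(request, "document")

    # Children are added in the order the request DTD lists them:
    # title, body, singlearticle, clustering, threshold, META*.
    ET.SubElement(document, "title").text = title
    body = ET.SubElement(document, "body", {"type": "TEXT"})
    body.text = body_text

    if single_article:
        # EMPTY element: its presence alone disables article splitting.
        ET.SubElement(document, "singlearticle")

    # clustering carries its settings as attributes; RMS and 48 are the
    # defaults declared in the DTD.
    ET.SubElement(document, "clustering", {"type": "RMS", "threshold": "48"})

    if threshold is not None:
        ET.SubElement(document, "threshold").text = str(threshold)

    # Optional META name/value pairs supplied by the caller.
    for name, value in (meta or {}).items():
        ET.SubElement(document, "META", {"name": name, "value": value})

    return ET.tostring(request, encoding="unicode")


# Example values are made up for illustration.
print(build_classify_request("Quarterly report", "Revenue rose this quarter.",
                             threshold=50, meta={"source": "example"}))
```

Elements that are omitted fall back to the behaviour described in the list above, for example the THRESHOLD defined in the configuration.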

Error XML DTD

The XML DTD of the error response generated by Classification Server is as follows:

<!-- Classification Server (Version 7.8) Error DTD -->
<!-- (c) 2010 Smartlogic Semaphore Ltd -->

<!-- High level element is "results" -->

<!ELEMENT results (error)>
<!ATTLIST results
          name CDATA #REQUIRED> 

<!ELEMENT error (#PCDATA)>
<!ATTLIST error
          id CDATA #REQUIRED>


Note: This DTD is accessible via URL cs_7_8_error.dtd.
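A client can check for this error structure before processing a response. The sketch below uses Python's standard library; the function name parse_error is hypothetical, and the id and message in the sample are made-up values rather than real Classification Server output.

```python
import xml.etree.ElementTree as ET


def parse_error(xml_text):
    """Return (id, message) for an error document, or None otherwise."""
    root = ET.fromstring(xml_text)
    # Per the error DTD, the top-level element is "results" with one "error" child.
    if root.tag != "results":
        return None
    error = root.find("error")
    if error is None:
        return None
    return error.get("id"), (error.text or "").strip()


# Hypothetical sample shaped after the error DTD.
sample = '<results name="error"><error id="42">Something went wrong</error></results>'
print(parse_error(sample))
```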

Response XML DTD

The XML DTD of the response generated by Classification Server is as follows:

<!-- Classification Server (Version 4.1.20) Response DTD -->
<!-- (c) 2017 Smartlogic Semaphore Ltd -->

<!-- High level element is "response" -->

<!ELEMENT response (#PCDATA | STRUCTUREDDOCUMENT | Overall | Acquisition | DateZoner | Evaluation | Finalisation | Lexer | Parser | Splitter | languages)*>

<!-- Standard "classify" request output -->

<!ELEMENT STRUCTUREDDOCUMENT (URL, HASH?, (META | SYSTEM | rule_evidence)*, 
                             ARTICLE*,( PARAGRAPH | OBJECT | FIELD | EMAIL_FIELDS )*)>

<!ELEMENT URL (#PCDATA)>
<!ELEMENT META (META*)>
<!ATTLIST META
          name CDATA #REQUIRED
          value CDATA #REQUIRED
          id CDATA #IMPLIED
          score CDATA #IMPLIED
          CandidateKey CDATA #IMPLIED
          original_key CDATA #IMPLIED
          key CDATA #IMPLIED >

<!ELEMENT SYSTEM EMPTY>
<!ATTLIST SYSTEM
          name CDATA #REQUIRED
          value CDATA #REQUIRED >

<!ELEMENT HASH EMPTY>
<!ATTLIST HASH
          value CDATA #REQUIRED >

<!ELEMENT ARTICLE (TITLE?, (META | SYSTEM | rule_evidence)*, (PARAGRAPH|OBJECT|FIELD|EMAIL_FIELDS)* ) >

<!ELEMENT EMAIL_FIELDS ((PARAGRAPH|FIELD)* )>

<!ELEMENT rule_evidence ( Clustered | (rule | EvidenceTruncated)*)>
<!ATTLIST rule_evidence
          category CDATA #REQUIRED
          class CDATA #REQUIRED >

<!ELEMENT Clustered ( ArticleDetails* )>
<!ATTLIST Clustered
          type CDATA #REQUIRED >

<!ELEMENT ArticleDetails EMPTY>
<!ATTLIST ArticleDetails
  Index CDATA #REQUIRED
  score CDATA #REQUIRED
  NonEmptyScores CDATA #REQUIRED
    >

<!ELEMENT rule EMPTY>
<!ATTLIST rule
          key CDATA #REQUIRED
          type CDATA #REQUIRED
          score CDATA #REQUIRED
          index CDATA #IMPLIED
          RuleBase CDATA #IMPLIED
          id CDATA #IMPLIED
          depth CDATA #IMPLIED
          original_key CDATA #IMPLIED
          evaluated CDATA #IMPLIED
          CandidateKey CDATA #IMPLIED
          NodeIndex CDATA #IMPLIED
          Offset CDATA #IMPLIED
          triggers CDATA #IMPLIED
          subtype CDATA #IMPLIED
          data CDATA #IMPLIED >

<!ELEMENT EvidenceTruncated EMPTY>
<!ATTLIST EvidenceTruncated
          TotalEvidenceRules CDATA #REQUIRED 
          EvidenceRulesAdded CDATA #IMPLIED>

<!ELEMENT FIELD (#PCDATA|KEY|FIELD|PARAGRAPH)*>
<!ATTLIST FIELD NAME CDATA #REQUIRED >

<!ELEMENT TITLE (PARAGRAPH*)>
<!ELEMENT PARAGRAPH (#PCDATA|KEY|FIELD)*>
<!ELEMENT OBJECT ((PARAGRAPH|OBJECT|FIELD|EMAIL_FIELDS)*)>
<!ELEMENT KEY (#PCDATA|KEY)*>
<!ATTLIST KEY
          ID CDATA #REQUIRED >

<!-- Statistics response (request "stats") -->

<!ELEMENT Overall (Classify | Exception | Publish | count | http | value)*>
<!ELEMENT Classify (count | value)*>
<!ELEMENT count (#PCDATA)>
<!ELEMENT value (#PCDATA)>
<!ELEMENT Exception (#PCDATA | Data_Kept)*>
<!ELEMENT Data_Kept (#PCDATA)>
<!ELEMENT Publish (count | value)*>
<!ELEMENT http (#PCDATA)>

<!ELEMENT Acquisition (Exception | count | value)*>

<!ELEMENT DateZoner (count | value)*>

<!ELEMENT Evaluation (Processed | count | value)*>
<!ELEMENT Processed (#PCDATA)>

<!ELEMENT Finalisation (articles_processed | count | documents_processed | value)*>
<!ELEMENT articles_processed (#PCDATA)>
<!ELEMENT documents_processed (#PCDATA)>

<!ELEMENT Lexer (Units_Processed | count | value)*>
<!ELEMENT Units_Processed (#PCDATA)>

<!ELEMENT Parser (count | pdf | processed | text | value)*>
<!ELEMENT pdf (#PCDATA)>
<!ELEMENT processed (#PCDATA)>
<!ELEMENT text (#PCDATA)>

<!ELEMENT Splitter (ArticlesMade | DocumentsSplit | count | value)*>
<!ELEMENT ArticlesMade (#PCDATA)>
<!ELEMENT DocumentsSplit (#PCDATA)>

<!-- Legacy response -->

<!ELEMENT results (class?, error?, error_detail?, version?, warnings?)>
<!ATTLIST results name (error | processdocument | version) "processdocument">

<!ELEMENT warnings (warning*)>
<!ELEMENT warning (warning_detail)>
<!ATTLIST warning id CDATA #REQUIRED>
<!ELEMENT warning_detail (#PCDATA)>

<!-- NOTE: Non-legacy requests will return errors using this format also. -->

<!ELEMENT error (#PCDATA)>
<!ATTLIST error id CDATA #REQUIRED>
<!ELEMENT error_detail (#PCDATA)>

<!ELEMENT version EMPTY>
<!ATTLIST version number CDATA #REQUIRED>

<!ELEMENT class (term*)>
<!ATTLIST class name CDATA #REQUIRED>

<!ELEMENT term EMPTY>
<!ATTLIST term 
          name CDATA #REQUIRED
          score CDATA #REQUIRED>

<!-- Elements used by language request response -->

<!ELEMENT languages (language)*>
<!ATTLIST languages 
          type (Language_Pack|Standard) #REQUIRED>

<!ELEMENT language EMPTY>
<!ATTLIST language 
          id CDATA #REQUIRED
          name CDATA #REQUIRED
          display CDATA #IMPLIED    
          default (true) #IMPLIED
          has_rules_defined (true) #IMPLIED>                   

Note: This DTD is accessible via URL cs_4_1_20_response.dtd.
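The statistics part of this DTD nests simple text elements (count, value and so on) under one element per processing stage. The following sketch shows one possible way to flatten such a response into Python dictionaries. The function name stats_to_dict and the sample figures are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def stats_to_dict(element):
    """Recursively convert a statistics element into nested dicts of text."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        # The DTD allows repeated children (for example several count
        # elements), so every tag maps to a list of values.
        # Text mixed in alongside child elements (as Exception permits)
        # is ignored by this sketch.
        result.setdefault(child.tag, []).append(stats_to_dict(child))
    return result


# Hypothetical sample shaped after the DTD; the numbers are made up.
sample = """<response>
  <Overall><Classify><count>12</count></Classify><http>3</http></Overall>
  <Splitter><DocumentsSplit>4</DocumentsSplit><ArticlesMade>9</ArticlesMade></Splitter>
</response>"""

stats = stats_to_dict(ET.fromstring(sample))
print(stats["Splitter"][0]["ArticlesMade"])
```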

Notes:

  1. The information returned from a “debug” request and covered by this schema may change at any time because of the nature of the information being provided. For this type of request, treat this schema as a guideline that may not be strictly adhered to.
  2. META elements are used to store the classification results. META elements can be found at STRUCTUREDDOCUMENT or ARTICLE level. If the STRUCTUREDDOCUMENT contains articles then the document level META elements are derived from the ARTICLE level ones, as per the aggregation parameters set out in the XML request. CandidateKey is used as an alternative key for markup when the scored rule is a template rule. This is because a particular rule may be scored for several candidates and using this key allows the evidence for each scored candidate to be displayed.
  3. SYSTEM elements are used to store properties of the document, such as its nature (PDF, WORD, etc.) and its author when available.
  4. Rule evidence is only output in diagnostics mode and provides the full list of subsidiary rules which evaluate up to the particular category level - these are individually marked up in the text where appropriate.
  5. The “KEY” elements are used to surround elements of text identified as evidence by the rulebases for auditing purposes. Their identifier matches the key attribute of the META element.
  6. “OBJECT” elements are returned when the incoming document contains embedded objects (typically within Microsoft Office documents). If the nested object is of an unrecognised format (e.g. an embedded sound or image) then the returned object will contain a single warning message within a paragraph object.
  7. “OBJECT”, “TITLE” and “PARAGRAPH” elements are only present if feedback was requested in the incoming request.
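Since classification results are carried by META elements at both STRUCTUREDDOCUMENT and ARTICLE level (note 2), a client typically reads the two levels separately. The sketch below is one way to do that with Python's standard library. The function name extract_classifications is hypothetical, and the URL, category names and scores in the sample are made up.

```python
import xml.etree.ElementTree as ET


def extract_classifications(xml_text):
    """Collect document-level and article-level META results from a response."""
    root = ET.fromstring(xml_text)
    doc = root.find("STRUCTUREDDOCUMENT")
    if doc is None:
        return None

    def metas(parent):
        # findall("META") matches direct children only, so document-level
        # results are not mixed with article-level or nested ones.
        return [
            {"name": m.get("name"), "value": m.get("value"), "score": m.get("score")}
            for m in parent.findall("META")
        ]

    return {
        "url": doc.findtext("URL"),
        "document": metas(doc),
        "articles": [metas(article) for article in doc.findall("ARTICLE")],
    }


# Hypothetical sample shaped after the response DTD; all values are made up.
sample = """<response><STRUCTUREDDOCUMENT>
  <URL>file:///tmp/example.txt</URL>
  <META name="Topic" value="Finance" score="87"/>
  <ARTICLE><META name="Topic" value="Finance" score="91"/></ARTICLE>
</STRUCTUREDDOCUMENT></response>"""

result = extract_classifications(sample)
print(result["document"])
```

The score attribute is declared as #IMPLIED in the DTD, so it can come back as None and should be checked before use.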