Article handling and splitting

Last Updated: May 13, 2026

Overview

Generally, documents are split into articles for classification (though singlearticle mode may be requested if required).

Each article is then classified independently; that is, the graph of rules is evaluated for the text in each article. The resulting fired categories are then grouped via a statistical calculation to determine the entire document's classification.

The main benefit of using articles is the reduction of complexity required when writing the rules themselves.

If articles are not used, there is a large difference in the number of words between, say, a 200-page PDF report and a two-paragraph email. Whilst it is possible to write rules that take this into account, it is much simpler to always assume that the document is “small”. What “small” actually means is less important, as long as “large” documents are broken into “smaller” chunks somehow.

Various algorithms have been tried to determine sensible article breaks. Lexical clustering (which can work well for identifying article breaks in a newspaper-style document) turned out not to work well in practice; using the inherent document structure (clues such as headers in larger fonts, underlining, title casing, etc.) gives better results.

There is also support for identifying specific document types and controlling the splitting into articles based on a document template (an XML document given to CS) rather than letting CS guess automatically.

In certain situations, using document templates to identify documents and split them into articles can work extremely effectively. For example, if a large number of the documents sent for classification are of a few specific types and formats that contain a large amount of boilerplate text, a template that identifies these documents, splits them into the appropriate articles and marks the boilerplate text as not for classification can work well.

Typically, though, we do not recommend this approach, mainly because of maintainability. Our experience has been that document templates may work well for a period of time, but the document formats tend to change over time. The people changing those formats often have no idea that doing so will affect classification, so updating the article-splitting templates is not high on their agenda.

The following sections document the language and structure used to write document templates to control the splitting, if that is required.

Detailed process flow

The article handling component of the classification process is responsible for:

Using the features found in the normalised document, assess whether the document is an instance of one of the templates
If a template is identified, use the template definition to
- Split the document on the defined break points
- Optionally remove the headers and footers of each page
- Optionally remove the table of contents if defined or present
- Honour the defined exclusions such as disclaimers, revision control sections or lengthy tables of numbers
If the document does not match a template
- Analyse the distribution of style attributes to identify section headers and other prominent features
- Use the identified features as natural article break points
- If the average size of the articles found falls below the level specified in the classification request, revert to a single article
Apply any determinators. These allow identification of articles after any splitting, and may apply equally to statistical splitting or template-based splitting
Perform the determinator actions, which may exclude articles or change the field in which an article's body text will be classified
Add the article definitions to the normalised document description

Template XML description

The description of what identifies a document as a member of a class of documents, and therefore how to split it, is contained in an XML document. The valid elements are:

    document-template
        title
        meta
        headers
            header
        footers
            footer
        split-point
        determines

The elements are all either potential split points, identifiers or determinators. Each node should have attributes that provide sufficient detail to locate it within a document using the template. Those marked as identifiers are required: if they are all present we have a match; if any are missing, we continue searching.

For example:

<document-template file="newsletter.dot" name="Weekly market roundup">
    <title font-size="16" font-weight="bold" value="Weekly market roundup" page="1" identifier="true"/>
    <meta name="Location" value=".*_EU" identifier="true"/>
    <headers>
        <header font-size="10" value="Investment Grade Credit Research" page="1" identifier="true"/>
        <header font-size="12" value="Weekly market roundup" identifier="true"/>
    </headers>
    <footers>
        <footer font-size="8" value="www.somebank.com Please refer to important
          information found at the end of the report." page="1" identifier="true"/>
        <footer font-size="12" value="Investment Grade Credit Research" identifier="true"/>
    </footers>
    ...

If the two headers, the two footers, the title and the Location META match (all are required) we have a match and this template would be used to split the document.

Generally it is better to be as specific as possible with your identifiers - simply looking for a title containing “test” would be a poor template in production use since many documents from different sources could have a title containing the word “test” and so would get split by the template.
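
As a rough illustration of this point, the fragment below contrasts a loose identifier with a tighter one. It uses only attributes that appear elsewhere in this document; the file names, template names and the title text in the second template are hypothetical, and the two templates would live in separate template files.

```xml
<!-- Too loose: any document whose title contains "test" would match -->
<document-template file="loose.dot" name="Loose example">
    <title value=".*test.*" identifier="true"/>
</document-template>

<!-- Tighter: exact title text, font size and page, plus a metadata check.
     All identifiers must be found for the template to match. -->
<document-template file="tight.dot" name="Tight example">
    <title font-size="16" value="Quarterly test results" page="1" identifier="true"/>
    <meta name="Location" value=".*_EU" identifier="true"/>
</document-template>
```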

The next part of the XML describes how to split the document. For the Weekly market roundup template above, for example, font size and text value should be enough to isolate each section. All subsequent text of the same or smaller font size is treated as part of the section (unless it fits the description of one of the following or other split-points).

    ...
    <split-point  font-size="16" value="Highlights"/>
    <split-point  font-size="16" value="Traders' comments"/>
    <split-point  font-size="16" value="Opinions/Analysis">
        <split-point  font-size="10" font-weight="bold" />
    </split-point>
    <split-point  font-size="16" value="Financial recommendations"/>
    <split-point  font-size="16" value="Credit charts">
        <split-point  font-size="11" value="Bank Calendar"/>
        <split-point  font-size="11" value="Rating changes"/>
        <split-point  font-size="11" value="New issue monitor">
            <split-point  font-size="10" value="Launched and Priced"/>
            <split-point  font-size="10" value="Pipeline"/>
        </split-point>
    </split-point>
    <split-point  font-size="11" value="IMPORTANT DISCLOSURES" identifier="true" exclude="true"/>
    ...
    <determines action="change_field_for_article" field_name="body/body_recommendations">
       <title font-size="16" value="Financial recommendations"/>
    </determines>
</document-template>

Once a template is matched to a particular document (using the identifiers) then each paragraph of the document is checked in turn.

If the paragraph matches a header or footer definition then the paragraph is removed. Note that this behaviour has been less important since CS version 7.4, because any headers and footers determined by the parsers are automatically added as a separate field by the system, irrespective of whether any template matching is performed. This means that body-restricted rules will not fire on text in these headers and footers, so removing the text within the splitter matters less.

If the paragraph matches a split point then a new article is started at this point. If the split-point has child split-points defined, these are also checked against subsequent paragraphs until a match with a sibling split-point is found (i.e. the child split-points only have scope within the parent article).

The attributes that may be used for a split-point are:

Attribute	Use	Notes
font-size	specifies a font-size for the paragraph	only paragraphs that have this size will be considered
page	specifies the page the paragraph occurs on	This information is only available for PDF documents
font-style	specifies the style of font used by the paragraph	May be some combination of “bold”, “italic” or “underlined”. Note that only the first word of the paragraph is checked
style-name	specifies the name of the style used	currently style names are only retrieved from inso documents (Word etc.)
value	specifies a regex match for the text of the paragraph	see the Boost documentation for details of regex syntax. Note this is a match rather than a regex search on each paragraph in the document. This means that the paragraph has to match the specified value exactly rather than simply contain it. To perform the equivalent of a search, use regex syntax to specify what may be ignored, e.g. “.*Financial recommendations.*” would match any paragraph containing the text “Financial recommendations” (.* means any character in any number of positions)
exclude	excludes the article following the split point	a value of “true” means exclude following article - note determinators give a more complete mechanism for managing excludes
name	names a metafield to check	attribute only has meaning to a META identifier (or determinators)
pop	If set to “true” will remove child split points when matched	This allows a child split point to stop siblings from being found later in document - new in 7.7
remove_value	If set to “true” will skip the matched paragraph	ie it will start a new article at the next paragraph (which will become the article title) and will remove the matched value from the document - new in 7.7
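
A short sketch of how some of these attributes might be combined. The section titles “Section N” and “Appendix” are hypothetical, and the comments simply restate the table above; the exact effect of pop and remove_value should be confirmed against your CS version (both are listed as new in 7.7).

```xml
<!-- value is a full match, so wrap the text in .* to get "contains" behaviour -->
<split-point font-size="16" value=".*Financial recommendations.*"/>

<!-- remove_value: skip the matched paragraph; the article starts at the
     next paragraph, which becomes the article title -->
<split-point font-style="bold" value="Section [0-9]+" remove_value="true"/>

<split-point font-size="16" value="Credit charts">
    <split-point font-size="11" value="Bank Calendar"/>
    <!-- pop: when this child matches, the child split points are removed,
         so its siblings are no longer looked for later in the document -->
    <split-point font-size="11" value="Appendix" pop="true"/>
</split-point>
```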

The regular expression matches are Unicode aware, so any Unicode character may be used to specify the match (there are also a few extra regular expression controls that apply only to Unicode text).

Example of exclusion :

 <split-point  font-size="11" value="IMPORTANT DISCLOSURES" identifier="true" exclude="true"/>

This prevents the IMPORTANT DISCLOSURES section of the document from reaching the classification stage of the process, so it cannot affect the classification results. The results can then later be compared with the disclosure statements by, for example, a workflow engine.

Articles may also be handled using a determinator. The following example moves the matching article into a different field; the delete_article action (see the actions table below) would exclude it outright:

    <determines action="change_field_for_article" field_name="body/body_recommendations">
       <title font-size="16" value="Financial recommendations"/>
       <paragraph value=".*recomends.*"/>
    </determines>

Unlike split-points, determinators are applied after all the split-points have been found and the document has been split into articles. They may be applied to multiple templates, including default splitting (which uses statistical analysis of font sizes). In other words, determinators are not used to determine split-points, only to perform actions on articles that have already been split.

The syntax for determinators is similar to that for split-points.

The determines tag may have the following attributes:

Attribute	Purpose	Notes
name	names a determinator	names only currently have use for reporting purposes - when an action is performed this name is output to the runtime log in a message specifying the action taken
applies_to	specifies the name(s) of template(s) to which this determinator applies	if not specified then the determinator only applies to the template in which it is created. The template names are given as a comma-separated list, which supports * and ? wildcards
action	specifies the action to take once an article is determined	See table below for possible actions
field_name	specifies an alternative name for the “body” field to be used by the article	only used with an appropriate action. Note that the field name uses an XPath-style syntax, so “body/body_introduction” will use a child field of body to contain the text. NB it is good practice (but not required) to start these field names with the text “body_”, since the display of feedback evidence handles any field that begins with body as if it were body text, i.e. it is not hidden when fields are not displayed etc.

The determines tag may have the following children:

Tag	Purpose	Notes
title	specifies a match for the article title	note this is the article title and not the document title. The syntax is the same as that for a split-point, i.e. font-size, value etc.
paragraph	specifies a match which is checked against each paragraph of text in the article	uses the same format as split-point
article_index	specifies a match based on the index of the article in the document

Note that a determinator must be fully matched (i.e. all child tags found) before an action is performed.

The valid actions for a determinator are:

Action	Purpose	Notes
delete_article	deletes current article	same behaviour as exclude=“true” for a split point - but may be applied to articles split in default manner
delete_article_and_all_following	deletes determined article and all subsequent articles in the document
delete_article_and_all_previous	deletes all previous articles and the determined article
delete_all_previous	deletes all previous articles but leaves the determined article	determined article becomes the 1st article for the document
delete_article_and_all_from_marked	deletes all articles from a marked article up to and including the currently determined article	the marked article is one found earlier in the document by another determinator with action mark_article. Note that currently only a single marked article is supported (i.e. named marking is not supported). If no article has been marked then this action will report an error and not remove any articles
delete_all_from_marked	deletes all articles from (and including) the marked article but leaves the currently determined article in the document
mark_article	marks an article so that a further determination may take action from this article	only one article may be marked at a time. If an article is already marked and no action has been taken using this mark then a warning is output and the current article is marked
change_field_for_article	changes the field for text in the article	uses the field_name attribute to specify the new name for the field - note this field name supports XPath style syntax so user can change the text to being a child of “body” or a sibling as required. NB by default field restricted rules search all child fields so a “body” restricted rule will still fire on text that is in “body/body_example”
change_field_for_article_and_all_following	changes the field for article and all subsequent articles
change_field_for_all_following	changes field for all subsequent articles (but not the currently determined one)
change_field_for_article_and_all_previous	changes field for all previous articles (and current)
change_field_for_all_previous	changes field for all previous articles (but not the currently determined one)
change_field_for_all_from_marked	changes field for all articles from (and including) a marked article (but not currently determined one)
change_field_for_article_and_all_from_marked	changes field for all articles from marked to current inclusive
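
The marked-article actions work as a pair of determinators. The sketch below assumes a document in which an unwanted block runs from an article titled “Legal notices” to one titled “Contact details”; both titles and both determinator names are hypothetical.

```xml
<!-- Mark the article where the unwanted block starts -->
<determines name="legal block start" action="mark_article">
    <title value="Legal notices"/>
</determines>

<!-- When the closing article is found, delete everything from the
     marked article up to and including this one -->
<determines name="legal block end" action="delete_article_and_all_from_marked">
    <title value="Contact details"/>
</determines>
```

Using delete_all_from_marked in the second determinator instead would keep the “Contact details” article, as described in the table above.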

The article_index child has the following attributes:

Attribute	Values	Default	Notes
operation	“==”,“>=”,“<=”,“<”,“>”	“==”	specifies the binary operation to use
data	numeric	none	index to check against (index is 0 based)

Note that an article index is matched for a specific article (say with index N)

iff (N “operation” [data]) is true

i.e. <article_index operation=">" data="10"/> will match all articles whose index is greater than 10
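
Two further sketches of article_index in use. The determinator names, the field name body/body_notes and the paragraph text are hypothetical. Note that in well-formed XML a literal “<” is not allowed inside an attribute value, so the “<” and “<=” operations need to be written as &lt; and &lt;=.

```xml
<!-- Delete the first article (index 0), for example a cover page -->
<determines name="drop first article" action="delete_article">
    <article_index operation="==" data="0"/>
</determines>

<!-- Move an article into a child field of body only when its index is
     below 3 AND one of its paragraphs matches (all children must match) -->
<determines name="early notes" action="change_field_for_article" field_name="body/body_notes">
    <article_index operation="&lt;" data="3"/>
    <paragraph value=".*Note.*"/>
</determines>
```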

Generally, if using determinators that apply to multiple templates (including the default statistical splitting), it is good practice to set them all up together in a specific template that does not match any documents (i.e. has no identifiers) and to name the determinators. For example:

<?xml version="1.0" encoding="utf-8"?>
<document-template file="Multiple Template Determinators.dot" name="Multiple Determinators">
    <determines applies_to="default,CSTest*" name="financial recommendations" action="change_field_for_article" field_name="body/body_recommendations">
       <title font-size="16" value="Financial recommendations"/>
       <paragraph value=".*recomends.*"/>
    </determines>
    <determines applies_to="*" name="test" action="delete_article" >
        <paragraph font-size="12" value=":w{0,5}test results.*" />
    </determines>
</document-template>

Semaphore Classification Server Rulebase Reference