Date Handling

Last Updated: July 8, 2026

Dates in a document are identified and marked as a zone of the document. This allows the expression rule to provide discovered date information for extraction, or to control other rule firings.

The date zoner supplied with Classification Server (CS) is implemented in two parts: date discovery and date normalisation. The functionality is split because dates may be discovered by various other zoners (for example, the language pack may be configured to discover dates). If some other zoner discovers dates as a by-product of other analysis, you can avoid the time penalty of running date discovery and still apply the normalisation rules from DateZoner to those dates, keeping date handling consistent.

Configuration

<bean id="DateZoner" class="DateZoner">
  <!-- Formats are split into those for discovery of dates in a document and those for normalising dates already discovered by
       some other process (typically TF). By default DateZoner just applies the normalisations. Enable the discovery patterns
       (and tune them for your requirements if necessary) if dates which you require are not being discovered. -->
  <!-- <property name="DiscoveryPatternsFile" value="../conf/DateFormatsDiscovery.txt"/> -->
  <property name="NormalisationPatternsFile" value="../conf/DateFormatsNormalisation.txt"/>

  <property name="dateFormatDay" value="yyyy-MM-dd"/>
  <property name="dateFormatMonth" value="yyyy-MM"/>
  <property name="dateFormatQuarter" value="yyyy-QQQ"/>
  <property name="dateFormatYear" value="yyyy"/>
</bean>

If date discovery is wanted for a particular install, uncomment the “DiscoveryPatternsFile” property and restart Classification Server (CS). This provides a reasonable starting point for date discovery, since the most commonly used date formats are listed as discovery patterns.

Note: The “DiscoveryPatternsFile” setting is not enabled by default because date discovery can have a noticeable performance impact. In particular, detecting quarters such as “1Q22” (meaning “1st quarter of 2022”) may significantly increase the classification time of Excel spreadsheets with large numbers of alphanumeric cells.

This performance cost can be addressed in a way that suits each installation, by considering questions such as:

Is it a problem for the corpus (does it contain many large Excel spreadsheets)?
Are financial quarter dates of interest for the project, or should they just be removed from the discovery formats?

However, balancing costs against benefits takes a small amount of time, so discovery is not configured by default, to avoid poor performance at the early stages of a project.

Discovery

This is controlled by a list of candidate date formats in a text file specified in the configuration. The syntax is one format per line, enclosed as { format="XXXX" }, for example:

{ format="d'st' MMMM y" }
{ format="yy-MM" resolution="month" earliest="1800-1-01" }
...

The format uses the Unicode CLDR date format characters. These are applied via the “ICU” date library, so the supported set may not be the current one, depending on update schedules (see the ICU date formats documentation for the full list of currently supported format characters).
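As an illustration of the file syntax only, the sketch below reads entries of the form shown above into dictionaries. It is not how CS itself loads the file; the helper name parse_pattern_line and the handling of non-entry lines are assumptions made for this example.

```python
import re

# Hypothetical helper, not part of CS: parses one line of a patterns file.
LINE_RE = re.compile(r'^\s*\{(.*)\}\s*$')          # the whole "{ ... }" entry
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')     # each key="value" pair

def parse_pattern_line(line):
    """Return the attributes of one '{ key="value" ... }' line, or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # blank line or anything that is not an entry
    attrs = dict(ATTR_RE.findall(m.group(1)))
    # An entry without a format is treated as invalid in this sketch.
    return attrs if "format" in attrs else None

lines = [
    '{ format="d\'st\' MMMM y" }',
    '{ format="yy-MM" resolution="month" earliest="1800-1-01" }',
]
entries = [parse_pattern_line(line) for line in lines]
for entry in entries:
    print(entry)
```

Keeping each entry as a plain dictionary preserves the order of the file, which matters because the first matching format is the one used for normalisation.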

The values used in the two formats above are:

d - Day in month (not zero padded)
'st' - Quoted literal text to match
MMMM - Full month name, e.g. September
y - Full year (including century)
yy - Two-digit year (without century)
MM - Month number (zero padded to two digits)
resolution (“day”, “month”, “quarter” or “year”) - Maximum resolution given by the date
earliest - Dates in this format that are earlier than this value are ignored

The first format discovers dates in English documents such as “1st September 1997”. The second, which matches text such as “97-11”, has an earliest date specified. This is unnecessary here, since the format only gives dates between 1930 and 2029, all of which are later than the earliest date given.
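The arithmetic behind the two-digit year case can be sketched as follows. The 1930 to 2029 window is taken from the text above; the actual window is decided by the ICU library and may differ, and both function names are hypothetical.

```python
from datetime import date

# Window start taken from the text above; an assumption, not a CS setting.
WINDOW_START = 1930

def expand_two_digit_year(yy):
    """Map a two-digit year onto the 100-year window starting at WINDOW_START."""
    century = WINDOW_START - WINDOW_START % 100   # 1900 for a 1930 window
    year = century + yy
    return year if year >= WINDOW_START else year + 100

def passes_earliest(year, month, earliest):
    """True when the first day of the given month is not before 'earliest'."""
    return date(year, month, 1) >= earliest

earliest = date(1800, 1, 1)                # earliest="1800-1-01"
year = expand_two_digit_year(97)           # the "97" of "97-11"
print(year, passes_earliest(year, 11, earliest))
```

Because every year in the window is after 1800, the earliest check can never reject a date for this format, which is the point made above.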

Note that formats are matched in the most specific way possible, so a different format is needed to match “2nd September 1997”, since the 'st' must be present to match the first format. Matching as strictly as possible gives the greatest flexibility and control in discovering dates, though you may need to add more format lines than you would expect in order to match a particular style of date formatting used in the corpus.
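To make the effect of strict matching concrete, the sketch below turns a small subset of pattern letters (d, MMMM, y and quoted literals) into a regular expression. This is an illustration of the idea only: CS matches through the ICU library rather than a regex, and pattern_to_regex is a hypothetical name.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Tokens: a quoted literal, one of the supported pattern fields, or any other character.
TOKEN_RE = re.compile(r"'([^']*)'|(MMMM|d|y)|(.)", re.S)

def pattern_to_regex(fmt):
    """Build a strict regex for a format using only d, MMMM, y and literals."""
    parts = []
    for literal, field, other in TOKEN_RE.findall(fmt):
        if field == "d":
            parts.append(r"(?P<day>[1-9]|[12]\d|3[01])")   # no zero padding
        elif field == "MMMM":
            parts.append("(?P<month>" + "|".join(MONTHS) + ")")
        elif field == "y":
            parts.append(r"(?P<year>\d{4})")
        else:
            parts.append(re.escape(literal or other))      # must match exactly
    return re.compile(r"\b" + "".join(parts) + r"\b")

rx = pattern_to_regex("d'st' MMMM y")
print(bool(rx.search("Signed on 1st September 1997.")))    # 'st' is present
print(bool(rx.search("Signed on 2nd September 1997.")))    # 'nd' is not 'st'
```

Covering “2nd”, “3rd” and “4th” in this scheme means adding one format per suffix, which mirrors the need for additional format lines described above.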

The formats dateFormatDay, dateFormatMonth, dateFormatQuarter and dateFormatYear given in the configuration define the normalised output form for the corresponding resolution.
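The relationship between resolution and output form can be sketched as below, using the default patterns from the configuration shown earlier. The function is a hypothetical stand-in for the ICU formatting that CS performs, and it assumes an English rendering of QQQ as an abbreviated quarter such as Q3.

```python
from datetime import date

def normalise(d, resolution):
    """Render a date the way the default dateFormat* patterns would."""
    if resolution == "day":        # dateFormatDay = yyyy-MM-dd
        return f"{d.year:04d}-{d.month:02d}-{d.day:02d}"
    if resolution == "month":      # dateFormatMonth = yyyy-MM
        return f"{d.year:04d}-{d.month:02d}"
    if resolution == "quarter":    # dateFormatQuarter = yyyy-QQQ
        return f"{d.year:04d}-Q{(d.month - 1) // 3 + 1}"
    if resolution == "year":       # dateFormatYear = yyyy
        return f"{d.year:04d}"
    raise ValueError(f"unknown resolution: {resolution}")

d = date(2023, 9, 9)
for resolution in ("day", "month", "quarter", "year"):
    print(resolution, normalise(d, resolution))
```

Each resolution drops the fields the source text did not supply, so a date discovered as “97-11” is never reported with an invented day.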

Note that two formats may match a particular date but give different date points with the same resolution. For example, “01/02/2020” could be the 1st of February 2020 or the 2nd of January 2020. In that case the first format listed in the discovery file is used to normalise the date, and the date is also marked as “date_ambiguous”. If there is no ambiguity in reading the date, it is marked as “date_unambiguous”. This gives control over date extraction: dates which may have been normalised incorrectly are flagged, and so can be sent for manual review.
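The ambiguity rule can be sketched in a few lines. This is an illustration, not the DateZoner implementation: Python strptime directives stand in for day-first and month-first CLDR patterns, and classify is a hypothetical name.

```python
from datetime import datetime

# Stand-ins for a day-first and a month-first format, in file order.
FORMATS = ["%d/%m/%Y", "%m/%d/%Y"]

def classify(text, formats=FORMATS):
    """Return (normalised_date, zone_name), or None when no format matches."""
    readings = []
    for fmt in formats:
        try:
            readings.append(datetime.strptime(text, fmt).date())
        except ValueError:
            continue  # this format does not match the text
    if not readings:
        return None
    # Ambiguous only when the matching formats disagree on the date point.
    zone = "date_ambiguous" if len(set(readings)) > 1 else "date_unambiguous"
    # The first matching format in the list supplies the normalised value.
    return readings[0].isoformat(), zone

for text in ("01/02/2020", "25/12/2020", "02/02/2020"):
    print(text, classify(text))
```

Note that “02/02/2020” matches both formats but is not ambiguous, because both readings give the same date.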

Note that some documents use more than one date style, so assuming that a date follows the format used elsewhere in the same document is unreliable.

This might seem like unnecessary complexity. However, to take a real-world example (discovered during a project implementation using CS), passport documentation in the US uses the “DD/MM/YYYY” format for the date of birth and the passport expiry date, since this form is mandatory for all passports worldwide. If the document is written by an American, they may well use “MM/DD/YYYY”, the more standard form in America, for other dates such as the date of passport application, without noticing the mismatch. Being explicit about which normalised form is selected, and flagging that it may be incorrectly normalised, creates a much more robust system when this matters, whilst still “just working” when ambiguity in date formats is of little relevance to the project.

    <union extract_group="date" extract_group_key="date" >
       <expression type="date" extract_name="date" extract_evidence="date raw" />       
       <expression type="date_ambiguous" extract_default="ambiguous:true" />
       <expression type="date_unambiguous" extract_default="ambiguous:false" />
    </union>

This extracts all dates (normalised by the first matching format in the list where ambiguous), groups them by normalised date, marks whether each date is possibly ambiguous, and provides the raw form taken from the document. If accurate extraction and normalisation of dates is critical to the project, ambiguous dates may be passed to an exception check stage during processing.
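A downstream exception check might separate the flagged dates as in the sketch below. The record layout is hypothetical: the keys date, raw and ambiguous simply mirror the extract names in the rule above, and the actual shape of the CS response is not shown in this document.

```python
def split_for_review(records):
    """Separate records flagged as ambiguous from those safe to accept."""
    accepted, review = [], []
    for rec in records:
        # In this sketch "ambiguous" is the string "true" or "false",
        # mirroring the extract_default values in the rule.
        if rec.get("ambiguous") == "true":
            review.append(rec)
        else:
            accepted.append(rec)
    return accepted, review

# Hypothetical extraction results for two dates in one document.
records = [
    {"date": "2020-02-01", "raw": "01/02/2020", "ambiguous": "true"},
    {"date": "1997-09-01", "raw": "1st September 1997", "ambiguous": "false"},
]
accepted, review = split_for_review(records)
print(len(accepted), "accepted;", len(review), "for manual review")
```

Keeping the raw form alongside the normalised date lets a reviewer see exactly what the document said before correcting the value.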

Normalisation

Normalisation is the part of the date zoner that determines ambiguity and normalises dates according to the order of the formats in the list. It can also be applied to dates discovered by zoners other than “DateZoner” when required.

The normalisation formats are typically much simpler than the discovery formats:

{ format="d M y" }
{ format="M d y" }
{ format="y MM dd" }
{ format="M yyyy" resolution="month" earliest="1800-1-1" }

In this case something has already determined that a piece of text is a date, so it is parsed leniently, ignoring extraneous information (such as 'st' in 1st), and far fewer formats are needed. As in the discovery case, the first format that matches (with the best resolution) is used as the normalisation form, so if “M d y” is preferred for the corpus, change the order of these formats in the file. Again, dates which are ambiguous are marked as “date_ambiguous” and those which have no ambiguity are marked as “date_unambiguous”, allowing more complex control when accuracy of normalisation is critical.

Return

To alter the format used to return dates, edit the DateZoner bean in the CS config file to use the appropriate Unicode date format pattern.

That is, if you wished to return dates in this format:

9 Sep 2023

you would use this format:

<property name="dateFormatDay" value="d MMM yyyy" />

If you wished to return the date in this format:

09 Sep 2023

you would use this format:

<property name="dateFormatDay" value="dd MMM yyyy" />

and so on.
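The effect of these pattern letters can be previewed with the small sketch below. It handles only the handful of fields used on this page, assumes English month abbreviations, and does not support quoted literals; CS itself formats through the ICU library, so treat this as an approximation.

```python
import re
from datetime import date

MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Supported pattern fields and how each is rendered.
FIELDS = {
    "dd": lambda d: f"{d.day:02d}",                # zero-padded day
    "d": lambda d: str(d.day),                     # day without padding
    "MMM": lambda d: MONTH_ABBR[d.month - 1],      # abbreviated month name
    "MM": lambda d: f"{d.month:02d}",              # zero-padded month number
    "yyyy": lambda d: f"{d.year:04d}",             # four-digit year
    "QQQ": lambda d: f"Q{(d.month - 1) // 3 + 1}", # abbreviated quarter
}
# Longest fields first so that "dd" is not read as two "d" fields.
TOKEN_RE = re.compile("|".join(sorted(FIELDS, key=len, reverse=True)))

def render(d, pattern):
    """Replace each supported field in the pattern with its value."""
    return TOKEN_RE.sub(lambda m: FIELDS[m.group(0)](d), pattern)

d = date(2023, 9, 9)
for pattern in ("d MMM yyyy", "dd MMM yyyy", "QQQ yyyy", "yyyy-MM-dd"):
    print(pattern, "->", render(d, pattern))
```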

Example:

<bean id="DateZoner" class="DateZoner">
  <!-- Formats are split into those for discovery of dates in a document and those for normalising dates already discovered by
       some other process (typically TF). By default DateZoner just applies the normalisations. Enable the discovery patterns
       (and tune them for your requirements if necessary) if dates which you require are not being discovered. -->
  <property name="DiscoveryPatternsFile" value="../conf/DateFormatsDiscovery.txt"/>
  <property name="NormalisationPatternsFile" value="../conf/DateFormatsNormalisation.txt"/>
  <property name="dateFormatDay" value="d MMM yyyy" />
  <property name="dateFormatMonth" value="MMM yyyy" />
  <property name="dateFormatQuarter" value="QQQ yyyy" />
  <property name="dateFormatYear" value="yyyy" />
</bean>
