Semaphore for NiFi Deployment Guide

Last Updated: April 5, 2026

Important: This documentation applies to all versions of the Semaphore for NiFi solution. Any version-specific features are annotated in this style.

Introduction

This document describes the installation and configuration of Semaphore in a NiFi solution. It assumes that a working base installation of Semaphore Classification Server is already in place and that NiFi is already installed and configured.

Note: One of the predominant use cases for Semaphore in NiFi is the classification of content from MarkLogic and the subsequent persistence of results back to it. The NiFi Semaphore processor has been tested with MarkLogic NiFi tools version 1.8.03. Other common use cases are the classification of data from record sources such as relational databases or spreadsheets.

The Semaphore for NiFi solution uses the Semaphore Classification Server product to automatically classify content as it passes through a NiFi flow. The processor can accept data to be classified in many forms in NiFi FlowFiles. Data can be the contents of the FlowFile, one or more FlowFile attributes identified as containing the body of data (when aggregated) to classify, or a FlowFile attribute identified as containing an external URL from which content can be accessed and classified. Further, FlowFile attributes can be identified as additional metadata supplementing the content to be classified. Output of classification can be specified as either RDF or Semaphore XML.

Important: An understanding of NiFi, Flows, Processors and FlowFiles is essential to using NiFi and the Semaphore processor. Please see the NiFi documentation for this information: http://nifi.apache.org/docs/nifi-docs/html/getting-started.html

The sample NiFi flow depicted reads a PDF file from the file system, classifies it using Semaphore (“semaphores” it), and writes the resulting classification XML to MarkLogic.

Installation

Installation of the Semaphore NiFi processor involves copying the distributed “nar” file into a NiFi nar library directory. For experimental and development purposes, the file can be copied directly into the NiFi lib directory at <<NiFi installation directory>>/lib. For longer-term production installations, copy it to an external NiFi nar library instead. See Core Properties.

This usually involves creating a directory external to the NiFi installation location and then modifying the NiFi configuration file to point at it. From the NiFi configuration documentation:

nifi.nar.library.directory
The location of the nar library. The default value is ./lib and probably should be left as is.

NOTE: Additional library directories can be specified by using the nifi.nar.library.directory. prefix with unique suffixes and separate paths as values.

For example, to provide two additional library locations, a user could also specify additional properties with keys of:

nifi.nar.library.directory.lib1=/nars/lib1
nifi.nar.library.directory.lib2=/nars/lib2

Providing three total locations, including nifi.nar.library.directory.

The NiFi configuration file is located in <<NiFi installation directory>>/conf/nifi.properties
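
As an illustration of the configuration change described above, the following Python sketch adds an extra nar library entry to the text of a nifi.properties file. It is a minimal example under stated assumptions: the suffix "semaphore" and the path /opt/nifi-extra-nars are hypothetical, and the helper name add_nar_library is not part of NiFi or Semaphore.

```python
def add_nar_library(properties_text, suffix, path):
    """Return properties_text with nifi.nar.library.directory.<suffix>=<path> set.

    An existing line for the same key is replaced; all other lines are kept.
    """
    key = f"nifi.nar.library.directory.{suffix}"
    new_line = f"{key}={path}"
    lines = properties_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith(key + "="):
            lines[i] = new_line  # update the existing entry in place
            break
    else:
        lines.append(new_line)  # no entry yet, so append one
    return "\n".join(lines) + "\n"


# Hypothetical suffix and directory, chosen only for this example.
example = "nifi.nar.library.directory=./lib\n"
print(add_nar_library(example, "semaphore", "/opt/nifi-extra-nars"), end="")
```

The default nifi.nar.library.directory entry is left untouched, in line with the NiFi guidance quoted above. The Semaphore nar file would then be copied into the new directory before NiFi is restarted.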

Important: Following installation of the nar file, start or restart NiFi.

Configuration

While there are many NiFi-related settings that can be made on a processor, the two on which to focus are Relationships and Properties. Others are for advanced use and described in the NiFi documentation – they are not specific to the Semaphore NiFi processor.

Relationships

Relationships tell NiFi how to direct the results of processing. The Semaphore processor exposes three relationships: success, original, and failure.

The “success” relationship will have the results of classification as the FlowFile content (not the original content). The success relationship may also have the original FlowFile attributes if the property to copy these is set to true (and this is the default - see properties below). The success relationship is usually routed to the next step(s) in the NiFi flow.

The “original” relationship sends the original FlowFile, unaltered, to the destination. This could be used for other, parallel processing of the source data.

The “failure” relationship will have the original contents and attributes of the FlowFile, plus an additional attribute, “CS-ERROR”, containing an indication of the error that was encountered. Failure is usually routed towards an error handling NiFi flow.

Relationships and Flow Mapping

Properties

The Semaphore NiFi processor is highly configurable. Each property has an associated help popup accessed by clicking on the ? icon. Most of these properties are described in more detail here: Classification and Language Service Test Interface.

Semaphore Processor Configuration Properties

These properties are exposed:

Classifier URL: URL to the classification server, which could be a cloud URL. Default value: localhost:5058
Score Threshold: The minimum score a concept must reach to qualify for inclusion in the results. Default is 48.
Cloud API Key: The API key for accessing Classification Server in the cloud. No default. The presence of a value for this property causes cloud classification to be performed.
Single or Multi Article, or Unknown: Treat content as a single article, multiple articles, or let CS decide. Default is “Single”.
Classifier Output Format: Return either CS XML or RDF XML. Default is RDF.
Title Mapping Attribute: The name of an attribute to treat as a title. No default. This property can contain a single attribute name or a semi-colon separated list of attribute names to concatenate together.
Source URL Attribute: The name of an attribute supplying a URL from which CS can pull content to classify.
Source Body Attribute: The name of an attribute supplying text to classify. This property can contain a single attribute name or a semi-colon separated list of attribute names to concatenate together.
Source Language Attribute: The attribute name which will contain an ISO code for the language of the source data.
Default Language: An ISO language code to use as a default language for the source data.
Classifier Source Language: The ISO two-character language code of the content. Default is empty, which has CS determine the language. Example is “EN”.
Metadata Fields: a semicolon separated list of FlowFile attributes to be sent to CS as metadata to classify.
Classifier Advanced Properties: A semicolon delimited list of settings that would be changed infrequently:
- connection.timeout=1000000; CS connection timeout setting.
- socket.timeout=1000000; CS socket timeout setting
- token.request.url=https://cloud.smartlogic.com/token; Cloud URL for creating an access token.
- source.encoding=UTF-8; Source content encoding.
- send.feedback=no; Send feedback in the XML document returned from CS. Only works if “Classifier Output Format” is set to “XML” and not “RDF.”
- classifier.results.encoding=UTF-8; Results encoding
- copy.attributes=true; Causes attributes from the source FlowFile to be copied to the results FlowFile.
- preserve.filename=true; Causes the filename attribute from the source FlowFile to be copied to the results FlowFile. This is important for maintaining an ID for downstream processing.
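
To make the shape of the “Classifier Advanced Properties” value concrete, here is a small Python sketch that splits such a string into individual settings. It assumes the value is a plain list of key=value pairs separated by semicolons, as the list above suggests. The function name parse_advanced_properties is hypothetical and this is not the processor's own parsing code.

```python
def parse_advanced_properties(raw):
    """Split a 'key=value;key=value' string into a dict of settings."""
    settings = {}
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon or blank entries
        key, sep, value = entry.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {entry!r}")
        settings[key.strip()] = value.strip()
    return settings


raw = "connection.timeout=1000000;socket.timeout=1000000;send.feedback=no;copy.attributes=true"
for name, value in parse_advanced_properties(raw).items():
    print(name, "=", value)
```

Splitting on the first “=” only keeps values such as the token.request.url setting intact, even if the URL were to carry its own “=” characters.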

Several of these properties are “mapped”. This means that the property in the processor configuration contains the name of the FlowFile attribute to use for that purpose. For example, the “Title Mapping Attribute” property would be set to the name of the FlowFile attribute that will contain a title, if any. That value might be “Title”, but it could be any field that has data to be used as a title. It can also specify several attribute names, the values of which will be concatenated to create a title. The same applies to the mapped attribute(s) for the body.

Selection of Content to Classify

As described above, content can be passed from the Semaphore NiFi processor to Classification Server from several sources: the FlowFile contents, a mapped “Body” attribute’s contents, or a mapped “URL” attribute accessed remotely. The processor chooses which to use as follows:

If present, the contents of the FlowFile will be used and a mapped Body or URL attribute ignored.
If no FlowFile content is present and a Body attribute is mapped in the configuration properties and has content, that is sent for classification.
Failing the prior two possibilities, if a URL attribute is mapped and has a value, this is sent to CS for classification.
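
The order of precedence above can be sketched in Python as follows. This is only an illustration of the described rules, not the processor's implementation. The function name, the dict used for FlowFile attributes, and the attribute names “body” and “url” are assumptions made for the example.

```python
def select_content(flowfile_content, attributes, body_attribute=None, url_attribute=None):
    """Return (source, value) following the documented order of precedence."""
    # 1. FlowFile content wins; mapped Body and URL attributes are ignored.
    if flowfile_content:
        return ("content", flowfile_content)
    # 2. Otherwise use the mapped Body attribute if it is configured and non-empty.
    if body_attribute and attributes.get(body_attribute):
        return ("body", attributes[body_attribute])
    # 3. Otherwise use the mapped URL attribute if it is configured and non-empty.
    if url_attribute and attributes.get(url_attribute):
        return ("url", attributes[url_attribute])
    # Nothing to classify.
    return (None, None)


attrs = {"body": "Quarterly results text", "url": "http://example.com/report"}
print(select_content("", attrs, body_attribute="body", url_attribute="url"))
```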

Title and Bodies

The selection of content to classify (described above) is also configurable to use multiple source attributes concatenated to create a single source. The “Source Body Attribute” property can be a semi-colon separated list of incoming attributes to concatenate (space-separated) into a single block of content. This doesn’t change the order of selection described above but it does permit several attributes (e.g., the columnar data from a database) to be brought together.

Similarly, the “Title Mapping Attribute” can be a single attribute name or it can be a semi-colon separated list of names to concatenate together to create a title.
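
A minimal Python sketch of this concatenation is shown below. The space separator for the body comes from the description above. Using the same separator for the title, and skipping attributes that are missing or empty, are assumptions made for the example. The function and attribute names are hypothetical.

```python
def join_mapped_attributes(mapping, attributes, separator=" "):
    """Concatenate the values of the attributes named in a semicolon-separated mapping."""
    names = [name.strip() for name in mapping.split(";") if name.strip()]
    # Assumption: attributes that are absent or empty contribute nothing.
    values = [attributes[name] for name in names if attributes.get(name)]
    return separator.join(values)


# Columns from a database row, carried as FlowFile attributes (hypothetical names).
row = {"product_name": "Widget", "description": "A small widget", "notes": ""}
print(join_mapped_attributes("product_name;description;notes", row))
```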

Results Formatting

Results from CS are returned in either RDF or Semaphore XML, depending on the “Classifier Output Format” configuration property setting (either XML or RDF). The returned Semaphore XML format follows this grammar:

Request XML DTD

RDF follows this grammar:

https://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar

Specifying a Language

By default, the Semaphore NiFi processor is set to allow Classification Server to attempt to determine the language of the source content. However, this is not always desirable. There are two properties that allow some control over language determination. Ideally, an incoming FlowFile will carry knowledge of the language of its source data.

The “Source Language Attribute” property indicates the name of an attribute of the incoming FlowFile that will contain an ISO code for the language. This enables any source content to carry its own language specification.
The “Default Language” property sets the ISO code for the language (e.g., “en”) to be used when the incoming FlowFile has no source language attribute or if that is empty.
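
The interaction of the two properties can be sketched in Python as follows. This is an illustration of the behavior described above rather than the processor's code. The function name and the attribute name “lang” are hypothetical, and returning None stands in for letting Classification Server determine the language.

```python
def resolve_language(attributes, source_language_attribute=None, default_language=None):
    """Pick the language to send to CS, or None to let CS detect it."""
    if source_language_attribute:
        value = (attributes.get(source_language_attribute) or "").strip()
        if value:
            return value  # the FlowFile carries its own language
    # Attribute not mapped, missing, or empty: fall back to the default, if any.
    return default_language or None


print(resolve_language({"lang": "fr"}, "lang", "en"))
print(resolve_language({"lang": ""}, "lang", "en"))
print(resolve_language({}, None, None))
```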

It is important to remember that if a language is specified and the corresponding language pack is not installed in Semaphore, Classification Server will return an error.

Creating Custom IDs

Occasionally, it is necessary to create a custom ID for the classified results. This could arise when separating the source content from the classification results. Older versions of the Semaphore NiFi processor supported creating custom IDs, but this feature has been deprecated in favor of existing NiFi processors that update attributes. An example configuration of the UpdateAttribute processor to modify the source filename attribute is shown below.

Creating an ID from the filename attribute

This keeps the file name but replaces the extension with “.rdf”.
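
The transformation itself is simple. The Python sketch below shows the equivalent logic so the intended result is clear. It is not the UpdateAttribute configuration, and the function name is hypothetical. In NiFi, an expression along the lines of ${filename:substringBeforeLast('.')}.rdf would be one plausible way to express it, though the exact expression used in the example configuration is an assumption here.

```python
def rdf_id_from_filename(filename):
    """Keep the file name but replace its last extension with '.rdf'."""
    stem, sep, _extension = filename.rpartition(".")
    # If there is no '.', keep the whole name and just append the new extension.
    return (stem if sep else filename) + ".rdf"


print(rdf_id_from_filename("annual-report.pdf"))
```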

Important: Using this approach requires familiarity with the NiFi expression language described here.

Errors and Error Handling

The NiFi Semaphore Classification Processor attempts to continue a flow irrespective of any errors. As discussed above, there are three relationships describing the flow of output from the processor: “success”, “original”, and “failure”. The original or source FlowFile is always passed to the “original” relationship. A successful classification process sends results along the “success” relationship. Any caught errors send a clone of the original FlowFile along the “failure” relationship, but with a message added to the FlowFile attribute “CS-ERROR”.

Success and failure can have somewhat ambiguous meanings. For this purpose, an error is specifically limited to an exception in the processing. For example, if a FlowFile specifies the CZE language as its source and the Czech language pack is not installed with CS, an exception will be thrown. There are many of these scenarios and the processor handles them all identically: exceptions are trapped, a note is made in the NiFi logs, a clone of the source FlowFile is created, the exception message is copied to the CS-ERROR attribute of the clone, and the clone is passed to the failure relationship. In all other scenarios, the results of CS are used to create a new FlowFile (as described above) and the new FlowFile is passed to the success relationship. Ambiguity in this decision process arises when CS returns successfully with no useful classification results.
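
The routing rules can be summarized with the following Python sketch. It models a FlowFile as a plain dict and the call to Classification Server as a function argument. Both are assumptions made for illustration, and none of the names below belong to the NiFi or Semaphore APIs. Copying the source attributes to the success FlowFile reflects the default copy.attributes=true setting.

```python
def route(flowfile, classify):
    """Return a dict mapping relationship names to the FlowFiles sent along them."""
    # The source FlowFile always goes to "original".
    routed = {"original": [flowfile]}
    try:
        results = classify(flowfile)
    except Exception as exc:
        # Any exception: clone the source and record the message in CS-ERROR.
        clone = {"content": flowfile["content"], "attributes": dict(flowfile["attributes"])}
        clone["attributes"]["CS-ERROR"] = str(exc)
        routed["failure"] = [clone]
    else:
        # No exception: the classification results become the new FlowFile content.
        routed["success"] = [{"content": results, "attributes": dict(flowfile["attributes"])}]
    return routed


def failing_classifier(flowfile):
    # Stand-in for CS rejecting a language whose pack is not installed.
    raise RuntimeError("language pack not installed")


source = {"content": "some text", "attributes": {"filename": "a.pdf"}}
print(sorted(route(source, failing_classifier)))
```

Note that a classifier returning an empty result still lands on “success” in this sketch, which is the ambiguity described above.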

Data Provenance

Important: This feature is only available in version 1.9.2 and later.

The Semaphore Classifier processor emits provenance events when content has been successfully classified. These events will appear in the provenance log as follows:

Provenance Events List

Clicking on the information icon shows the details of the event. The details also contain an “elapsed time” field representing the time the classification process took.

Provenance Event Detail