The Default Conversion Option

Last Updated: April 14, 2026

MarkLogic Server
Version 12.0
Documentation

This chapter describes the Default Conversion Option, which is designed to convert HTML files to XHTML and DocBook. It includes the following sections:

Installing the Conversion Pipelines and Framework
Simple Drag-and-Drop Conversion
What the Conversion Pipeline Generates
Understanding and Using the Default Conversion Option
Modifying the Default Conversion Option

Installing the Conversion Pipelines and Framework

Installing the Default Conversion Option installs the Content Processing Framework for your database, sets up the domain for the pipeline, loads the needed triggers into the triggers database, and performs other pipeline initialization tasks. You must install the Default Conversion Option in each database in which you plan to use conversion.

Complete the following steps to install the Default Conversion Option into a database.

If you have not already done so, Install MarkLogic Server.
If you have not already done so, Install MarkLogic Converters.
Open the Admin Interface to the database page for the database in which you want to install the Default Conversion Option. For example, if you want to install the pipeline into the Documents database, open the database page for the Documents database.

Note:

MarkLogic recommends creating a new database to use when testing the Default Conversion Option.
On the Database configuration page, select a Triggers Database to use with your database (for example, Triggers). You can use any database for the triggers database. It can be the same database as the one you are configuring (for example, you can set the Documents database as the triggers database for the Documents database) or it can be a different database (for example, the Triggers database created as part of the installation process).
Click OK to apply the changes to the database configuration.
In the left tree menu, click Content Processing under the database to which you want to install the Default Conversion Option. The content processing summary page appears.
On the content processing summary page, click the Install tab. The Content Processing Resource installation page appears.
On the Content Processing Resource installation page, set Enable Conversion to true and click Install. If Enable Conversion is set to false, you install only the Content Processing Framework, not the Default Conversion Option.
Click OK to confirm the installation of content processing in your database.
When the installation is complete, the Content Processing Summary page appears. It shows the content processing components installed in your database.

The Default Conversion Option is now installed for the database. The default domain determines which documents are processed; by default, its document scope covers any document in the database whose URI starts with a slash (/). To apply the Default Conversion Option to a different set of documents, modify the domain settings: on the Content Processing Summary page, click the default domain for your database (for example, Default Documents if you chose the Documents database) and make the needed modifications. For details on domains, see Understanding and Using Domains.
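
To get a rough sense of which documents fall inside the default scope, a query along the following lines lists a few URIs under the root directory. This is only a sketch: it assumes the default scope is equivalent to the directory / with infinite depth, and that the query is run against the content database (for example, Documents).

```xquery
xquery version "1.0-ml";

(: List up to ten document URIs under the root directory "/".
   Assumption: this approximates the default domain scope, which
   covers documents whose URI starts with a slash. :)
for $doc in fn:subsequence(xdmp:directory("/", "infinity"), 1, 10)
return xdmp:node-uri($doc)
```

If the query returns nothing, either the database is empty or its documents have URIs that do not start with a slash, in which case the default domain would not pick them up.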

Simple Drag-and-Drop Conversion

To try out the pipeline, you need to load some HTML files into the database. You can load the documents using any method you like. This section describes an easy way to load documents using a WebDAV server and client. You can then use this configuration to test document conversion with the Default Conversion Option. For more information on WebDAV servers, see WebDAV Servers in Administrate MarkLogic Server.

Complete the following steps to load and process documents in a database.

Create a WebDAV server with root / that accesses the database in which you installed the Content Processing Pipeline.
1. In the Admin Interface, go to the Groups > Default > AppServers page.
2. Click the Create WebDAV tab.
3. Enter a server name (for example, CPF).
4. Enter / for the root.
5. Enter a port number (for example, 9999).
6. Select the database in which you installed the content processing pipeline (for example, Documents).
7. Click OK.
If you will not be logging into the WebDAV client as a privileged user, set up the needed security requirements for your WebDAV root directory and your WebDAV user. For a sample of how to set this up, see Set the Needed Permissions on the Root Directory.
Create a WebDAV client that accesses the WebDAV server you just created. For example, the following procedure applies to Windows XP; other versions of Windows or other WebDAV clients have slightly different procedures:
1. Double-click My Network Places from your desktop.
2. Double-click Add Network Place.
3. For the location of the network place, enter the address with the hostname and port number of the WebDAV server you created. For example, if your server is on port 9999 of the local machine, enter the following:
```
http://localhost:9999
```
4. Click Next.
5. If prompted, enter a username and password for your WebDAV server.
6. Enter a name for your WebDAV folder (for example, conversion).
7. Click Finish.
8. If prompted, enter the username and password for your WebDAV server.
Drag and drop files into the WebDAV folder. This loads the documents into the database.
After some time has passed, refresh the WebDAV folder (for example, View > Refresh). The time it takes to convert depends on the number, size, and complexity of the documents being converted. Small, simple documents take just a few seconds. Larger documents might take significantly longer.

The converted documents, as well as the original documents and any parts generated as part of the conversion, will appear in the WebDAV folder. If you have large documents or if you load many documents into the database, the processing might continue for several minutes or longer.
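
As noted at the start of this section, WebDAV is only one way to load documents. As an alternative sketch, the following query loads a single HTML file with xdmp:document-load. The file path and the target URI are hypothetical, and the query is assumed to run against the database in which you installed the Default Conversion Option.

```xquery
xquery version "1.0-ml";

(: Load one HTML file from the server's file system into the database.
   Both the file path and the target URI are hypothetical. The URI
   starts with "/" so that it falls inside the default domain scope. :)
xdmp:document-load("/tmp/sample.html",
  <options xmlns="xdmp:document-load">
    <uri>/sample.html</uri>
  </options>)
```

The user who runs the query needs the same permissions to add and modify documents as a WebDAV user would.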

What the Conversion Pipeline Generates

After the conversion process is finished, for each document that you loaded, the Default Conversion Option produces the following:

The original document

An XHTML document (*.xhtml)

A simplified DocBook XML document (*.xml)

A directory (*_parts) containing various parts generated as part of the conversion process. The parts are typically any images that were in the original document, a cascading style sheet document (conv.css), and a document containing an analysis of the stylesheet (css.xml). PDF documents also include toc.xml, which is an analysis of the table of contents structure.

The generated XHTML and XML documents have a URI that includes the suffix of the original document. For example, a document called html.doc produces html_doc.xml and html_doc.xhtml.
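
The naming rule can be sketched as a small XQuery function. The helper local:converted-name is hypothetical and is not part of the conversion modules; it assumes that only the final dot of the original name is replaced with an underscore, which matches the html.doc example above but is not confirmed for names that contain several dots.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper, not part of the conversion modules: derives an
   output name by replacing the final "." of the original name with "_"
   and appending the new extension. :)
declare function local:converted-name(
  $original as xs:string,
  $extension as xs:string
) as xs:string
{
  fn:concat(fn:replace($original, "\.([^./]+)$", "_$1"), ".", $extension)
};

(
  local:converted-name("html.doc", "xml"),
  local:converted-name("html.doc", "xhtml")
)
```

For the input html.doc, the two calls return html_doc.xml and html_doc.xhtml.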

Understanding and Using the Default Conversion Option

The Default Conversion Option uses the components of the Content Processing Framework to create a unified conversion process that converts HTML files to well-structured XHTML and simplified DocBook format XML documents. This section provides some background on how the default conversion process works, and includes the following sections:

Components of the Default Conversion Option
Steps in the Conversion Process
Default Conversion Option States
Errors, Troubleshooting, Debugging, and Recovery

Note:

The MarkLogic Converters package may generate temporary files. These temporary files are not supported by encryption at rest.

Components of the Default Conversion Option

The Default Conversion Option includes the following components:

Status Change Handling Pipeline

HTML Pipeline

Supporting XQuery modules

The xdmp:tidy function built into MarkLogic Server

There are also supporting XQuery modules for the Default Conversion Option for the following:

Generic Conversion

DocBook Conversion

CSS Conversion

XHTML Conversion

These XQuery modules include the XQuery source code, so you can analyze them and use their functions in your own applications. The XQuery modules are installed into the following directory:

<install_dir>/Modules/MarkLogic/conversion

For details on these functions, see the MarkLogic XQuery and XSLT Function Reference.

Steps in the Conversion Process

The steps are defined in the following pipelines:

html-pipeline.xml

pipeline.xml

Generally, the conversion process performs the following tasks:

Check to see what kind of document it is.

Convert the document to XHTML based on its type.

Clean up the converted XHTML.

Extract the style information into a CSS document.

Transform the XHTML to infer the table of contents structure for the document.

Transform the XHTML to create a simplified DocBook structured format for the document.

Default Conversion Option States

The conversion states are defined in the pipelines and are stored in the properties document for each document. The conversion process includes the following states:

http://marklogic.com/states/initial

http://marklogic.com/states/updated

http://marklogic.com/states/xhtml

http://marklogic.com/states/cleaned-xhtml

http://marklogic.com/states/structured-xhtml

http://marklogic.com/states/enhanced-xhtml

http://marklogic.com/states/analyzed-styles

http://marklogic.com/states/final
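
Because the state is stored in each document's properties document, you can check how far a document has progressed by reading that property. The following is a minimal sketch: the URI is hypothetical, and it assumes the state is held in a state property in the http://marklogic.com/cpf namespace.

```xquery
xquery version "1.0-ml";

(: Return the CPF state property of one document.
   The URI is hypothetical; substitute a document you loaded.
   Assumption: the state is stored in a "state" property in the
   http://marklogic.com/cpf namespace. :)
xdmp:document-get-properties(
  "/sample.html",
  fn:QName("http://marklogic.com/cpf", "state"))
```

A value of http://marklogic.com/states/final would indicate that processing has finished for that document; any earlier state suggests that conversion is still in progress or has stopped partway.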

Errors, Troubleshooting, Debugging, and Recovery

This section describes the following error and troubleshooting situations you might encounter with the Default Conversion Option:

Set the Needed Permissions on the Root Directory
Default or Inherited Collections and Permissions
Enable Debugging Capabilities
Create Your Own Error Handling Pipeline

Set the Needed Permissions on the Root Directory

When you add documents to the database for conversion, the user who adds the documents must have the needed permissions to add and modify documents. If you are using a WebDAV server to drag and drop documents into the database, the root directory of the WebDAV server must also have the needed permissions.

One simple way to accomplish these security requirements is to do the following:

Create a URI privilege for the URI that is configured as the root directory of your WebDAV server.

Create a role that has the URI privilege and has default permissions of read, insert, and update for the role.

Set the permissions on the WebDAV root directory for the role you created. For example, if the role you created is named webdav, and the root directory has the URI /webdav/root/, run a query (as a privileged user) similar to the following:

xdmp:document-set-permissions("/webdav/root/",
  ( xdmp:permission("webdav", "read"),
    xdmp:permission("webdav", "insert"),
    xdmp:permission("webdav", "update") ) )

You can check the permissions with the following query:

xdmp:document-get-permissions("/webdav/root/")

Grant the new role (webdav in the example above) to the user who accesses the WebDAV server.
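
The role, privilege, and user steps above can also be scripted with the security library module instead of the Admin Interface. The following is a sketch only: the privilege name webdav-root and the user name webdav-user are hypothetical, the role and URI follow the example above, and the query is assumed to run as a privileged user against the Security database. The role is created in its own statement so that it exists before the later calls refer to it by name.

```xquery
xquery version "1.0-ml";
import module namespace sec = "http://marklogic.com/xdmp/security"
  at "/MarkLogic/security.xqy";

(: Statement 1: create the role (no inherited roles, permissions,
   or collections yet). :)
sec:create-role("webdav", "Hypothetical role for WebDAV loading", (), (), ())
;

xquery version "1.0-ml";
import module namespace sec = "http://marklogic.com/xdmp/security"
  at "/MarkLogic/security.xqy";

(: Statement 2: runs as a separate transaction, after the role exists. :)
(
  (: URI privilege for the WebDAV root, granted to the role.
     The privilege name is hypothetical. :)
  sec:create-privilege("webdav-root", "/webdav/root/", "uri", "webdav"),

  (: Default permissions of read, insert, and update for the role. :)
  sec:role-set-default-permissions("webdav",
    ( xdmp:permission("webdav", "read"),
      xdmp:permission("webdav", "insert"),
      xdmp:permission("webdav", "update") )),

  (: Grant the role to the (hypothetical) WebDAV user. :)
  sec:user-add-roles("webdav-user", "webdav")
)
```

Setting the permissions on the root directory itself still requires the xdmp:document-set-permissions query shown above, run against the content database.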

Default or Inherited Collections and Permissions

If you are using a collection in the domain to specify which documents to convert, the new documents created by the conversion process must be created as part of the collection specified in the domain. You can do this in the following ways:

Set the inherit collections option at the database level to true and make sure the parent directory belongs to the collection.

The user who runs the Default Conversion Option (that is, the user who originally creates the documents to be converted, whether by dragging and dropping into a WebDAV folder or by some other means) can have the collection specified as a default collection, either directly or through a role assigned to the user.

You can explicitly set the collection on a document (for example, in your XQuery module code or through XDBC).

Otherwise, only the first phase of conversion occurs, because documents created during the conversion process are not part of the collection specified in the domain. Similarly, either assign the appropriate default permissions to the user (or to a role assigned to the user) or set permissions to inherit at the database level.
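
The third approach in the list above, explicitly setting the collection, might look like the following in XQuery. Both the document URI and the collection name are hypothetical; use the collection that your domain scope actually specifies.

```xquery
xquery version "1.0-ml";

(: Add one document to the collection named in the domain scope.
   Both the URI and the collection name are hypothetical. :)
xdmp:document-add-collections("/sample_html.xhtml", "conversion-input")
```

This approach only helps for documents your own code creates or updates; for documents created by the conversion modules, inherited or default collections are likely the more practical choice.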

For information on inherited collections and inherited permissions, see Administer MarkLogic Server. For information on permissions, see Secure MarkLogic Server.

Enable Debugging Capabilities

If you need debugging capabilities, you can set trace events on the server for the Content Processing Framework. For details, see Debugging and Recovering from Error Conditions.

Create Your Own Error Handling Pipeline

If you have special error handling needs, you can always extend the Default Conversion Option application by adding your own custom error handling pipeline. For details on pipelines and creating custom code, see Understanding and Using Pipelines and Using the Framework to Create Custom Applications.

Modifying the Default Conversion Option

This section describes ways to modify the Default Conversion Option, and includes the following subsections:

Copy Defaults and Modify
Modifying the Options for Default Conversion

Copy Defaults and Modify

All of the XQuery code and all of the pipelines for the Default Conversion Option are installed with MarkLogic Server. The pipelines are installed in the following directory:

<install_dir>/Installer

The XQuery modules are installed under the Modules directory in the following location:

<install_dir>/Modules/MarkLogic/conversion/actions

You can create your own pipelines by copying and modifying the Default Conversion Option code to suit your needs. Make sure you understand domains, pipelines, the concepts of the Content Processing Framework, and the rules for XQuery modules in content processing applications before modifying the pipelines. For information on these topics, see the rest of this document.

You can extend the pipelines in many ways. For example, you can add phases to the pipeline to do your own processing, add email notification to your application, or add entity extraction from a semantic tagging service. For information on creating custom applications, see Using the Framework to Create Custom Applications.

Modifying the Options for Default Conversion

The Default Conversion Option uses the built-in function xdmp:tidy(). The pipelines reference various XQuery modules that call this function. This function takes an options node to control its behavior. The default options work well with a large variety of documents, but you can customize them to your documents' specific needs.

Here is an example of a single action on an XHTML pipeline:

<action>
   <module>/MarkLogic/conversion/actions/convert-html-action.xqy</module>
   <options xmlns="/MarkLogic/conversion/actions/convert-html-action.xqy">
      <destination-root/>
      <destination-collection/>
   </options>
</action>

action: Specifies the operation to be performed if the condition is satisfied.

module: Specifies the module that implements the action.

The options node has these options:

destination-root: Specifies an alternate directory URI where the output of the conversion processing is saved.
destination-collection: Specifies an alternate collection to which to add the output documents.

These options are passed to any xdmp:tidy() call within the module.
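
To see how xdmp:tidy() responds to an options node outside of a pipeline, you can call it directly on a small sample. The following sketch is illustrative only: the input string is made up, and clean is used as an example of an HTML Tidy configuration option rather than as a setting the default pipelines are known to use.

```xquery
xquery version "1.0-ml";

(: Tidy a small, deliberately malformed HTML string.
   xdmp:tidy returns a status node followed by the cleaned document,
   so the [2] predicate selects the cleaned XHTML. :)
xdmp:tidy(
  "<p>Unclosed <b>bold text</p>",
  <options xmlns="xdmp:tidy">
    <clean>yes</clean>
  </options>)[2]
```

Dropping the [2] predicate also returns the status node, which reports any warnings or errors tidy raised while cleaning the input.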