Disk Storage Considerations

Last Updated: April 14, 2026

MarkLogic Server
Version 10.0
Documentation

This chapter describes how disk storage can affect the performance of MarkLogic Server and covers some of the storage options available for forests. It includes the following sections:

Disk Storage and MarkLogic Server
Fast Data Directory on Forests
Large Data Directory on Forests
HDFS, S3, and Blob Storage on Forests
Windows Shared Disk Registry Settings and Permissions

Disk Storage and MarkLogic Server

MarkLogic Server applications can be very disk-intensive. It is therefore important to size your hardware appropriately for your workload and performance requirements. Disk storage performance is a complicated topic; many factors can influence it, including disk controllers, network latency, the speed and quality of the disks, and other storage technologies such as storage area networks (SANs) and solid-state drives (SSDs). As with most performance issues, there are price/performance trade-offs to consider.

For example, SSDs are quite expensive compared with rotating drives. Conversely, HDFS (Hadoop Distributed Filesystem) or object storage services like Amazon S3 (Simple Storage Service) and Azure Blob can be quite inexpensive but might not offer all of the speed of conventional disk systems.

Fast Data Directory on Forests

In the forest configuration for each forest, you can configure a Fast Data Directory. The Fast Data Directory is designed for fast filesystems such as SSDs with built-in disk controllers. The Fast Data Directory stores the forest journals and as many stands as will fit onto the filesystem; if the forest never grows beyond the size of the Fast Data Directory, then the entire forest will be stored in that directory. If there are multiple forests on the same host that point to the same Fast Data Directory, MarkLogic Server divides the space equally between the different forests.

When the Fast Data Directory begins to approach its capacity, MarkLogic Server starts to put data in the regular Data Directory during periodic merges. By specifying a Fast Data Directory, you can get much of the advantage of fast disk hardware while buying only a relatively small SSD (or other fast disk system). For example, consider an 8-core MarkLogic Server d-host that hosts 4 forests. Suppose the host has a good-quality, commodity, server-class rotating disk system with many magnetic disk spindles (for example, 6 disks in some RAID configuration) providing 2 terabytes of storage, plus a 250 gigabyte SSD (for example, a PCI I/O accelerator card) for the Fast Data Directory. This configuration delivers a significant amount of the benefit of SSD storage while keeping the cost down, because the rotating storage is several times less expensive than the SSD storage. In this scenario, each of the 4 forests could use up to 1/4 of the size of the SSD, or about 62.5 GB. Once a forest grows close to that limit, the Data Directory on the rotating storage is used.
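
The per-forest figure in this scenario is plain division of the Fast Data Directory size by the number of forests that share it. The following is a minimal sketch of that arithmetic only; the function name is hypothetical, and MarkLogic Server performs the actual allocation internally.

```shell
# Hypothetical helper (not a MarkLogic command): equal share of a Fast Data
# Directory per forest, in GB. $1 = filesystem size in GB, $2 = forest count.
fast_dir_share_gb() {
  awk -v size="$1" -v forests="$2" 'BEGIN { printf "%.1f\n", size / forests }'
}

# Figures from the scenario above: a 250 GB SSD shared by 4 forests.
fast_dir_share_gb 250 4
```

With the scenario's figures, the call prints 62.5, which matches the per-forest limit quoted above.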

Large Data Directory on Forests

Just as you might want a different class of disk for the Fast Data Directory, you might also want a different class of disk for the Large Data Directory. The Large Data Directory stores binary documents that are larger than the Large Size Threshold specified in the database configuration. This filesystem is typically a very large filesystem, and it may use a different class of disk than your regular filesystem (or it may just be on a different set of the same disks). For more details about binary documents, see Working With Binary Documents in the Application Developer's Guide.

HDFS, S3, and Blob Storage on Forests

HDFS (Hadoop Distributed Filesystem), Amazon S3 (Simple Storage Service), and Azure Blob storage represent three approaches to large distributed filesystems. All three can be used to store MarkLogic Server forest data. This section describes considerations for using HDFS, S3, and Blob for storing forest data in MarkLogic Server:

HDFS Storage
S3 Storage
Blob Storage

Both HDFS and object storage services like S3 and Blob can be very useful when implementing a tiered storage solution. For details on tiered storage, see Tiered Storage in Administrate MarkLogic Server.

HDFS Storage

[Hadoop support was deprecated in MarkLogic 10.0-3 and removed in MarkLogic 10.0-5]

[HDFS support was deprecated in MarkLogic 10.0-4 and removed in MarkLogic 11.0.0]

HDFS is a storage solution that uses Hadoop to manage a distributed filesystem. Hadoop has tools to specify how many copies of each file are replicated on how many different servers. HDFS gives you a high degree of control over your filesystem, as you can choose the disks to use, the computers to use, as well as configuration settings such as number of copies to replicate.

MarkLogic Server can use Kerberos Secured HDFS as a file system on Linux platforms. To configure Kerberos authentication for Secured HDFS, set these environment variables in your /etc/marklogic.conf file:

Environment Variable	Value
`MARKLOGIC_KEYTAB`	Path to the Kerberos client keytab file.
`MARKLOGIC_PRINCIPAL`	Kerberos Principal to be authenticated.
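
As an illustration, the two variables might be set in /etc/marklogic.conf as shown in this sketch. The keytab path and principal are placeholders, not defaults, and the `export` form is assumed from the way this chapter sets other variables in the same file.

```shell
# /etc/marklogic.conf (fragment)
# Both values are placeholders; substitute your own keytab path and principal.
export MARKLOGIC_KEYTAB=/etc/security/keytabs/marklogic.keytab
export MARKLOGIC_PRINCIPAL=marklogic@EXAMPLE.COM
```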

Note:

Configuring Kerberos for external security is unrelated to securing your HDFS using Kerberos. Using a single Kerberos instance for both features is supported.
If you are using the same instance of Kerberos for both external security and HDFS, then use separate credentials for each.

Note:

When using rolling upgrades, deploy your credential keytab files after the cluster has been fully upgraded to MarkLogic Server 10. Otherwise, the behavior of accessing secure HDFS will be undefined.

HDFS storage is supported with MarkLogic Server on these HDFS platforms:

Cloudera CDH version 5.8

Hortonworks HDP version 2.6

Internally, MarkLogic Server uses JNI to access HDFS. When you specify an HDFS path for one of the data directories, MarkLogic Server will write the forest data directly to HDFS according to the path specification.

When you set up an HDFS path as a forest directory, the path must be readable and writable by the user under which the MarkLogic Server process runs.

Because you can set up HDFS as a very large shared filesystem, it can be good not only for forest data, but also as a destination for database backups.

An HDFS path has this form:

hdfs://<machine-name>:<port>/directory

For example, this path would be to an HDFS filesystem accessed on a machine named raymond.marklogic.com on port 12345:

hdfs://raymond.marklogic.com:12345/directory

Each MarkLogic Server host that uses HDFS for forest storage requires access to these components:

The Oracle/Sun Java JDK (or an Oracle/Sun JRE that includes JNI)

Hadoop HDFS client JAR files

Your Hadoop HDFS configuration files

These HDFS configuration property settings are required:

dfs.support.append: true. This is the default value.

dfs.namenode.accesstime.precision: 1

The remainder of this section describes how to configure your hosts so that MarkLogic Server can find these components.

For details on the supported Java versions and how MarkLogic Server locates a JRE, see Java Virtual Machine Requirements in Install MarkLogic Server.

Though MarkLogic Server does not ship with HDFS client libraries, you can download client library bundles from http://developer.marklogic.com/products/hadoop.

Follow these steps to make the bundled libraries and configuration files available to MarkLogic Server. You must follow these steps on each MarkLogic Server host that uses HDFS for forest storage:

Download the Hadoop client bundle that corresponds to your Hadoop distribution from http://developer.marklogic.com/products/hadoop.
Unpack the bundle to one of these locations: /usr, /opt, /space. For example, if you download the HDP bundle for MarkLogic 9.0-7 to /opt, then these commands unpack the bundle to /opt.
```
cd /opt
gunzip hadoop-hdfs-hdp-9.0-7.tar.gz
tar xf hadoop-hdfs-hdp-9.0-7.tar
```
The bundle unpacks to a directory named hadoop, so the above commands create /opt/hadoop/. The version portion of your bundle download filename may differ.
Make your Hadoop HDFS configuration files available under /etc/hadoop/conf/. You must include at least your log4j.properties configuration file in this location.
Ensure the libraries and config files are readable by MarkLogic Server.
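
One way to check the last step is to list any files that the MarkLogic Server user cannot read. The following is a minimal sketch, not a MarkLogic tool: the function name is hypothetical, it looks only at regular files, and it is meaningful only when run as the user under which the MarkLogic Server process runs.

```shell
# Print every regular file under directory $1 that the current user cannot read.
# Run this as the user that runs the MarkLogic Server process.
list_unreadable() {
  find "$1" -type f | while IFS= read -r file; do
    [ -r "$file" ] || printf '%s\n' "$file"
  done
}

# Example invocations, using the locations named in the steps above:
#   list_unreadable /opt/hadoop
#   list_unreadable /etc/hadoop/conf
#   [ -r /etc/hadoop/conf/log4j.properties ] || echo "log4j.properties missing or unreadable"
```

Empty output means every regular file under the directory is readable by the current user.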

For more information on Hadoop and HDFS, see the Apache Hadoop documentation.

S3 Storage

S3 is a cloud-based storage solution from Amazon. S3 is like a filesystem, but you access it via HTTP. You can put an S3 path into any of the data directory specifications on a forest, and MarkLogic Server then uses HTTP to write to S3 for that directory. For more details about Amazon S3, see the Amazon web site http://aws.amazon.com/s3/. This section describes S3 usage in MarkLogic and includes the following parts:

S3 and MarkLogic
Entering Your S3 Credentials for a MarkLogic Cluster

S3 and MarkLogic

Storage on S3 has an eventual consistency property, meaning that written data might not be available immediately for reading, but it will be available at some point. Because of this, MarkLogic does not create journals in S3 data directories. MarkLogic therefore recommends that you use S3 only for backups and for read-only forests; otherwise, you risk data loss. Read-only forests do not need journals.

When you set up an S3 path as a forest directory, the path must be readable and writable by the user under which the MarkLogic Server process runs. Typically, this means you must set Upload/Delete, View Permissions, and Edit Permissions on the AWS S3 bucket. This is true for both forest paths and backup paths.

Because S3 is a very large shared filesystem, it can be good not only for forest data, but also as a destination for database backups.

To specify an S3 path in MarkLogic, use a URL of the following form:

s3://<bucket-name>/<path-to-location>

For example, the following path refers to an S3 filesystem with a bucket named my-bucket and a path named my-directory:

s3://my-bucket/my-directory

Important:

Amazon has other ways to set up S3 URLs, but use the form above to specify the S3 paths in MarkLogic; for more information on S3, see the Amazon documentation.
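
Before entering a path in a forest or backup configuration, a script can confirm that it uses the s3:// form rather than another S3 URL style. The following is a minimal sketch with a hypothetical function name; it checks only the general shape s3://<bucket-name>/<path-to-location> and does not contact S3 or validate bucket naming rules.

```shell
# Hypothetical helper: succeed only for paths shaped like
# s3://<bucket-name>/<path-to-location>.
is_s3_forest_path() {
  case "$1" in
    s3://?*/?*) return 0 ;;
    *)          return 1 ;;
  esac
}

is_s3_forest_path "s3://my-bucket/my-directory" && echo "ok"
is_s3_forest_path "https://my-bucket.s3.amazonaws.com/my-directory" || echo "not in s3:// form"
```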

Entering Your S3 Credentials for a MarkLogic Cluster

S3 requires authentication with the following S3 credentials:

AWS Access Key

AWS Secret Key

The S3 credentials for a MarkLogic cluster are stored in the security database for the cluster. You can have only one set of S3 credentials per cluster. With those credentials, you can access any S3 paths that your S3 security settings allow them to access. S3 access control is flexible: any S3 account can be set up to allow access from any other account. If you want the credentials you have set up in MarkLogic to access S3 paths owned by other S3 users, those users must grant access to those paths to the AWS Access Key set up in your MarkLogic cluster.

To set up the AWS credentials for a cluster, enter the keys in the Admin Interface under Security > Credentials. You can also set up the keys programmatically using the following Security API functions:

sec:credentials-get-aws

sec:credentials-set-aws

The credentials are stored in the Security database. Therefore, you cannot use S3 as the forest storage for a security database.

Blob Storage

Note:

MarkLogic Server supports Azure Blob Storage only for backup and read-only forests. It does not support regular or replica forests.

MarkLogic Server supports Azure Blob Storage for backup and read-only forests as part of Tiered Storage. This support allows you to leverage Azure's scalable and durable cloud storage for archival and low-access forest data.

Overview

Azure Blob Storage offers advantages such as append support and strong consistency, making it suitable for journal files and transactional operations.

Azure pathnames in MarkLogic Server follow this format:

azure://<container>/<directory>/<file>

Where

<container> is the Azure Blob container.
<directory> is the folder path within the container.
<file> is the specific blob object.
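
To make the parts of the format concrete, the following sketch splits an Azure path into its container and the remainder of the path using shell parameter expansion. The function names and the example file name myfile are hypothetical and are not part of MarkLogic Server.

```shell
# Hypothetical helpers for paths of the form azure://<container>/<directory>/<file>.
azure_container() {
  rest=${1#azure://}            # strip the scheme prefix
  printf '%s\n' "${rest%%/*}"   # keep everything before the first slash
}
azure_blob_path() {
  rest=${1#azure://}
  printf '%s\n' "${rest#*/}"    # keep everything after the first slash
}

azure_container "azure://mycontainer/mydirectory/myfile"
azure_blob_path "azure://mycontainer/mydirectory/myfile"
```

For this input, the first call prints mycontainer and the second prints mydirectory/myfile.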

Requirements

To use Azure Blob Storage with MarkLogic Server forests, you need these items:

An Azure Storage Account
At least one Blob Container
Access credentials (either through access keys or VM identity)

Configuration Steps

Create Azure Storage Resources:
1. Log into the Azure Portal.
2. Create a Storage Account.
3. Within the account, create a Blob Container.

Configure Credentials in MarkLogic Server in one of these ways:

Through the MarkLogic Server Admin Interface

Follow these steps:
1. Navigate to Security > Credentials.
2. Enter your Azure Storage Account name into Azure Storage Account.
3. Enter your Access Key into Azure Storage Key.
  You can find the Access Key under Access Keys in the Azure Storage Account settings.
4. Click OK.

Through MarkLogic Server Environment Variables (Linux)

Add these environment variables to /etc/marklogic.conf:
- export MARKLOGIC_AZURE_STORAGE_ACCOUNT=<your-storage-account>
- export MARKLOGIC_AZURE_STORAGE_KEY=<your-access-key>

Through MarkLogic Server Functions

Use these functions to set and retrieve your credentials:
- sec.credentialsSetAzure() or sec:credentials-set-azure()
- sec.credentialsGetAzure() or sec:credentials-get-azure()

Note:
- MarkLogic Server stores Azure Storage Access Keys encrypted in the security database.
- Treat Azure Storage Access Keys like passwords.
- Instead of using credentials, you can use VM Identity for access. See Azure documentation on Managed Identity.

Set the Forest Data Directory.
When creating or configuring a forest, specify the data directory using this Azure path format:

azure://mycontainer/mydirectory

Enable Tiered Storage.
Use Tiered Storage to assign forests to Azure Blob Storage for archival or read-only access.

Forests on Azure Blob can include journals.

Proxy Configuration (Optional)

There are two ways to route Azure Blob Storage access through a proxy:

Configure the proxy in the Admin Interface under Groups > (Choose Group) > Configure > Azure Storage Proxy, or
Set the MARKLOGIC_AZURE_STORAGE_PROXY environment variable in /etc/marklogic.conf.

Windows Shared Disk Registry Settings and Permissions

If you are using remote machine file paths on Windows (for example, a path like \\machine-name\dir, where machine-name is the name of the host and dir is the path it exposes as a share), then you must set the following registry settings to 0 (zero), as shown in https://technet.microsoft.com/en-us/library/ff686200.aspx:

FileInfoCacheLifetime

FileNotFoundCacheLifetime

DirectoryCacheLifetime

These DWORD values are under this registry key:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Lanmanworkstation\Parameters

Additionally, the directory path must have read and write permissions for the SYSTEM user, or for whichever user MarkLogic.exe runs as.
