Building Vector Queries

Last Updated: May 18, 2026

MarkLogic Server
Version 12.0
Documentation

Semantic search over vector embeddings generated by Large Language Models (LLMs) has surged in popularity with the rise of AI applications such as ChatGPT, Gemini, and Meta AI. A vector embedding is a many-dimensional array of floating-point numbers, usually generated by a pre-trained embedding model. Embeddings encode semantic information that represents the meaning and sentiment of the input. This makes them valuable for finding semantically similar documents in a dataset.

A vector embedding can be thought of as a point in a vector space, similar to a point on a map of the Earth. Where a point on a map has two dimensions, vector embeddings commonly have more than 1,000. For example, OpenAI's text-embedding-ada-002 generates a vector of 1536 dimensions, no matter the size of the input text. If two texts are processed with the same model and their vectors are close to each other in the vector space, the texts are considered similar, even if none of their keywords match. For example, the vectors for the words "matriarch", "president" and "queen" would be closer to one another than to the vectors for "car", "transcendence" and "John".

In general, you iterate over all your documents and generate a vector embedding for each document, or for any text field in your data. You can embed the entire text or split it into meaningful chunks, which helps when the text covers many concepts. The generated embeddings are then stored as part of your document.
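The chunking step mentioned above can be sketched in plain JavaScript. This is a minimal illustration and not part of MarkLogic: the helper name chunkText, the sentence-splitting rule, and the maxChars limit are all assumptions chosen for the example.

```javascript
'use strict';

// Hypothetical helper: group whole sentences into chunks of at most
// maxChars characters. A single sentence longer than maxChars becomes
// its own (oversized) chunk.
function chunkText(text, maxChars) {
  // Naive sentence split: a run of text followed by '.', '!' or '?'.
  // Abbreviations such as "B.C." will be split too; real data may need
  // a proper sentence tokenizer.
  const sentences = (text.match(/[^.!?]+[.!?]*/g) || [])
    .map(s => s.trim())
    .filter(s => s.length > 0);

  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    const candidate = current ? current + ' ' + sentence : sentence;
    if (candidate.length > maxChars && current) {
      // Adding this sentence would overflow: close the current chunk.
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const chunks = chunkText(
  'The war went on for ten years. The Greeks built a wooden horse. The Trojans took it inside.',
  65
);
// chunks holds two entries: the first two sentences together, then the third.
```

Each chunk would then be embedded separately and stored alongside the text it came from.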

At query time, you use the same model to generate the vector equivalent of your search query. You can then compare the query vector against the embeddings stored in your documents.

Scenario Setup

We will discuss the various features related to vectors along with an example. This allows us to better understand and visualize these concepts.

Sample Documents

We start by creating our documents. Run the following script in Query Console (QConsole).

'use strict';

declareUpdate();

xdmp.documentInsert(
  '/53068.json',
  {
    "envelope": {
      "instance": {
        "id": 53068,
        "title": "Trojan War",
        "text": "The Trojan War was one of the most important wars in the history of the late Bronze Age. It happened between the Trojans and the Greeks. It is mostly known through the \"Iliad,\" an epic poem written by the Ancient Greek poet Homer. carrying the dead Achilles, protected by Hermes (on the left) and Athena (on the right). Side 1 from an Attic black-figure neck-amphora, ~520-510 BC. The Louvre, Paris. In the middle 19th century scholars thought Troy and the war were mythical; that they never existed. However, Heinrich Schliemann discovered the site of ancient Troy, across the Aegean Sea on Asia Minor. The war may have taken place in the 12th century BC.",
        "url": "https://simple.wikipedia.org/wiki?curid=8126",
        "wiki_id": 8126,
        "views": 73.2369689941406,
        "paragraph_id": 0,
        "langs": 97
      }
    }
  }
)

xdmp.documentInsert(
  '/53069.json',
  {
    "envelope": {
      "instance": {
        "id": 53069,
        "title": "Trojan War",
        "text": "The Bronze Age was the first era known for humans to create tools and weapons made out of metal which replaced their stone versions. Beginning in about 3,300 B.C throughout the Middle East and parts of Asia, humans made many innovative advances throughout this age. Bronze Age civilizations interconnected through trade, war, migration, and innovation. However, the age ended quickly in 1200 B.C., when many civilizations fell at once. One of the most well known ancient civilizations to fall was the city of Troy. Branching off of the Mycenaean civilization and located in Histarlik, the northeast coast of Turkey, this ancient city dates back to over 2,700 years ago. Believed to be inhabited for almost 4,000 years beginning in 3,000 B.C., this civilization developed grand palaces by building on top of one city after another was destroyed. This formed into a human-made mound called a “tell”. Gert Jan van Wingaarden, in his book “Troy: City, Homer and Turkey,” writes, “ There is no one single Troy, there are at least 10, lying in layers on top of each other.” He says that the city of Troy contains many layers which is why archeological excavators have yet to reach the remains of the first settlement. Along with enhancing their city the Trojans developed their own writing system and occupied the Dardanelles, a narrow water channel connecting the Aegean Sea to the Black Sea. The writing system and water channel advanced the city of Troy into a powerful civilization which allowed for many allies to be made but also an arising rivalry.  According to Homer’s story, Illiad, the civilization was doomed to fall as long as the Trojan King’s son, Alexander, remained alive due to a curse placed upon him at birth by Zeus. The story of the Trojan war  concludes to why such an advanced and powerful civilization like Troy was able to be completely destroyed.",
        "url": "https://simple.wikipedia.org/wiki?curid=8126",
        "wiki_id": 8126,
        "views": 73.2369689941406,
        "paragraph_id": 1,
        "langs": 97
      }
    }
  }
)

xdmp.documentInsert(
  '/53073.json',
  {
    "envelope": {
      "instance": {
        "id": 53073,
        "title": "Trojan War",
        "text": "The war went on for ten years swinging to one side and then the other. Some of the leading fighters were Achilles, Paris, and Hector. The Greeks won by building a big wooden horse, which we now call the Trojan Horse. Greek soldiers hid inside the horse, and others put the horse on the shore and left in their boats. The Trojans saw the horse and thought that the Greeks had given up and left. They thought the horse was a gift in their honour. They dragged the horse into Troy and celebrated their victory. When night fell, the Greeks hiding inside the horse opened the city gates and set fire to the houses. The Greeks who had left in their boats had just pretended to leave, to trick the Trojans. They returned and won the war. The trick was thought up by Odysseus, King of the small island of Ithaca.",
        "url": "https://simple.wikipedia.org/wiki?curid=8126",
        "wiki_id": 8126,
        "views": 73.2369689941406,
        "paragraph_id": 5,
        "langs": 97
      }
    }
  }
)

Use of an Embedding Model

'use strict';

function getVectorEquivalent(
  inputText
) {
  let url = 'https://api.openai.com/v1/embeddings'
  let options = {
     "headers": {
       "Authorization" : "Bearer sk...EA",
       "Content-Type" : "application/json"
     },
     "data": xdmp.quote({
      "input": inputText,
      "model": "text-embedding-ada-002"
    })
  }

  return xdmp.httpPost(url, options).toObject()[1];
};

The script above is for demonstration purposes only. This function can be added as part of a module for reuse.

Note

The function above assumes that the response body contains only the vector value. Adjust the code to match the response structure of the REST API endpoint you use.
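As an illustration of adjusting for the response structure, the sketch below extracts the vector from a response body shaped like the one OpenAI's embeddings endpoint returns, where the vector sits under data[0].embedding. The helper name extractEmbedding and the sample body are hypothetical; other providers use different shapes.

```javascript
'use strict';

// Hypothetical helper. Assumed response shape:
// { "data": [ { "embedding": [ ...numbers ] } ], ... }
function extractEmbedding(responseBody) {
  const first = responseBody && Array.isArray(responseBody.data)
    ? responseBody.data[0]
    : undefined;
  if (!first || !Array.isArray(first.embedding)) {
    throw new Error('Unexpected embeddings response: no data[0].embedding array');
  }
  return first.embedding;
}

// Shortened, made-up response body for illustration only.
const sampleBody = {
  object: 'list',
  data: [{ object: 'embedding', index: 0, embedding: [0.12, -0.03, 0.4] }],
  model: 'text-embedding-ada-002'
};

const embedding = extractEmbedding(sampleBody);
```

Failing loudly on an unexpected shape avoids silently storing something that is not a vector in your documents.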

Text to Vector Conversion

We can use the function getVectorEquivalent() to get the vector equivalent of the text property, then store that vector as part of our document.

'use strict';
declareUpdate();

for (const doc of fn.doc()){
  const obj = doc.toObject();
  const text = obj.envelope.instance.text;
  // generate the embedding 
  obj.envelope.instance.textEmbedding = getVectorEquivalent(text);
  xdmp.nodeReplace(doc, obj)
}

The script above is for demonstration purposes only. Use Flux if a large data set needs to be processed.

Distance Metrics

Similarity and distance between vectors can be computed in different ways. For text search, the most prevalent approach is to use either "euclidean distance" or "cosine similarity".

Euclidean Distance calculates the straight-line distance between two points in Euclidean space, considering both magnitude and direction. This metric is often used in clustering algorithms to determine how far apart data points are.

Cosine Similarity measures the cosine of the angle between two vectors, focusing on the direction rather than the magnitude. It's particularly useful in text analysis and recommendation systems where the orientation of vectors matters more than their length.
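The difference between the two metrics can be seen with a small plain-JavaScript sketch. The function names here are illustrative helpers written for this example, not the MarkLogic vector library.

```javascript
'use strict';

// Dot product of two equal-length numeric arrays.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Straight-line distance: sensitive to both magnitude and direction.
function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

// Cosine of the angle between a and b: sensitive to direction only.
function cosineSimilarity(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const a = [1, 0];
const b = [2, 0]; // same direction as a, twice the length
const c = [0, 3]; // at a right angle to a

const euclideanAB = euclideanDistance(a, b); // 1: the lengths differ
const cosineAB = cosineSimilarity(a, b);     // 1: the directions are identical
const cosineAC = cosineSimilarity(a, c);     // 0: the vectors are orthogonal
```

Vectors a and b point the same way, so cosine similarity treats them as identical even though their Euclidean distance is non-zero.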

Vector Functions

Computing distance between two vectors requires support for vector arithmetic and other helper functions. MarkLogic 12 introduces the vector library.

Vector constructors convert strings, arrays, or sequences of numbers into vector datatypes.
Vector operators perform operations on vector values, such as cosine similarity or Euclidean distance.

See Vector Functions in Develop Server-Side Applications for more information.

Result Reranking Using Vectors

Traditional lexical search matches exact words against the text of your existing documents. MarkLogic already extends this capability with stemming, which matches terms by their root words. Results are then ranked by their relevance to the terms of the search query. This relevance reflects the presence and frequency of the search terms, subject to the defined constraints.

'use strict';
const op = require('/MarkLogic/optic');

let queryText = "How did Achilles bring down Troy?"
// we reduce the CTS search terms to those with length higher than 3.
let terms = queryText.toLowerCase().split(' ').filter(term => term.length > 3)
let ctsQuery = cts.wordQuery(terms)

op.fromSearchDocs(ctsQuery)
  .result()

The ranking of results of the above query will be based on the frequency of the terms achilles, bring, down, and troy, with no consideration to the semantic relevance of the actual text.

Vector distance can be used to re-rank the results. The following query uses op.vec.cosineDistance() to compute the cosine distance between the vector equivalent of our queryText and the vector embedding stored in each document.

'use strict';
const op = require('/MarkLogic/optic');

function getVectorEquivalent(...){ ... }

let queryText = "How did Achilles bring down Troy?"

// we reduce the CTS search terms to those with length higher than 3.
let terms = queryText.toLowerCase().split(' ').filter(term => term.length > 3)
let ctsQuery = cts.wordQuery(terms)
let queryVector = getVectorEquivalent(queryText)

let distanceCol = op.as(
    'distance',
    op.vec.cosineDistance(
      queryVector,
      op.vec.vector(
        op.xpath('doc', '/envelope/instance/textEmbedding')
      )
    )
  )

op.fromSearchDocs(ctsQuery)
  // The lower the value, the more similar they are.
  .orderBy(distanceCol)
  .result()

See Use of an Embedding Model for the definition of the function getVectorEquivalent(). It is important to use the same embedding model for your stored embeddings and your query text.

We could also take this further by combining the CTS score and the vector distance to acquire a hybrid score. This allows for a ranking that incorporates the advantages of both ranking systems.

'use strict';
const op = require('/MarkLogic/optic');

function getVectorEquivalent(...){ ... }

let queryText = "How did Achilles bring down Troy?"

// we reduce the CTS search terms to those with length higher than 3.
let terms = queryText.toLowerCase().split(' ').filter(term => term.length > 3)
let ctsQuery = cts.wordQuery(terms)
let queryVector = getVectorEquivalent(queryText)

let distanceCol = op.as(
    'distance',
    op.vec.cosineDistance(
      queryVector,
      op.vec.vector(
        op.xpath('doc', '/envelope/instance/textEmbedding')
      )
    )
  )

let hybridScoreCol = op.as(
    'hybridScore',
    op.vec.vectorScore(op.col('score'), op.col('distance'))
  )
  
op.fromSearchDocs(ctsQuery)
  .bind(distanceCol)
  .orderBy(op.desc(hybridScoreCol)) // higher values indicate higher relevance
  .result()

See Use of an Embedding Model for the definition of the function getVectorEquivalent(). It is important to use the same embedding model for your stored embeddings and your query text.

terms is a list of words from the original search query. The example uses a naive approach that removes "stop" words by dropping any word of three characters or fewer; the right approach will vary with the use case.
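One refinement of the naive term extraction is to split on punctuation as well as spaces, so that a trailing "?" does not end up inside a term. The sketch below is plain JavaScript; the helper name extractTerms and the minLength parameter are hypothetical.

```javascript
'use strict';

// Hypothetical helper: lower-case the query, split on anything that is not
// a letter or digit, and keep words of at least minLength characters.
function extractTerms(queryText, minLength) {
  return queryText
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter(term => term.length >= minLength);
}

const extracted = extractTerms('How did Achilles bring down Troy?', 4);
// extracted is ['achilles', 'bring', 'down', 'troy']
```

The resulting array can be passed to cts.wordQuery() in place of the terms variable used in the examples.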

The hybridScore rates documents using both the CTS score and the computed vector distance. Combining the two signals can improve the relevance of your search results.

Nearest Neighbor

Nearest neighbor search is a fundamental technique in machine learning and data analysis used to find the closest vectors in a dataset to a given query vector. There are two main approaches: k-Nearest Neighbors (kNN) and Approximate Nearest Neighbors (ANN).

k-Nearest Neighbors (kNN)

Accuracy: kNN is an exact method, meaning it finds the true nearest neighbors based on the distance metric used (e.g., Euclidean distance, cosine similarity).
Complexity: kNN can be computationally expensive, especially for large datasets, as it requires calculating the distance between the query vector and every vector in the dataset.

Depending on the vector size and the number of documents, the computation time can easily exceed configured timeouts resulting in failure.
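The cost of exact kNN is visible in a plain-JavaScript sketch: every stored vector is compared with the query vector before anything is sorted. The names below are illustrative, and cosine distance is taken here to mean 1 minus cosine similarity, which is an assumption about the metric rather than a statement about MarkLogic internals.

```javascript
'use strict';

function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Assumed definition: 1 - cosine similarity, so 0 means "same direction".
function cosineDistance(a, b) {
  return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Exact kNN: one distance computation per stored row, then sort and cut.
function bruteForceTopK(queryVector, rows, k) {
  return rows
    .map(row => ({ id: row.id, distance: cosineDistance(queryVector, row.vector) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

// Tiny made-up dataset of two-dimensional vectors.
const rows = [
  { id: 'a', vector: [1, 0] },
  { id: 'b', vector: [0, 1] },
  { id: 'c', vector: [1, 1] }
];

const nearest = bruteForceTopK([1, 0.1], rows, 2);
// nearest lists 'a' first, then 'c'.
```

With N stored vectors of D dimensions, each query performs on the order of N × D multiplications, which is what makes this approach expensive at scale.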

Approximate Nearest Neighbors (ANN)

Accuracy: This approach finds neighbors that are “close enough”, but not necessarily the “closest”.
Complexity: ANN is much faster and more scalable than kNN, making it suitable for large datasets and real-time applications.

An index is used when implementing ANN. The vector space is divided into smaller regions. The index is then used to identify the region that the provided vector belongs to. Distance computation is then performed for vectors that belong to the same region. This approximation greatly reduces the amount of computation needed to find the nearest neighbors.
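The region idea can be illustrated with a toy sketch in plain JavaScript. It is not MarkLogic's index structure, which this page does not describe in detail: the fixed centroids, the helper names, and the use of Euclidean distance are all assumptions made to keep the example small.

```javascript
'use strict';

function squaredDistance(a, b) {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

// Index of the centroid closest to the given vector.
function nearestIndex(vector, centroids) {
  let best = 0;
  for (let i = 1; i < centroids.length; i++) {
    if (squaredDistance(vector, centroids[i]) < squaredDistance(vector, centroids[best])) {
      best = i;
    }
  }
  return best;
}

// Build step: assign every row to the region of its closest centroid.
function buildRegionIndex(rows, centroids) {
  const regions = centroids.map(() => []);
  for (const row of rows) {
    regions[nearestIndex(row.vector, centroids)].push(row);
  }
  return { centroids, regions };
}

// Search step: compute distances only inside the query's own region.
function approximateTopK(index, queryVector, k) {
  const region = index.regions[nearestIndex(queryVector, index.centroids)];
  return region
    .map(row => ({ id: row.id, distance: Math.sqrt(squaredDistance(queryVector, row.vector)) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

// Made-up data: two regions around the centroids [0, 0] and [10, 10].
const index = buildRegionIndex(
  [
    { id: 'a', vector: [1, 1] },
    { id: 'b', vector: [2, 0] },
    { id: 'c', vector: [9, 9] },
    { id: 'd', vector: [6, 5] }
  ],
  [[0, 0], [10, 10]]
);

const approx = approximateTopK(index, [4.9, 4.9], 2);
// approx contains 'a' and 'b'. The true nearest neighbor 'd' sits in the
// other region and is never examined.
```

Only two of the four distances are computed for this query, which is the saving; the missed neighbor 'd' is the price of the approximation.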

On the other hand, this approximation may reduce accuracy by missing neighbors that belong to a different region. The fraction of the true nearest neighbors that the approximate search actually returns is called "recall". An 80% recall means that ANN returns 80% of the true nearest neighbors.
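Recall can be measured by running the same query both ways and comparing the two result lists. The sketch below is plain JavaScript with made-up document identifiers; the helper name recallAtK is hypothetical.

```javascript
'use strict';

// Fraction of the exact (kNN) results that also appear in the ANN results.
function recallAtK(approxIds, exactIds) {
  if (exactIds.length === 0) return 1; // nothing to miss
  const exact = new Set(exactIds);
  const hits = approxIds.filter(id => exact.has(id)).length;
  return hits / exactIds.length;
}

const recall = recallAtK(
  ['d1', 'd2', 'd3', 'd9', 'd4'], // approximate top 5
  ['d1', 'd2', 'd3', 'd4', 'd5']  // exact top 5
);
// recall is 0.8: four of the five true neighbors were found.
```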

Vector Index

The structure of a vector index is closely tied to the implementation of the ANN algorithm. For MarkLogic Server, the vector index is configured using a TDE template:

'use strict';

declareUpdate();
const tde = require("/MarkLogic/tde.xqy");

let template = {
  "template":{
    "description":"vector template",
    "context":"/envelope/instance",
    "rows":[
      {
        "schemaName":"acme",
        "viewName":"trojan",
        "columns":[
          {
            "name":"id",
            "scalarType":"string",
            "val":"id"
          }, {
            "name":"textEmbedding",
            "scalarType":"vector",
            "val":"vec:vector(textEmbedding)",
            "dimension": "1536"
          }
        ]
      }
    ]
  }
}

tde.templateInsert("/acme/trojan.json", xdmp.toJSON(template))

dimension declares the expected size of your vectors. See Best Practices and Considerations for the recommended and maximum values of this setting.

val uses the function vec:vector() to convert an embedding represented as a JSON array node into the expected vector datatype. For XML documents, the target element is expected to contain a string value formatted like a JSON array.

More information about other vector-specific column settings is available at Creating Template Views.

Note

Any change to dimension or to any ann-* setting triggers a re-index of affected documents. Additionally, indexed vector values are normalized by default. For more details, see Index Configuration.

ANN Search via Optic API

Now that our vector values are indexed, we can use the Optic API to perform ANN search using op.annTopK():

'use strict';
const op = require('/MarkLogic/optic');

function getVectorEquivalent(...){ ... }

let queryText = "How did Achilles bring down Troy?"
let queryVector = getVectorEquivalent(queryText)
let k = 10 // top-K, a.k.a. limit
let view = op.fromView("acme", "trojan")
view
  .annTopK(k, op.col('textEmbedding'), queryVector, op.col('distance'))
  .result()

See Use of an Embedding Model for the definition of the function getVectorEquivalent(). It is important to use the same embedding model for your stored embeddings and your query text.

Note that the current implementation is limited to cosine distance. If you need to search by Euclidean distance, a "brute force" kNN approach is still possible. As a reminder, this approach is computationally expensive because it calculates the distance between the query vector and every vector in the dataset.

'use strict';
const op = require('/MarkLogic/optic');

function getVectorEquivalent(...){ ... }

let queryText = "How did Achilles bring down Troy?"
let queryVector = getVectorEquivalent(queryText)
let view = op.fromView("acme", "trojan")
view
  .orderBy(op.as('distance', op.vec.euclideanDistance(op.col('textEmbedding'), queryVector)))
  .limit(10)
  .result()

See Use of an Embedding Model for the definition of the function getVectorEquivalent(). It is important to use the same embedding model for your stored embeddings and your query text.

Result Reranking Using Vector Index

The previous approach to reranking only applies vector distance to results that have already been filtered by the lexical search. Using the vector index, we can run the lexical and vector searches independently, then combine the scores from both sources to achieve true reciprocal rank fusion.

'use strict';
const op = require('/MarkLogic/optic');

function getVectorEquivalent(...){ ... }

let queryText = "How did Achilles bring down Troy?"
// we reduce the CTS search terms to those with length higher than 3.
let terms = queryText.toLowerCase().split(' ').filter(term => term.length > 3)
let ctsQuery = cts.wordQuery(terms)

// lexical search using CTS query.
let searchLimit = 10
let search = op.fromSearchDocs(ctsQuery)
  .limit(searchLimit)
  ;

// indexed vector search using query vector
let queryVector = getVectorEquivalent(queryText)
let k = 10 // top-K, a.k.a. limit
let view = op.fromView("acme", "trojan", null, op.fragmentIdCol('$viewFragment'))
  .annTopK(k, op.col('textEmbedding'), queryVector, op.col('distance')) 

let hybridScoreCol = op.as(
    'hybridScore', 
    op.vec.vectorScore(op.col('score'), op.col('distance'))
  )

//result
let resultLimit = 10
search
  .joinInner(
    view,
    op.on(
      op.fragmentIdCol('fragmentId'),
      op.fragmentIdCol('$viewFragment')
    )
  )
  .orderBy(op.desc(hybridScoreCol))
  .limit(resultLimit)
  .result()

Because of .joinInner(), the result set contains only documents that appear in both the lexical and the vector search. This approach may be too restrictive, as the lexical search only includes a document if at least one of the terms appears in it.

We can use .joinFullOuter() to let lexical search and vector search contribute results independently. However, the computation of hybridScore (lines 23 to 26 of the previous example) must then handle rows where the CTS score or the vector distance is null.

'use strict';
const op = require('/MarkLogic/optic');

function getVectorEquivalent(...){ ... }

let queryText = "How did Achilles bring down Troy?"
// we reduce the CTS search terms to those with length higher than 3.
let terms = queryText.toLowerCase().split(' ').filter(term => term.length > 3)
let ctsQuery = cts.wordQuery(terms)

// lexical search using CTS query.
let searchLimit = 5
let search = op.fromSearchDocs(ctsQuery)
  .limit(searchLimit)
  ;

// indexed vector search using query vector
let queryVector = getVectorEquivalent(queryText)
let k = 2 // top-K, a.k.a. limit
let view = op.fromView("acme", "trojan", null, op.fragmentIdCol('$viewFragment'))
  .annTopK(k, op.col('textEmbedding'), queryVector, op.col('distance')) 

const scoreCase = 
  // equivalent of: score ? score : 0
  op.case([ op.when(op.isDefined(op.col('score')), op.col('score')) ], 0)

const distanceCase = 
  // equivalent of: distance ? distance : 2
  op.case([ op.when(op.isDefined(op.col('distance')), op.col('distance')) ], 2)

let hybridScoreCol = op.as(
    'hybridScore', 
    op.vec.vectorScore(scoreCase, distanceCase)
  )

//result
let resultLimit=10
search
  .joinFullOuter(
    view,
    op.on(
      op.fragmentIdCol('fragmentId'),
      op.fragmentIdCol('$viewFragment')
    )
  )
  .orderBy(op.desc(hybridScoreCol))
  .limit(resultLimit)
  .result()

Notice that searchLimit (line 12), k (line 19), and resultLimit (line 37) may be configured with different values. It is common to use higher values for searchLimit and k than for your final resultLimit.

The use of .joinFullOuter() allows for documents that match both lexical and vector search to rank higher than documents that only match the lexical or vector search independently.
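The effect of the full outer join and its default values can be mimicked in plain JavaScript. This sketch only illustrates the join semantics; it does not use the Optic API, and the scores and distances are made up. The default distance of 2 mirrors distanceCase in the example and assumes cosine distance is 1 minus cosine similarity, whose largest possible value is 2.

```javascript
'use strict';

// Merge lexical rows ({ id, score }) and vector rows ({ id, distance }).
// A row missing from one side gets that side's "worst" default value.
function fullOuterJoin(lexicalRows, vectorRows) {
  const merged = new Map();
  for (const { id, score } of lexicalRows) {
    merged.set(id, { id, score, distance: 2 }); // no vector match yet
  }
  for (const { id, distance } of vectorRows) {
    const existing = merged.get(id);
    if (existing) {
      existing.distance = distance;              // matched by both searches
    } else {
      merged.set(id, { id, score: 0, distance }); // vector match only
    }
  }
  return [...merged.values()];
}

// Made-up results for the three sample documents.
const joined = fullOuterJoin(
  [{ id: '/53068.json', score: 12 }, { id: '/53073.json', score: 30 }],
  [{ id: '/53073.json', distance: 0.21 }, { id: '/53069.json', distance: 0.35 }]
);
// joined has three rows; only '/53073.json' carries both a real score
// and a real distance.
```

A document found by both searches keeps its real score and its real distance, which is why it tends to outrank documents that carry a default on one side.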

Tuning Performance and Accuracy

Recall and execution speed pull in opposite directions: settings that raise recall generally slow searches down. The following index and search options let you adjust this trade-off.

Index Configuration

annCompression takes a floating-point value between 0.0 and 1.0. This setting determines the size of the optimized vectors that are stored in the vector index. The default value is 0.5. Higher values will cause bigger indexes and slower searches, but may give more accurate results.

ann-indexed controls indexing of the configured vector column. It defaults to true when dimension is configured and to false when dimension is omitted. When set to true, extracted values are pre-processed for efficiency. If ann-distance is set to cosine, the extracted values are normalized (see vec.normalize()). When explicitly set to false, all other ann-* settings are ignored. A template that sets ann-indexed to true without setting dimension will throw an error.
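Why normalization helps with cosine distance can be shown in plain JavaScript: once two vectors are scaled to unit length, their cosine similarity is just their dot product. The normalize helper below is an illustrative stand-in written for this example, on the assumption that normalization means scaling to unit length; it is not the vec.normalize() implementation.

```javascript
'use strict';

function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Scale a vector to length 1. A zero vector has no direction and would
// produce NaN values here, so callers should handle that case separately.
function normalize(v) {
  const length = Math.sqrt(dot(v, v));
  return v.map(x => x / length);
}

const p = [3, 4];
const q = [4, 3];

// Cosine similarity computed the long way.
const cosine = dot(p, q) / (Math.sqrt(dot(p, p)) * Math.sqrt(dot(q, q)));

// The same value from a plain dot product of the normalized vectors.
const dotOfNormalized = dot(normalize(p), normalize(q));
// Both are 0.96, up to floating-point rounding.
```

Doing the division once at indexing time leaves only a dot product to compute for each comparison at query time.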

Search

op.annTopK() also accepts an options argument.

'use strict';
const op = require('/MarkLogic/optic');

function getVectorEquivalent(...){ ... }

let queryText = "How did Achilles bring down Troy?"
let queryVector = getVectorEquivalent(queryText)
let k = 10 // top-K, a.k.a. limit
let options = {
  "search-factor": 1,
  "max-distance": 0.54
}
let view = op.fromView("acme", "trojan")
view
  .annTopK(k, op.col('textEmbedding'), queryVector, op.col('distance'), options) 
  .result()

The option search-factor requires tuning to the needs of your application. It can be set to any floating-point value between 0 and 1. Higher values result in slower searches that may return more accurate results. Lower values result in faster searches that may return less accurate results. This option defaults to 1.

The option max-distance can be used to limit the documents included in the results. The distance is a floating-point value where a lower value indicates closer similarity to the provided queryVector. Rows with a distance greater than max-distance are not returned.

Reranking

When using vec.vectorScore() to compute the hybrid score, you can adjust the relative impact of distance and score using additional weight parameters.

let distanceWeight = 1
let weight = .5
let hybridScoreCol = op.as(
    'hybridScore', 
    op.vec.vectorScore(scoreCase, distanceCase, distanceWeight, weight)
  )

The hybridScore is computed using the formula of weight * annScore + (1 - weight) * ctsScore.

distanceWeight is a value that scales the value of distance. A higher value results in a lower annScore.

weight is a floating-point value between 0 and 1. A higher value emphasizes the influence of vector distance, while a lower value emphasizes the influence of CTS score.
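The effect of weight can be worked through with the formula above in plain JavaScript. How annScore is derived from distance and distanceWeight is not spelled out here, so the sketch treats annScore as a given input; the helper name and the sample values are made up.

```javascript
'use strict';

// hybridScore = weight * annScore + (1 - weight) * ctsScore
function hybridScore(annScore, ctsScore, weight) {
  if (weight < 0 || weight > 1) {
    throw new RangeError('weight must be between 0 and 1');
  }
  return weight * annScore + (1 - weight) * ctsScore;
}

// Made-up scores for a single document.
const annScore = 80;
const ctsScore = 40;

const balanced = hybridScore(annScore, ctsScore, 0.5);       // 60
const lexicalHeavy = hybridScore(annScore, ctsScore, 0.25);  // 50
const vectorOnly = hybridScore(annScore, ctsScore, 1);       // 80
```

At weight 1 the CTS score is ignored entirely, and at weight 0 only the CTS score counts.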

Best Practices and Considerations

The following are recommendations for best performance.

When using op:ann-top-k, keep k at a reasonable value (typically less than 100). k determines the number of vector computations performed as well as the number of documents included in the response. This cost is amplified further by simultaneous transactions of the same nature.
When using op:ann-top-k, setting k close to the total number of documents is likely to return fewer documents than expected. Because the search is approximate, it may return fewer than N documents when k = N.
When configuring your vector column ("scalarType":"vector"), specify a dimension of 16,000 or below.
When configuring a Query-Based View on top of a view (base view) with a vector column using op:generate-view(), all columns referenced from the base view must have the same name, type (QBV) / scalar-type (TDE), collation, nullable, coordinate-system and invalid-values settings as the columns from the base view. Otherwise, vector search will behave in kNN / brute force fashion. This is especially important for nullable that defaults to false for TDE and true for QBV; and invalid-values that defaults to reject for TDE and skip for QBV.

TDE Template

"columns":[
  {
    "name":"id",
    "scalarType":"string",
    "val":"id",
    "collation":"http://marklogic.com/collation/"
  }, {
    "name":"emb",
    "scalarType":"vector",
    "val":"vec:vector(emb)",
    "dimension": "768"
  }
...

QBV

.generateView('acme','qbvView', [
    { 
      "name": "id",
      "type": "string",
      "nullable": false,
      "invalidValues": "reject"
    }, { 
      "name": "emb",
      "type": "vector",
      "nullable": false,
      "invalidValues": "reject"
    }
...
