Good Enough Text Search

With the release on Tuesday (May 16, 2023) of version 23.1 of CockroachDB, we all get an unanticipated treat: PostgreSQL-compatible full-text indexing and search! As usual, the CockroachDB docs do a fine job of describing this; they are here: https://www.cockroachlabs.com/docs/dev/full-text-search.html

The motivation behind this brief note is that, after experimenting with the tsvector and tsquery capabilities in 23.1, I've observed a few things I feel are worth passing along. So, without further ado ... an example/demo.

I'm running all this on a MacBook Pro and my first step is to snag a copy of the most recent release, 23.1.1, from here: https://www.cockroachlabs.com/docs/releases/index.html?filters=mac

Then, I install it per https://www.cockroachlabs.com/docs/v23.1/install-cockroachdb-mac.html and, finally, start a single node (insecure mode):

$ cockroach start-single-node --insecure --listen-addr=localhost:26257 --http-addr=localhost:18080 --background

Next, I'll log into the single node CockroachDB instance using the psql CLI (recall that CockroachDB is wire-compatible with Postgres):

$ psql "postgres://root@localhost:26257/defaultdb"

The data set I used for this exercise is the current CockroachDB docs, which are available here: https://github.com/cockroachdb/docs . Once these are checked out from GitHub, they can be built by running the following (from within a new iTerm tab). You don't need to do this unless you want to duplicate this experiment, though. Here's the build step:

$ cd ./docs/
$ make cockroachdb

After that build completes, the resulting docs for the 23.1 release will be in the _site/docs/v23.1 sub-directory. We'll make use of these in a minute but, first, we need to go back to the psql CLI and create the DB objects we need for this:

defaultdb=> CREATE TABLE crdb_docs
(
  uri TEXT PRIMARY KEY
  , content TEXT
  , tsvec TSVECTOR AS (TO_TSVECTOR('english', content)) STORED
  , FAMILY cf1 (uri, tsvec)
  , FAMILY cf2 (content)
);
CREATE INDEX ON crdb_docs USING GIN (tsvec);
CREATE TABLE
Time: 70.673 ms
CREATE INDEX
Time: 2539.335 ms (00:02.539)

Looking at that DDL, the justification for writing this post begins to emerge:

Note that the tsvec column is generated and STORED. That is because it plays a role both in testing for a match and also in scoring. If it wasn't stored here, it would have to be recomputed during scoring (aka ranking), which is expensive and would slow down searches.
The use of column families (FAMILY ...) here provides an optimization: the content column, potentially large as it contains the entire text of each HTML page, is stored separately from the uri and tsvec columns since it is only accessed once a user selects a particular search result. This will also help to speed up the search process.

With that table defined (within the DB defaultdb, which is always present in a new CockroachDB installation), we can return to the Bash shell in the other iTerm tab and go about loading the data. Well, almost. First, I should show the Python script which will help with the task of reading each of the HTML docs files, stripping the text from the HTML, and inserting it into the table:

#!/usr/bin/env python3

import psycopg2
from bs4 import BeautifulSoup
import sys
import os
import re
import time

base_url = "https://www.cockroachlabs.com/docs/dev"

if len(sys.argv) < 2:
  print("Usage: %s file.html [file2.html ... fileN.html]" % sys.argv[0])
  sys.exit(1)

db_url = os.getenv("DB_URL")
if db_url is None:
  print("DB_URL must be set")
  sys.exit(1)

def index_file(in_file):
  html = ""
  with open(in_file, mode='rt') as f:
    for line in f:
      html += line
  in_file = re.sub(r'\./', '', in_file)
  soup = BeautifulSoup(html, 'html.parser')
  text = soup.get_text()
  sql = "INSERT INTO crdb_docs (uri, content) VALUES (%s, %s)"
  with conn.cursor() as cur:
    if base_url is not None:
      in_file = base_url + '/' + in_file
    cur.execute(sql, (in_file, text))
    conn.commit()

conn = psycopg2.connect(db_url)
t0 = time.time()
for in_file in sys.argv[1:]:
  print("Indexing file " + in_file + " now ...")
  index_file(in_file)
t1 = time.time()
print("Total time: " + str(t1 - t0) + " s")
conn.close()

That Python script is saved into the file html_indexer.py. With that saved into /tmp/, we can proceed to index these HTML docs:

$ export DB_URL="postgres://root@localhost:26257/defaultdb"
$ cd docs/_site/docs/v23.1/
$ chmod +x /tmp/html_indexer.py
$ /tmp/html_indexer.py $( find . -name '*.html' )
Indexing file ./bulk-update-data.html now ...
Indexing file ./monitor-and-debug-changefeeds.html now ...
Indexing file ./pg-catalog.html now ...
[...]
Indexing file ./create-security-certificates-custom-ca.html now ...
Indexing file ./tsvector.html now ...
Total time: 87.56131410598755 s

Now, back to the iTerm tab where psql is running. First, we verify the row count matches the number of HTML docs:

defaultdb=> select count(*) from crdb_docs;
 count
-------
   546
(1 row)

Time: 20.245 ms

Looks good. Now, on to the fun part -- let's search for docs matching the two terms Kubernetes operator:

defaultdb=> WITH q_parse AS
(
  SELECT PLAINTO_TSQUERY('english', 'Kubernetes operator') q
),
q_bool AS
(
  SELECT uri, (TS_RANK(tsvec, q, 1) * 100.0)::NUMERIC(9, 3) score
  FROM crdb_docs, q_parse
  WHERE tsvec @@ q
)
SELECT * FROM q_bool
WHERE score > 0.1
ORDER BY score DESC
LIMIT 10;
                                                uri                                                | score
---------------------------------------------------------------------------------------------------+-------
 https://www.cockroachlabs.com/docs/dev/deploy-cockroachdb-with-kubernetes.html                    | 6.313
 https://www.cockroachlabs.com/docs/dev/schedule-cockroachdb-kubernetes.html                       | 5.857
 https://www.cockroachlabs.com/docs/dev/kubernetes-overview.html                                   | 4.256
 https://www.cockroachlabs.com/docs/dev/deploy-cockroachdb-with-kubernetes-insecure.html           | 4.002
 https://www.cockroachlabs.com/docs/dev/orchestrate-cockroachdb-with-kubernetes-multi-cluster.html | 3.733
 https://www.cockroachlabs.com/docs/dev/kubernetes-performance.html                                | 3.432
 https://www.cockroachlabs.com/docs/dev/secure-cockroachdb-kubernetes.html                         | 3.217
 https://www.cockroachlabs.com/docs/dev/orchestrate-a-local-cluster-with-kubernetes.html           | 1.746
 https://www.cockroachlabs.com/docs/dev/cloud-storage-authentication.html                          | 1.707
 https://www.cockroachlabs.com/docs/dev/deploy-cockroachdb-with-kubernetes-openshift.html          | 1.689
(10 rows)

Time: 92.317 ms

We got our 10 results, ordered by relevance to the search terms. Great! Let's go over each section of that SQL query:

The first common table expression (CTE), named q_parse, is there to avoid having to type the query twice. In your code, you'd use bind variables, so would omit this CTE and just reference the bind variable in the two places it's needed. This just converts the query string into a tsquery instance, which is required to (a) test for matches and (b) score the results.
The second CTE, q_bool, is where the @@ operator provides the boolean result of whether or not the row matches the search criteria and is also where a rank (or score) is computed for each matching row, using TS_RANK(tsvec, q, 1). The 1 occupying the third slot to that function specifies a normalization function to use when ranking the results; it says to divide the rank by (1 + the logarithm of the document length). https://www.postgresql.org/docs/current/textsearch-controls.html details other available normalization options.
Finally, the results are sorted in descending order of score, and I added a score cutoff of 0.1 and a LIMIT of 10 results. The second CTE is only required because I wanted to add that score cutoff.

The purpose of the GIN (inverted) index is to accelerate these text search queries. Let's get an EXPLAIN plan to verify it's being used:

defaultdb=> SET enable_zigzag_join = false;
SET
Time: 1.496 ms
defaultdb=> EXPLAIN WITH q_parse AS
(
  SELECT PLAINTO_TSQUERY('english', 'Kubernetes operator') q
),
q_bool AS
(
  SELECT uri, (TS_RANK(tsvec, q, 1) * 100.0)::NUMERIC(9, 3) score
  FROM crdb_docs, q_parse
  WHERE tsvec @@ q
)
SELECT * FROM q_bool
WHERE score > 0.1
ORDER BY score DESC
LIMIT 10;
                                              info
------------------------------------------------------------------------------------------------
 distribution: local
 vectorized: true

 • top-k
 │ estimated row count: 10
 │ order: -score
 │ k: 10
 │
 └── • filter
     │ estimated row count: 61
     │ filter: score > 0.1
     │
     └── • render
         │
         └── • index join
             │ estimated row count: 182
             │ table: crdb_docs@crdb_docs_pkey
             │
             └── • inverted filter
                 │ estimated row count: 497
                 │ inverted column: tsvec_inverted_key
                 │ num spans: 2
                 │
                 └── • scan
                       estimated row count: 497 (91% of the table; stats collected 2 hours ago)
                       table: crdb_docs@crdb_docs_tsvec_idx
                       spans: 2 spans
(27 rows)

Time: 4.698 ms

Yes, the optimizer is choosing to use the index. Note that I disabled the use of zigzag joins, which I've found to have an impact on the performance of these queries, but only with larger data sets. Still, I wanted to point that out so you know to experiment with it.

I'm excited about this new feature and wanted to share it. Is CockroachDB intended to be a search engine? No, but now it can help when you need text search but aren't ready to dedicate resources specifically for that purpose. For a more full-featured text indexing and search experience, I'd still recommend using CockroachDB changefeeds (CDC) to keep your Elasticsearch (or Typesense, or whatever you prefer) in sync with your transactional data in CockroachDB.

Thank you for having a read through this post! I hope it was worthwhile and look forward to the next one.

Good Enough Text Search

Table of contents

No headings in the article.