Good Enough Text Search
With the release on Tuesday (May 16, 2023) of version 23.1 of CockroachDB, we all get an unanticipated treat: PostgreSQL-compatible full-text indexing and search! As usual, the CockroachDB docs do a fine job of describing this; they are here: https://www.cockroachlabs.com/docs/dev/full-text-search.html
The motivation behind this brief note is that, after experimenting with the tsvector and tsquery capabilities in 23.1, I've observed a few things I feel are worth passing along. So, without further ado ... an example/demo.
I'm running all this on a MacBook Pro and my first step is to snag a copy of the most recent release, 23.1.1, from here: https://www.cockroachlabs.com/docs/releases/index.html?filters=mac
Then, I install it per https://www.cockroachlabs.com/docs/v23.1/install-cockroachdb-mac.html and, finally, start a single node (insecure mode):
$ cockroach start-single-node --insecure --listen-addr=localhost:26257 --http-addr=localhost:18080 --background
Next, I'll log into the single node CockroachDB instance using the psql CLI (recall that CockroachDB is wire-compatible with Postgres):
$ psql "postgres://root@localhost:26257/defaultdb"
The data set I used for this exercise is the current CockroachDB docs, which are available here: https://github.com/cockroachdb/docs . Once these are checked out from GitHub, they can be built by running the following (from within a new iTerm tab). You don't need to do this unless you want to duplicate this experiment, though. Here's the build step:
$ cd ./docs/
$ make cockroachdb
After that build completes, the resulting docs for the 23.1 release will be in the _site/docs/v23.1
sub-directory. We'll make use of these in a minute but, first, we need to go back to the psql CLI and create the DB objects we need for this:
defaultdb=> CREATE TABLE crdb_docs
(
uri TEXT PRIMARY KEY
, content TEXT
, tsvec TSVECTOR AS (TO_TSVECTOR('english', content)) STORED
, FAMILY cf1 (uri, tsvec)
, FAMILY cf2 (content)
);
CREATE INDEX ON crdb_docs USING GIN (tsvec);
CREATE TABLE
Time: 70.673 ms
CREATE INDEX
Time: 2539.335 ms (00:02.539)
Looking at that DDL, the justification for writing this post begins to emerge:
Note that the tsvec column is generated and STORED. That is because it plays a role both in testing for a match and in scoring: if it weren't stored, it would have to be recomputed during scoring (aka ranking), which is expensive and would slow down searches. (See the quick example just after this list for what this generated column actually contains.)
The use of column families (FAMILY ...) provides an optimization: the content column, potentially large since it holds the entire text of an HTML page, is stored separately from the uri and tsvec columns because it is only accessed once a user selects a particular search result. This also helps to speed up the search process.
With that table defined (within the DB defaultdb, which is always present in a new CockroachDB installation), we can return to the Bash shell in the other iTerm tab and go about loading the data. Well, almost. First, I should show the Python script which will help with the task of reading each of the HTML docs files, stripping the text from the HTML, and inserting it into the table:
#!/usr/bin/env python3
import psycopg2
from bs4 import BeautifulSoup
import sys
import os
import re
import time

# Each indexed page is stored under its URI on the published docs site
base_url = "https://www.cockroachlabs.com/docs/dev"

if len(sys.argv) < 2:
    print("Usage: %s file.html [file2.html ... fileN.html]" % sys.argv[0])
    sys.exit(1)

db_url = os.getenv("DB_URL")
if db_url is None:
    print("DB_URL must be set")
    sys.exit(1)

def index_file(in_file):
    # Read the HTML file, strip the markup, and insert the plain text into the table
    html = ""
    with open(in_file, mode='rt') as f:
        for line in f:
            html += line
    in_file = re.sub(r'\./', '', in_file)
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    sql = "INSERT INTO crdb_docs (uri, content) VALUES (%s, %s)"
    with conn.cursor() as cur:
        if base_url is not None:
            in_file = base_url + '/' + in_file
        cur.execute(sql, (in_file, text))
    conn.commit()

conn = psycopg2.connect(db_url)
t0 = time.time()
for in_file in sys.argv[1:]:
    print("Indexing file " + in_file + " now ...")
    index_file(in_file)
t1 = time.time()
print("Total time: " + str(t1 - t0) + " s")
conn.close()
That Python script is saved as /tmp/html_indexer.py. With that in place, we can proceed to index these HTML docs:
$ export DB_URL="postgres://root@localhost:26257/defaultdb"
$ cd docs/_site/docs/v23.1/
$ chmod +x /tmp/html_indexer.py
$ /tmp/html_indexer.py $( find . -name '*.html' )
Indexing file ./bulk-update-data.html now ...
Indexing file ./monitor-and-debug-changefeeds.html now ...
Indexing file ./pg-catalog.html now ...
[...]
Indexing file ./create-security-certificates-custom-ca.html now ...
Indexing file ./tsvector.html now ...
Total time: 87.56131410598755 s
Now, back to the iTerm tab where psql is running. First, we verify the row count matches the number of HTML docs:
defaultdb=> select count(*) from crdb_docs;
count
-------
546
(1 row)
Time: 20.245 ms
Looks good. Now, on to the fun part -- let's search for docs matching the two terms Kubernetes operator:
defaultdb=> WITH q_parse AS
(
SELECT PLAINTO_TSQUERY('english', 'Kubernetes operator') q
),
q_bool AS
(
SELECT uri, (TS_RANK(tsvec, q, 1) * 100.0)::NUMERIC(9, 3) score
FROM crdb_docs, q_parse
WHERE tsvec @@ q
)
SELECT * FROM q_bool
WHERE score > 0.1
ORDER BY score DESC
LIMIT 10;
uri | score
---------------------------------------------------------------------------------------------------+-------
https://www.cockroachlabs.com/docs/dev/deploy-cockroachdb-with-kubernetes.html | 6.313
https://www.cockroachlabs.com/docs/dev/schedule-cockroachdb-kubernetes.html | 5.857
https://www.cockroachlabs.com/docs/dev/kubernetes-overview.html | 4.256
https://www.cockroachlabs.com/docs/dev/deploy-cockroachdb-with-kubernetes-insecure.html | 4.002
https://www.cockroachlabs.com/docs/dev/orchestrate-cockroachdb-with-kubernetes-multi-cluster.html | 3.733
https://www.cockroachlabs.com/docs/dev/kubernetes-performance.html | 3.432
https://www.cockroachlabs.com/docs/dev/secure-cockroachdb-kubernetes.html | 3.217
https://www.cockroachlabs.com/docs/dev/orchestrate-a-local-cluster-with-kubernetes.html | 1.746
https://www.cockroachlabs.com/docs/dev/cloud-storage-authentication.html | 1.707
https://www.cockroachlabs.com/docs/dev/deploy-cockroachdb-with-kubernetes-openshift.html | 1.689
(10 rows)
Time: 92.317 ms
We got our 10 results, ordered by relevance to the search terms. Great! Let's go over each section of that SQL query:
The first common table expression (CTE), named q_parse, is there to avoid having to type the query twice. In your application code you'd use a bind variable instead, so you would omit this CTE and just reference the bind variable in the two places it's needed (a sketch of that form appears just after this list). The CTE simply converts the query string into a tsquery instance, which is required both to (a) test for matches and (b) score the results.
The second CTE, q_bool, is where the @@ operator provides the boolean result of whether or not the row matches the search criteria, and is also where a rank (or score) is computed for each matching row, using TS_RANK(tsvec, q, 1). The 1 in the third argument slot specifies a normalization option to use when ranking the results; it says to divide the rank by (1 + the logarithm of the document length). https://www.postgresql.org/docs/current/textsearch-controls.html details the other available normalization options.
Finally, the results are sorted in descending order of score, and I added a score cutoff of 0.1 and a LIMIT of 10 results. The second CTE is only required because I wanted to add that score cutoff.
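As a concrete illustration of that bind-variable form, here is one way to express it directly in psql using a prepared statement. This is just a sketch: the statement name docs_search is my own, and I've dropped the score cutoff to keep it to a single statement:
defaultdb=> PREPARE docs_search (TEXT) AS
SELECT uri, (TS_RANK(tsvec, PLAINTO_TSQUERY('english', $1), 1) * 100.0)::NUMERIC(9, 3) AS score
FROM crdb_docs
WHERE tsvec @@ PLAINTO_TSQUERY('english', $1)
ORDER BY score DESC
LIMIT 10;
defaultdb=> EXECUTE docs_search('Kubernetes operator');
An application driver (like the psycopg2 used in the indexing script) does the same thing under the covers: you pass 'Kubernetes operator' as a query parameter and reference it in the two places it's needed.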
The purpose of the GIN (inverted) index is to accelerate these text search queries. Let's get an EXPLAIN
plan to verify it's being used:
defaultdb=> SET enable_zigzag_join = false;
SET
Time: 1.496 ms
defaultdb=> EXPLAIN WITH q_parse AS
(
SELECT PLAINTO_TSQUERY('english', 'Kubernetes operator') q
),
q_bool AS
(
SELECT uri, (TS_RANK(tsvec, q, 1) * 100.0)::NUMERIC(9, 3) score
FROM crdb_docs, q_parse
WHERE tsvec @@ q
)
SELECT * FROM q_bool
WHERE score > 0.1
ORDER BY score DESC
LIMIT 10;
info
------------------------------------------------------------------------------------------------
distribution: local
vectorized: true
• top-k
│ estimated row count: 10
│ order: -score
│ k: 10
│
└── • filter
│ estimated row count: 61
│ filter: score > 0.1
│
└── • render
│
└── • index join
│ estimated row count: 182
│ table: crdb_docs@crdb_docs_pkey
│
└── • inverted filter
│ estimated row count: 497
│ inverted column: tsvec_inverted_key
│ num spans: 2
│
└── • scan
estimated row count: 497 (91% of the table; stats collected 2 hours ago)
table: crdb_docs@crdb_docs_tsvec_idx
spans: 2 spans
(27 rows)
Time: 4.698 ms
Yes, the optimizer is choosing to use the index. Note that I disabled zigzag joins, which I've found can affect the performance of these queries, though only with larger data sets than this one. I point it out so you know it's a setting worth experimenting with.
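For instance, to experiment on your own data set, you can check the current value and put it back to its default before re-running the same EXPLAIN to compare plans; these are just ordinary session-variable statements:
defaultdb=> SHOW enable_zigzag_join;
defaultdb=> RESET enable_zigzag_join;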
I'm excited about this new feature and wanted to share it. Is CockroachDB intended to be a search engine? No, but now it can help when you need text search but aren't ready to dedicate resources specifically for that purpose. For a more full-featured text indexing and search experience, I'd still recommend using CockroachDB changefeeds (CDC) to keep your Elasticsearch (or Typesense, or whatever you prefer) in sync with your transactional data in CockroachDB.
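In that architecture, the CockroachDB side is simply a changefeed on the table, with a small consumer downstream writing each changed row into the search index. As a rough sketch (the Kafka endpoint is a placeholder, and changefeeds to external sinks require an enterprise license), it would look something like:
defaultdb=> CREATE CHANGEFEED FOR TABLE crdb_docs INTO 'kafka://localhost:9092' WITH updated;
From there, whatever consumes that topic is responsible for pushing documents into Elasticsearch, Typesense, or your search service of choice.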
Thank you for having a read through this post! I hope it was worthwhile and look forward to the next one.