Running Apache Airflow using CockroachDB as the metadata store: work in progress
Background and Motivation
Using CockroachDB as the back-end metadata store for Airflow makes sense in environments where downtime, whether unplanned or due to routine maintenance activities (DB software upgrades, schema changes, etc.), is not an option. That is what motivates this note on an initial attempt at getting this running. I make no claim that this is the correct approach, and I realize that others have likely gotten far further here than I have (issues.apache.org/jira/browse/AIRFLOW-4961, for example), but I wanted to jot down my notes here nonetheless, in hopes that others will add their thoughts.
Process
The environment for this is an Ubuntu 18.04.5 LTS VM running on a Mac, with PostgreSQL 13.2 running on the VM and CockroachDB v20.2.4 running on the Mac. The first step was to do the Airflow installation:
AIRFLOW_VERSION=2.0.1
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
Next, I followed this procedure to create a DB user, airflow, in the PostgreSQL DB (I did the same in CockroachDB).
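On the CockroachDB side, the equivalent setup is creating the airflow user and database and granting privileges. Here's a sketch of my own (not the linked procedure) using psycopg2, which the Airflow install above already pulled in; it assumes an admin user that can authenticate with a password, so adjust the connection URL for however your cluster is secured (on a cluster where the admin uses certificate authentication, the same statements can be run from a cockroach sql shell instead):
import psycopg2

# Hypothetical admin connection; adjust host, port, and credentials for your cluster.
ADMIN_URL = "postgresql://root@192.168.1.10:5432/defaultdb?sslmode=require"

conn = psycopg2.connect(ADMIN_URL)
conn.set_session(autocommit=True)  # each DDL statement commits on its own
with conn.cursor() as cur:
    cur.execute("CREATE USER IF NOT EXISTS airflow WITH PASSWORD 'airflow'")
    cur.execute("CREATE DATABASE IF NOT EXISTS airflow")
    cur.execute("GRANT ALL ON DATABASE airflow TO airflow")
conn.close()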
Then, I ran the Airflow init process, using the PostgreSQL instance as the back end. The reason I chose this approach is that, when I tried the init directly against CockroachDB, the Alembic migrations were a problem and I didn't get very far. This way, I get a working system, and I've made a note to deal with those issues later.
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:airflow@localhost:5432/airflow"
airflow db init
With that done, the next step was to dump the DDL, and then the data, from the PostgreSQL DB in a format I could use with CockroachDB. That's not too difficult, given that CockroachDB is largely PostgreSQL compatible. Here's that part:
$ sudo su - postgres
$ pg_dump --schema-only airflow > airflow_ddl.sql
$ pg_dump --data-only --inserts airflow > airflow_data.sql
After a little trial and error, it turned out there were three classes of problems with the DDL, each requiring a minor change. Here they are, from the diff between the airflow_ddl.sql file and its CockroachDB derivative (a - at the start of a line indicates that line gets removed, while a + indicates a line that is added):
CREATE SEQUENCE public.ab_permission_id_seq
- AS integer
START WITH 1
INCREMENT BY 1
NO MINVALUE
@@ -20,7 +19,6 @@
CACHE 1;
-ALTER TABLE public.ab_permission_id_seq OWNER TO airflow;
-ALTER TABLE ONLY public.ab_permission ALTER COLUMN id SET DEFAULT nextval('public.ab_permission_id_seq'::regclass);
+ALTER TABLE ONLY public.ab_permission ALTER COLUMN id SET DEFAULT nextval('public.ab_permission_id_seq');
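To avoid making those edits by hand, here's a small helper script of my own (a sketch, not part of the original procedure) that applies the three classes of changes to the dump and writes out the airflow_ddl_crdb.sql file used below. It's worth eyeballing the output, since the patterns are based only on what this particular dump contained:
import re

with open("airflow_ddl.sql") as f:
    ddl = f.read()

# 1. Drop the "AS integer" lines from the CREATE SEQUENCE statements.
ddl = re.sub(r"(?m)^\s+AS integer\n", "", ddl)

# 2. Drop the OWNER TO statements.
ddl = re.sub(r"(?m)^ALTER TABLE .* OWNER TO airflow;\n", "", ddl)

# 3. Strip the ::regclass cast from the nextval() column defaults.
ddl = ddl.replace("'::regclass)", "')")

with open("airflow_ddl_crdb.sql", "w") as f:
    f.write(ddl)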
Having made those edits, that DDL can be run within CockroachDB to create all the objects required by Airflow. After that, the data can be loaded. With PostgreSQL installed on the Ubuntu VM, the psql CLI can be used for this (the IP address of the Mac is 192.168.1.10, and HAProxy is configured to listen at port 5432):
$ psql 'postgres://airflow:airflow@192.168.1.10:5432/airflow?sslmode=require' < airflow_ddl_crdb.sql
$ psql 'postgres://airflow:airflow@192.168.1.10:5432/airflow?sslmode=require' < airflow_data.sql
CockroachDB has its own SQLAlchemy dialect, so that must be installed prior to the next step (this assumes that sqlalchemy and psycopg2 have already been installed):
$ sudo pip install sqlalchemy-cockroachdb
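Before wiring this into Airflow, a quick sanity check (my own, not part of the original procedure) confirms that the cockroachdb dialect is registered and the cluster is reachable, using the same connection URL that goes into airflow.cfg below:
from sqlalchemy import create_engine, text

engine = create_engine(
    "cockroachdb://airflow:airflow@192.168.1.10:5432/airflow?sslmode=require"
)
with engine.connect() as conn:
    # Should print a CockroachDB version string rather than a PostgreSQL one.
    print(conn.execute(text("SELECT version()")).scalar())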
Prior to starting Airflow, edit $AIRFLOW_HOME/airflow.cfg (if AIRFLOW_HOME is not set, it defaults to $HOME/airflow), making changes in the following places, where the same - and + notation is used to denote the changes:
# Here, the CockroachDB SQLAlchemy dialect is specified.
-sql_alchemy_conn = sqlite:///{AIRFLOW_HOME}/airflow.db
+sql_alchemy_conn = cockroachdb://airflow:airflow@192.168.1.10:5432/airflow?sslmode=require
# I noticed that, after a while, I was seeing some connection-related errors, so I reduced this timeout.
# I combined this change with a switch from connecting directly to the CockroachDB node to using
# the HAProxy connection. I'll need to further research which of these cured my problem here.
-sql_alchemy_pool_recycle = 1800
+sql_alchemy_pool_recycle = 60
# When set to True, Airflow will add SKIP LOCKED to some queries, but this is not yet supported
# by CockroachDB. Refer to https://github.com/cockroachdb/cockroach/issues/40476?version=v20.2
-use_row_level_locking = True
+use_row_level_locking = False
Finally, it's time to start Airflow:
nohup airflow scheduler >> scheduler.log 2>&1 </dev/null &
airflow webserver -p 8080 -D
And add an Airflow user:
airflow users create -e mgoddard@localhost.localdomain -f Mike -l Goddard -p airflow -r Admin -u mgoddard
Using a psql CLI, logged into the CockroachDB instance, the new user's data can be verified:
airflow=> \x
Expanded display is on.
airflow=> select * from ab_user;
-[ RECORD 1 ]----+-----------------------------------------------------------------------------------------------
id | 1
first_name | Mike
last_name | Goddard
username | mgoddard
password | pbkdf2:sha256:150000$rUKpbCRw$180124e4422d0f4e5ebb93bbdeed209bdf0b061dbac80bc15739182a4db77251
active | t
email | mgoddard@localhost.localdomain
last_login | 2021-02-13 13:45:33.380335
login_count | 1
fail_login_count | 0
created_on | 2021-02-13 13:45:20.278198
changed_on | 2021-02-13 13:45:20.278205
created_by_fk |
changed_by_fk |
Time: 3.288 ms
Issues
- The Alembic migrations need to be addressed
- Determine what, if any, consequences there are due to removing the SKIP LOCKED clauses
- CockroachDB is strict about supporting a single isolation level, serializable. Due to this, the recommended approach to handling commits is to wrap them in retry logic. When using the cockroachdb SQLAlchemy dialect, the suggested approach is documented here: cockroachlabs.com/docs/v20.2/build-a-python... (a sketch of that pattern follows this list)
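For reference, here is a minimal sketch of that retry pattern, assuming the run_transaction helper exported by the sqlalchemy-cockroachdb package (the query is just the ab_user count from earlier; Airflow itself doesn't use this helper):
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker
from sqlalchemy_cockroachdb import run_transaction

engine = create_engine(
    "cockroachdb://airflow:airflow@192.168.1.10:5432/airflow?sslmode=require"
)
Session = sessionmaker(bind=engine)

def count_airflow_users(session):
    # The whole callback runs as a single transaction; run_transaction re-invokes it
    # if CockroachDB reports a retryable serialization error.
    return session.execute(text("SELECT count(*) FROM ab_user")).scalar()

print(run_transaction(Session, count_airflow_users))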