tldr: an ANALZYE statement reduced a complicated query's runtime by 99.9%

I've been investigating some health informatics data, including the taxonomy system SNOMED CT and the Australian derivatives, SNOMED CT-AU and the Australian Medicines Terminology. One of the challenges of digital health in Australia is that it can take weeks and lots of signed paperwork in order to get access to development resources and data, even if it is available free of charge.

While waiting for my Ontoserver licensing to come through, I found the SNOMED CT MySQL sample scripts from the Australian Digital Health Agency and thought they would be a good starting point for some exploration.

Unfortunately, I have a significant aversion to MySQL/MariaDB. I've spent too many weekends restoring it with a text editor and a toothpick (extremely 90s sysadmin reference), and discovered too many surprising landmines in the default feature set. Plus, the usual MySQL cluster I use was offline.

I decided I'd port the scripts to the world's most widely-deployed database, SQLite. It's super easy to set up, allows you to iterate fast and is really well documented.

Other than a bunch of MySQL footguns (proving once again that it is the C of databases) like foreign keys not having to be unique, and some limitations due to SQLite not having native functions, the scripts translated into SQLite fairly easily and produced a 2 GB database.

However a monster statement to turn the hyper-normalised data into something actually useable as a data source was taking six hours. In 2021! Well, on a 2016 laptop (Core i5 1.6 GHz). So I shifted it all to a faster VM with a Xeon ES-2690, though still from 2016, and it took... five and a half hours.

A bit of digging around in the EXPLAIN QUERY PLAN (extremely 90s Australian politics reference) showed nothing obvious, but a look through the SQLite Query Optimizer Overview (SQOOO?) identified one opportunity:

To help the optimizer get a more accurate estimate of the work involved in using various indexes, the user may optionally run the ANALYZE command... after making significant changes it might be prudent to rerun ANALYZE.

As I'd just inserted a few million rows into various tables, this seemed like it might help. Four seconds to run ANALYZE, and the large query time dropped from six hours to fifteen seconds, an approximately 99.93% reduction.

I've sometimes thought about SQLite as a "drop in and don't think about it" database, but it's clear that there's lots of little tips to change the very-conservative defaults.

Other useful commands have included PRAGMA journal_mode=WAL and the .timer interactive command.

Now the total runtime to create the database is about fifteen minutes, and I am tempted to try and improve it further, but for something I'm not going to do frequently or use long-term I should probably just enjoy a coffee while it runs.