
SQLite configuration

SQLite is really easy to compile into your application. However, Linux and BSD maintainers prefer to have a single centralized shared library that all platform-provided programs use. You can see a (very long) list of platforms and their SQLite versions here.

SQLite can be compiled with many options, such as full text search, which almost all platforms seem to enable even though it defaults to off. That got me curious to see what each platform does differently from the defaults.

I went off and investigated, with the per-platform details at the end.

Commentary

It was noticeable just how arbitrary each platform seemed. With the exception of OpenBSD, every platform had several configuration settings explicitly set to their default values. This does not result in any warnings or errors. It is generally the result of things like JSON, which used to be off by default and had to be explicitly enabled, but is now on by default. I don't know of any reasonable way for maintainers to find out that they are redundantly setting things to their default values.

COLUMN_METADATA

Column metadata is enabled by everyone. I did see some comments saying that Firefox needs it (and UNLOCK_NOTIFY), but it is enabled even by platforms that don't ship Firefox.

My own experience with Python developers has been that many also use COLUMN_METADATA, and I suspect many other dynamic languages like Ruby, Perl, and Lua do too. The problem is that you can't map the text of a SQL query back to named columns yourself, because SQL queries are too difficult to parse and manipulate. Other databases provide the column and table names for query results, so code out there expects it. Even the Python database spec (DBAPI, aka PEP 249) requires it.

Best practice is to explicitly name the result columns (using AS), which works without requiring COLUMN_METADATA.

SELECT name AS name, SUM(abc) AS abc_count, two.name AS two_name
  FROM contacts, two;
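
For example, with Python's stdlib sqlite3 module (the table and data here are made up for illustration), the aliased names come back through the standard DBAPI cursor.description without COLUMN_METADATA:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contacts(name TEXT, abc INTEGER)")
cur = con.execute("SELECT name AS name, SUM(abc) AS abc_count FROM contacts")
# PEP 249 requires the column name as the first item of each
# description entry - the AS aliases make the names predictable
print([d[0] for d in cur.description])  # ['name', 'abc_count']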

Extensions

FTS3 and FTS4 are enabled by default, but FTS5 is not. There is a good description of the differences. I measured how big the extensions compiled on my system are, as a percentage of the SQLite library with all extensions enabled (1.6MB):

Extension   Size        Percent
FTS3 & 4    93kb        6%
FTS5        200kb       13%
JSON        50kb        3%
Session     53kb        3%
Soundex     813 bytes   0%

MAX_VARIABLE_NUMBER

This one I can't make any sense of. You can use numbered bindings, which start at 1.

SELECT name, email FROM contacts WHERE name = ? AND age < ?;

When programmatically executing this, you would supply bindings number 1 and 2. It is also possible to number them explicitly, so in this example you'd provide 97 bindings even though only numbers 73 and 97 are used.

SELECT ?73, ?97;

You can also provide bindings by name using :, $, or @ like this, supplying them as a hash/dictionary type object to the API:

SELECT :foo, $bar, @bam;
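
To make that concrete, here is a minimal sketch using Python's stdlib sqlite3 module (the values are made up):

import sqlite3

con = sqlite3.connect(":memory:")
# numbered bindings are supplied in order, starting at 1
con.execute("SELECT ?, ?", ("alice", 42))
# named bindings are supplied as a mapping
con.execute("SELECT :foo, :bar", {"foo": "alice", "bar": 42})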

The default limit on numbered bindings is 32,767. I cannot think of any scenario in which you would want a quarter or half a million of them, yet that is exactly what some platforms configure!

USE_URI

URI filenames let you easily provide extra parameters when opening a database. You can programmatically turn on URI filenames, but unfortunately that has to be done as the very first thing after loading SQLite, before any other use, which is hard to arrange. You can also pass SQLITE_OPEN_URI when calling sqlite3_open_v2 (added in 2007).
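
From Python's stdlib sqlite3 module it looks like this (the filename is hypothetical) - the uri flag makes the library pass SQLITE_OPEN_URI for you:

import sqlite3

# the filename is now parsed as a URI, so extra parameters can be
# supplied - here the database is opened read-only
con = sqlite3.connect("file:app.db?mode=ro", uri=True)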

I'm guessing this default was added to some platforms years ago and never changed.

Security

SECURE_DELETE ensures deleted data is overwritten with zeroes. Without it, old data can still be visible to anyone examining the database file contents. It can also be controlled by a pragma. There is a good case to be made for it being on by default to avoid unintentionally leaking data, for example when databases are backed up or shared with other people.
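
As a minimal sketch of the pragma form in Python (the filename is hypothetical):

import sqlite3

con = sqlite3.connect("app.db")
# deleted content on this connection is now overwritten with zeroes
# instead of being left behind in free pages
con.execute("PRAGMA secure_delete = ON")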

API_ARMOR helps defend against programming errors, which are especially easy to make in C. As a maintainer I would enable it, because a more complex program using SQLite often uses lots of libraries, and all it takes is a bug in any one of them to cause problems. Every few months an issue comes up in the SQLite forums where that has happened.

STRICT_SUBTYPE is off by default, with threats in the SQLite code that it may be turned on by default in the future. Maintainers should probably turn this on.

LIKE_DOESNT_MATCH_BLOBS should be on by default, because using LIKE doesn't make sense against blobs. For example what encoding should be used?

DQS isn't set by anyone, but really should be. The default setting allows developers to accidentally use double quotes for strings, when they should be used only for identifiers (column and table names etc). This is a frequent cause of bugs on the SQLite forums, because programming languages use double quotes for strings, and it is easy to assume SQL does too. It doesn't.
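
Here is a small Python sketch of the failure mode (the table is made up). The developer meant the string 'name', but since a column of that name exists, the double-quoted version silently resolves to the column instead:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contacts(name TEXT)")
con.execute("INSERT INTO contacts VALUES('Alice')")
# with double quoted strings allowed (the default), "name" resolves to
# the column, making the condition name = name - every row matches
# and no error is raised
print(con.execute('SELECT * FROM contacts WHERE name = "name"').fetchall())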

Miscellaneous

STAT4 should get you better queries with multi-column indices.

I can't imagine why the offset SQL function would be needed.

MAX_SCHEMA_RETRY defaults to 50, so I don't know why half that amount would make any difference.

Summary

Other than COLUMN_METADATA, there isn't much consistency between the platforms. SQLite is trying to serve many users, from tiny devices with trivial amounts of memory and storage, to gigantic servers with terabytes of databases and corresponding amounts of CPU and RAM. Heck, the default cache size is 2MB, which is wrong for everyone.

So far platform maintainers are making decisions about what to configure for their platform, but that does result in redundancy, duplicated effort, and some amount of guessing.

SQLite doesn't tend to change things because there is an implicit guarantee that if you have a working program using SQLite, and you upgrade the SQLite library, then your program will continue to work.

There are recommended options which are appropriate when compiling SQLite for your own application, but not usable when providing SQLite for a platform used by many applications.

Details

Each entry shows what looked different to me, based on visual inspection of the default build rules for each platform. You can get more information at https://www.sqlite.org/compile.html. I've left off the SQLITE_ and ENABLE_ prefixes from the names. I looked at the most recent version of each platform.

You can use pragma compile_options on your own platform to see what is in use. SQLite can have different defaults between the plain library and the CLI, making this a little more confusing.
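
From Python that looks like this (the output will vary by build):

import sqlite3

con = sqlite3.connect(":memory:")
for (option,) in con.execute("PRAGMA compile_options"):
    print(option)  # e.g. ENABLE_FTS5, THREADSAFE=1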

Platform and configuration differences:

MacOS
  CKPTFULLFSYNC
  JOURNAL_SIZE_LIMIT=32768
  LOOKASIDE=1200,102
  COLUMN_METADATA
  FTS5
  LOCKING_STYLE=1
  NORMALIZE
  PREUPDATE_HOOK
  SESSION
  SNAPSHOT
  SQLLOG
  STMT_SCANSTATUS
  UPDATE_DELETE_LIMIT
  HAS_CODEC_RESTRICTED
  MAX_VARIABLE_NUMBER=500000
  THREADSAFE=2
  USE_URI

Debian
  SECURE_DELETE
  COLUMN_METADATA
  SOUNDEX
  UNLOCK_NOTIFY
  ALLOW_ROWID_IN_VIEW
  UPDATE_DELETE_LIMIT
  LOAD_EXTENSION
  LIKE_DOESNT_MATCH_BLOBS
  USE_URI
  MAX_SCHEMA_RETRY=25
  PREUPDATE_HOOK
  SESSION
  STRICT_SUBTYPE
  MAX_VARIABLE_NUMBER=250000

Fedora
  COLUMN_METADATA
  DISABLE_DIRSYNC
  SECURE_DELETE
  UNLOCK_NOTIFY

Gentoo
  LOAD_EXTENSION
  API_ARMOR
  COLUMN_METADATA
  EXPLAIN_COMMENTS
  HIDDEN_COLUMNS
  MEMSYS5
  NORMALIZE
  OFFSET_SQL_FUNC
  PREUPDATE_HOOK
  GEOPOLY
  SESSION
  STMT_SCANSTATUS
  UNLOCK_NOTIFY
  UPDATE_DELETE_LIMIT
  SOUNDEX
  USE_URI

Arch
  COLUMN_METADATA
  UNLOCK_NOTIFY
  SECURE_DELETE
  STAT4
  MAX_VARIABLE_NUMBER=250000
  MAX_EXPR_DEPTH=10000
  FTS5

Termux
  COLUMN_METADATA
  UPDATE_DELETE_LIMIT

FreeBSD
  DEFAULT_FILE_PERMISSIONS
  DIRECT_READ
  LOAD_EXTENSION
  FTS5
  COLUMN_METADATA
  SECURE_DELETE
  UNLOCK_NOTIFY
  USE_URI

OpenBSD
  FTS5
  UNLOCK_NOTIFY
  COLUMN_METADATA

Brew (homebrew)
  API_ARMOR
  COLUMN_METADATA
  FTS5
  MEMORY_MANAGEMENT
  STAT4
  UNLOCK_NOTIFY
  MAX_VARIABLE_NUMBER=250000
  USE_URI
  PREUPDATE_HOOK
  SESSION

Macports
  DISABLE_INTRINSIC
  COLUMN_METADATA
  FTS5
  SECURE_DELETE
  STAT4
  UNLOCK_NOTIFY
  SOUNDEX

Nix
  COLUMN_METADATA
  FTS5
  STMT_SCANSTATUS
  UNLOCK_NOTIFY
  SOUNDEX
  SECURE_DELETE
  MAX_VARIABLE_NUMBER=250000
  MAX_EXPR_DEPTH=10000

Haikuports
  COLUMN_METADATA
  UNLOCK_NOTIFY
  SECURE_DELETE
  FTS5

msys2
  COLUMN_METADATA
  DISABLE_DIRSYNC
  SOUNDEX
  PREUPDATE_HOOK
  SESSION
  (They also configure --enable-all)

Category: misc – Tags: sqlite


SQLite Full Text Search added to APSW

TLDR: APSW now has comprehensive support for SQLite Full Text Search.

SQLite has a nice full text search implementation named FTS5. (Yes, the older FTS3 and 4 also exist.)

Full text search engines are in theory fairly simple - break content into terms (typically words), index where in the content the terms exist, and answer a query's terms by consulting that index. Add in a way of ranking the results, typically by how rare each term is overall and how well represented it is in an item of content. (The five decade old BM25 can do that.) Make the query language a bit more complex so you have AND, OR, and NOT. Perhaps a NEAR, and a way of including or excluding columns.
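
To illustrate the ranking idea, here is a rough Python sketch of a common textbook formulation of BM25 (not FTS5's exact implementation; documents are represented as lists of terms):

import math

def bm25(query_terms, doc, docs, k1=1.2, b=0.75):
    # rare terms count for more (idf), terms frequent in this document
    # count for more (tf), damped by the document's length relative to
    # the average so long documents don't automatically win
    avgdl = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for term in query_terms:
        containing = sum(1 for d in docs if term in d)
        idf = math.log(1 + (len(docs) - containing + 0.5) / (containing + 0.5))
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score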

FTS5 does all that very well, and is fast. It offers a C API for writing your own tokenizers (converting content into terms/words), and for writing your own auxiliary functions, such as for ranking or showing highlights. Several months ago I decided to wrap those APIs in APSW, figuring it would take a few weeks at most. (C code requires lots of testing.) It did of course take several months, but the result is something I am very proud of.

The tokenizer API is a little confusing. It is how object oriented C is done, where you have structures of methods, passing this as the first parameter. I did get confused over what the analogues of classes and instances were, and had to revise my first implementation.

The FTS5 tokenizer is based on Unicode 6.1 (2012) and stays that way for stability. Unicode is now at version 16, with annual updates. Python has the unicodedata module, which is more up to date, so I implemented a tokenizer using that as the data source.

I soon discovered that it wasn't really workable, for two reasons. The first is that splitting text into words by whitespace doesn't work. Some popular languages like Chinese and Japanese don't use spaces, some use spaces between syllables, some use other schemes, and even English gets complicated when you have punctuation - is "don't" one word or two? What about "run-down"?

The second is that operating on each codepoint separately breaks apart what are single user-perceived characters (aka grapheme clusters). You can have a codepoint plus one or more combining accents and marks. It was especially noticeable with emoji, where 🤦🏼‍♂️ is actually 5 different codepoints.
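
You can see this from Python, where the escape sequences spell out that exact emoji:

import unicodedata

facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # 🤦🏼‍♂️
print(len(facepalm))  # 5 codepoints, one user perceived character
for c in facepalm:
    print(f"U+{ord(c):04X}", unicodedata.name(c))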

That led me to Unicode Technical Report #29, all about text segmentation. It tells you how to correctly break text into grapheme clusters, words, and sentences. The FTS5 snippet function tries to determine sentences behind the scenes, so all three would be useful. I also have a function to format query tables that needs to break text into lines. For example, in English you don't want to break between a word and a comma immediately following it, but rather after the comma. The Unicode Technical Report #14 line breaking algorithm addresses that. There were no practical Python libraries implementing all of these, so I went off and joined the very small club who have implemented all of TR29 and TR14.

With that lengthy diversion over, I was able to implement a UnicodeWords tokenizer. It performs really well no matter the language, punctuation, or (lack of) whitespace. Just kidding - I had yet another diversion first. I needed Unicode information way beyond what is in the unicodedata module. For example, I needed to handle extended pictographics (emoji and similar) correctly. I needed to be able to strip accents, diacritics, and combining marks. To get my formatted query tables correctly aligned on the terminal I needed width information. And it bugged me that unicodedata lags the Unicode standard by years. I stay up to date with SQLite, so why shouldn't I stay up to date with Unicode? That resulted in apsw.unicode, and now UnicodeWords does well.

Then it was off to wrap the auxiliary function API. That was straightforward, although it has a lot of terminology, with my feedback resulting in the overview being added.

At this point all the pieces were in place for anyone to use APSW to get the most out of FTS5 from Python. But you had to build everything yourself, so I would have to document how you do things like stop words, stemming, or synonyms. Each is only in the region of 10 lines of code, which was confirmation that the API was good, and it was easier to just go ahead and implement them.
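
To show why they are so short, a stop words filter boils down to something like this (the span tuple shape is invented for illustration, not APSW's actual tokenizer signature):

STOP_WORDS = {"the", "a", "an", "and", "or", "of"}

def without_stop_words(tokens):
    # tokens is an iterable of (start, end, token) spans produced by
    # an underlying tokenizer - drop the terms nobody searches for
    for start, end, token in tokens:
        if token.lower() not in STOP_WORDS:
            yield start, end, token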

The builtin trigram tokenizer was not aware of grapheme clusters, so I also had to add an ngram tokenizer. (It irritates me how some sites require typing 3 characters before they do completions, so I made sure any value can be used.) I also ended up adding library code to deal with the argument parsing that all the tokenizers needed, and code to deal with stripping accents and case folding on behalf of any other tokenizer, so the functionality didn't have to be built into each individually.

The SQLite website's own search uses its HTML documentation as what is fed to FTS5. There is no provided HTML tokenizer, and I saw several questions on the forum about indexing HTML content. Python has an HTML parser, so I used that for the parsing, but it was also important to map the tokens back to UTF-8 offsets in the source document. This gets fun with entity and character references, where there is no one-to-one mapping of offsets. That led to more C code to manage offset mapping, as pure Python code was too slow. Some people on the forum also wanted a JSON tokenizer, so I made one of those too.
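
As a much simplified sketch of the parsing side, using Python's stdlib html.parser (the real tokenizer also has to track those UTF-8 offsets, which this does not):

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # accumulates just the text content, with entity and character
    # references like &eacute; already decoded by convert_charrefs
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def text_of(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

print(text_of("<p>caf&eacute; &amp; bar</p>"))  # café & bar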

I wanted a Python class to wrap an FTS5 table, especially so you could call methods and access properties rather than constructing SQL. It became necessary to parse the SQL making up the virtual table declaration, such as getting the column names and the various options. That required parsing code, especially as there are many ways quoting can be done. The tokenize argument is a quoted string containing quoted strings, so it gets hairy. (Far more quoting options work than the documentation says.)

I also wanted functionality to do query suggestion - if you make a spelling mistake in a query then it should suggest a better alternative. Even a correct spelling could be a rare one, where something more popular is appropriate. Implementing that required being able to parse, manipulate, and create queries, which meant dealing with the finer points of the FTS5 query grammar, implicit AND, and whether there should be a PHRASES type (answer: no). It was then possible, using the token information, to implement query suggestion, which was fantastic to then use.

I had one remaining item on my wish list - more_like, where given some existing rows, you can come up with additional ones that are like those. This makes it easy to implement infinite scrolling - just more_like what you have already shown each time. It was also great when testing on a large recipe database - given the ingredients from one recipe, it would suggest similar ones.

The query suggestion and more_like are purely statistical - they have no understanding of meaning, such as jogger being similar to runner. But they still work well enough. Solutions that do have that understanding require you to start measuring in gigabytes, which goes against the Lite part of SQLite.

I also had to write documentation, with a lot of things to cover. The final result when printed would be about 50 pages. Various reading time tools estimate about 50 minutes to read. I had to read it all multiple times for proofreading and editing.

Now that all is done, I'll provide some statistics. Measuring lines of code is somewhat meaningful, being correlated with development time, complexity, number of bugs, testing effort etc. It is also as silly as measuring a painting by weight. Items marked 🆕 are what I had to write as described above, while the others provide context.

Lines of code   Description
2,100           🆕 Python test code for FTS5 and Unicode functionality
8,400           Python test code for everything else
1,100           🆕 C code to glue the FTS5 API to the Python C API
1,200           C code to glue SQLite prepared statements and execution to the Python C API
1,750           C code to glue SQLite virtual tables to the Python C API
2,000           🆕 C code to implement TR29 text segmentation, TR14 line breaking, case folding, accent stripping, UTF8 to str offset mapping, and various related functionality
2,250           C code to glue the SQLite virtual file system (VFS) to the Python C API
1,100           🆕 Python code to provide an API to the unicode functionality, text wrapping, and similar, plus a command line tool
860             🆕 Python code for wrapping FTS5 tables, all the tokenizers, argument parsing, and a command line tool
1,200           🆕 Python tooling code to download and parse Unicode data files and build the tables in the source, which aren't included above

Category: misc – Tags: unicode, python, apsw, sqlite
