
SQLite Full Text Search added to APSW

TLDR: APSW now has comprehensive support for SQLite Full Text Search.

SQLite has a nice full text search implementation named FTS5. (Yes, the older FTS3 and FTS4 also exist.)

Full text search engines are in theory fairly simple: break content into terms (typically words), index where in the content each term occurs, and answer a query's terms by consulting that index. Add in a way of ranking the results, typically by how rare each term is overall and how well represented it is in an item of content. (The decades-old BM25 ranking function can do that.) Make the query language a bit more complex so you have AND, OR, and NOT. Perhaps a NEAR, and a way of including or excluding columns.
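To make that concrete, here is a toy sketch of the idea in plain Python (a minimal inverted index, nothing to do with how FTS5 actually implements it):

    # Toy inverted index: term -> set of document ids containing that term
    from collections import defaultdict

    docs = {
        1: "cheap flights to tokyo",
        2: "tokyo ramen guide",
        3: "cheap ramen recipes",
    }

    index = defaultdict(set)
    for docid, text in docs.items():
        for term in text.split():     # a real tokenizer is far smarter than split()
            index[term].add(docid)

    def search_and(*terms):
        """AND query: intersect the posting sets. OR would union, NOT would subtract."""
        results = set(docs)
        for term in terms:
            results &= index.get(term, set())
        return results

    print(search_and("cheap", "ramen"))   # {3}
    # Ranking (eg BM25) would then score each result by how rare the query terms
    # are overall and how well represented they are in that document.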

FTS5 does all that very well, and is fast. It offers a C API for writing your own tokenizers (converting content into terms/words), and for writing your own auxiliary functions, such as for ranking or showing highlights. Several months ago I decided to wrap those APIs in APSW, figuring it would take a few weeks at most. (C code requires lots of testing.) It did of course take several months, but the result is something I am very proud of.

The tokenizer API is a little confusing. It is how object oriented code is done in C: structures of methods, with this passed as the first parameter. I got confused over what the analogues of classes and instances were, and had to revise my first implementation.

The builtin FTS5 unicode61 tokenizer is based on Unicode 6.1 (2012) and stays that way for stability. Unicode is now at version 16, with annual updates. Python has the unicodedata module which is more up to date, so I implemented a tokenizer using that as the data source.

I soon discovered that it wasn't really workable for two reasons. The first is that splitting text into words by whitespace doesn't work. Some popular languages like Chinese and Japanese don't use spaces, some use spaces between syllables, some use other schemes, and even English gets complicated when you have punctuation - is don't one or two words? What about run-down?
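For example:

    print("The quick brown fox".split())   # ['The', 'quick', 'brown', 'fox']
    print("東京都に住んでいます".split())   # ['東京都に住んでいます'] - no spaces, so one "word"
    print("don't stop run-down".split())   # ["don't", 'stop', 'run-down'] - but how many words are those?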

The second is that operating on each codepoint separately breaks apart what are single user perceived characters (aka grapheme clusters). You can have a codepoint and one or more combining accents and marks. It was especially noticeable with emoji where 🤦🏼‍♂️ is actually 5 different codepoints.
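You can see this directly from Python (the names come straight out of the unicodedata module):

    import unicodedata

    emoji = "🤦🏼‍♂️"
    print(len(emoji))    # 5 - five codepoints, one user perceived character
    for c in emoji:
        print(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>"))
    # FACE PALM, a skin tone modifier, ZERO WIDTH JOINER, MALE SIGN,
    # VARIATION SELECTOR-16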

That led me to Unicode Technical Report #29, all about text segmentation. It tells you how to correctly break text into grapheme clusters, words, and sentences. The FTS5 snippet function tries to determine sentences behind the scenes, so all three would be useful. I also have a function to format query tables that needs to break text into lines. For example in English you don't want to break between a word and a comma immediately following it, but rather after the comma. The Unicode Technical Report #14 line breaking algorithm addresses that. There were no practical Python libraries implementing all of these, so I went off and joined the very small club who have implemented all of TR29 and TR14.

With that lengthy diversion over I was able to implement a UnicodeWords tokenizer. It performs really well no matter the language, punctuation, or (lack of) whitespace. Just kidding - I had yet another diversion. I needed Unicode information way beyond what is in the unicodedata module. For example I needed to handle extended pictographics (emoji and similar) correctly. I needed to be able to strip accents, diacritics, and combining marks. To get my formatted query tables correctly aligned on the terminal I needed width information. And it bugged me that unicodedata lags the Unicode standard by years. I stay up to date with SQLite, so why shouldn't I stay up to date with Unicode? That resulted in apsw.unicode, and now UnicodeWords works well.
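As an example of one of those pieces, stripping accents and combining marks can be sketched with just the standard library (this is the general decompose-and-drop-marks approach, not the apsw.unicode implementation):

    import unicodedata

    def strip_marks(text):
        # Decompose precomposed characters, then drop the combining marks
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    print(strip_marks("Crème brûlée"))   # Creme brulee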

Then it was off to wrap the auxiliary function API. That was straightforward, although it has a lot of terminology, with my feedback resulting in the overview being added.

At this point all the pieces were in place for anyone to use APSW to get the most out of FTS5 from Python. But you had to build everything yourself, and I would have had to document how you do things like stop words, stemming, or synonyms. They are only in the region of 10 lines of code each, which was confirmation that the API was good, and it was easier to just go ahead and implement them.
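The logic really is tiny. A sketch of the shape of it, written as plain functions over a token stream rather than the actual APSW tokenizer interface:

    # Illustrative word lists, not anything shipped with APSW
    STOP_WORDS = {"the", "a", "an", "and", "of"}
    SYNONYMS = {"colour": ["color"], "first": ["1st"]}

    def without_stop_words(tokens):
        # Drop tokens that carry no search value
        return (t for t in tokens if t not in STOP_WORDS)

    def with_synonyms(tokens):
        # Emit each token plus any alternate spellings
        for t in tokens:
            yield t
            yield from SYNONYMS.get(t, ())

    tokens = ["the", "colour", "of", "magic"]
    print(list(with_synonyms(without_stop_words(tokens))))
    # ['colour', 'color', 'magic']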

The builtin trigram tokenizer was not aware of grapheme clusters, so I also had to add an ngram tokenizer. (It irritates me how some sites require typing 3 characters before they do completions, so I made sure any value can be used.) I also ended up adding library code to deal with the argument parsing that all the tokenizers needed, and another to deal with stripping accents and case folding on behalf of any other tokenizer, so that functionality didn't have to be built into each one individually.
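For illustration, here is a rough sketch of grapheme-aware n-grams. It leans on the third-party regex module's \X (grapheme cluster) pattern as a stand-in for proper TR29 segmentation, and is not the APSW ngram tokenizer itself:

    import regex   # third-party; the stdlib re module has no \X

    def ngrams(text, n=3):
        # Build n-grams from grapheme clusters, not individual codepoints
        clusters = regex.findall(r"\X", text)
        for i in range(max(len(clusters) - n + 1, 1)):
            yield "".join(clusters[i:i + n])

    print(list(ngrams("ete\u0301", 2)))   # ['et', 'té'] - the accent stays attached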

The SQLite website's own search feeds its HTML documentation to FTS5. There is no provided HTML tokenizer, and I saw several questions on the forum about indexing HTML content. Python has an HTML parser so I used that for the parsing, but it was also important to map the tokens back to their UTF8 offsets in the source document. This gets fun with entity and character references, where there is no one to one mapping of offsets. That led to more C code to manage offset mapping, as pure Python code was too slow. Some people on the forum also wanted JSON tokenizers, so I made one of those too.
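The parsing side really is straightforward with the standard library; the hard part, mapping each piece of text back to its UTF8 offset in the source, is not shown in this sketch:

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        def __init__(self):
            super().__init__(convert_charrefs=True)   # &amp; &eacute; etc become text
            self.pieces = []

        def handle_data(self, data):
            # Collect the text content, ignoring tags
            self.pieces.append(data)

    extractor = TextExtractor()
    extractor.feed("<p>Caf&eacute; &amp; <b>bar</b></p>")
    print("".join(extractor.pieces))   # Café & bar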

I wanted a Python class to wrap an FTS5 table, especially so you could call methods and access properties rather than constructing SQL. It became necessary to parse the SQL making up the virtual table declaration, such as getting the column names and the various options. That required parsing code, especially as there are many ways quoting can be done. The tokenize argument is a quoted string containing quoted strings, so it gets hairy. (Far more quoting options work than the documentation says.)

I also wanted functionality to do query suggestion - if you make a spelling mistake in a query then it should suggest a better alternative. Even a correctly spelled word could be a rare spelling where something more popular is appropriate. Implementing that required being able to parse, manipulate, and create queries. That meant dealing with the finer points of the FTS5 query grammar, implicit AND, and whether you should have a PHRASES type (answer: no). It was then possible, using the token information, to implement query suggestion, which was fantastic to use.

I had one remaining item on my wish list - more_like, where given some existing rows you can come up with additional ones that are like them. This makes it easy to implement infinite scrolling - just more_like what you have already shown each time. It was also great when testing on a large recipe database - given the ingredients from one recipe it would suggest similar ones.

The query suggestion and more_like are purely statistical - they have no understanding of meaning, such as jogger being similar to runner. But they still work well enough. Solutions that do have that understanding require you to start measuring in gigabytes, which goes against the Lite part of SQLite.

I also had to write documentation, with a lot of things to cover. The final result when printed would be about 50 pages. Various reading time tools estimate about 50 minutes to read. I had to read it all multiple times for proofreading and editing.

Now that it is all done, here are some statistics. Measuring lines of code is somewhat meaningful, being correlated with development time, complexity, number of bugs, testing effort etc. It is also as silly as measuring a painting by weight. Items marked 🆕 are what I had to write as described above, while the others provide context.

Lines of code  Description
2,100  🆕 Python test code for FTS5 and Unicode functionality
8,400  Python test code for everything else
1,100  🆕 C code to glue the FTS5 API to the Python C API
1,200  C code to glue SQLite prepared statements and execution to the Python C API
1,750  C code to glue SQLite virtual tables to the Python C API
2,000  🆕 C code to implement TR29 text segmentation, TR14 line breaking, case folding, accent stripping, UTF8 to str offset mapping, and various related functionality
2,250  C code to glue the SQLite virtual file system (VFS) to the Python C API
1,100  🆕 Python code to provide an API to the Unicode functionality, text wrapping, and similar, plus a command line tool
  860  🆕 Python code for wrapping FTS5 tables, all the tokenizers, argument parsing, and a command line tool
1,200  🆕 Python tooling code to download and parse the Unicode data files and build the tables in the source, which aren't included above



A small club: Implementing Unicode TR29 and TR14

I recently implemented Unicode Technical Report #29 which covers how to turn "text" into "letters", words, and sentences. I also ended up implementing Technical Report #14 which covers line breaking - good places to start text on the next line when the current line is getting too long.

My guess is that fewer than 10 developers have fully implemented both of these, and I have now joined that very small club.

Update, November 2024: Unicode 16.0 came out - see the end for my thoughts.

How hard could it be?

Unicode aims to have one way of representing human text for computers. Before its widespread adoption, different computer systems and even programs had completely different and very incompatible ways of representing text, and it was a mess to read, write, print, copy, and backup text.

This is an extraordinarily hard problem to solve. Are upper and lower case A the same letter? What about bold, and italic? What about rotated? Fonts? Is ß a different letter or just a different way of drawing two lower case s? Ligatures? Is a lower case e in French without an accent the same as e in English? Is the Roman numeral ⅰ different than the letter i? That barely scratches the surface of Western Europe, and now throw in the whole world with a rich diversity and history of writing. Thankfully experts have been working on this for several decades.

Codepoints

Unicode assigned a number (codepoint) per "letter". That doesn't work on its own because some writing systems use a base letter (eg a consonant) and then add marks to it (eg vowels), and having one codepoint for each combination would require a very large number of codepoints.

This led to combining codepoints that modify a base, allowing accents, vowels, and other marks to be added to represent human writing.
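For example, in Python:

    import unicodedata

    decomposed = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE (é)
    print(len(decomposed), len(precomposed))                        # 2 1
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True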

The fateful decision

There were two possible ways of representing these values - for example letter-e and combining-acute-accent

Modifier first

combining-acute-accent then letter-e

Base first

letter-e then combining-acute-accent

If modifier first had been chosen, then each letter as perceived by a human (ie with all the modifiers applied) would be easy to determine by computer code - accumulate modifiers until you find a base. You now have one user perceived letter (aka grapheme cluster).

Base first was chosen. That means code sees a codepoint, but has to keep looking ahead to see if there are more modifiers that apply to the base. It gets way more complicated with zero width joiners, variation selectors, regional indicators, and Indic syllabic categories to name a few.

The consequences

Because of the complexity, most programming languages just pretend that each codepoint is one "letter" that takes one unit of storage, and often do worse (well worth a read), conflating byte encoding, grapheme clusters, and indexing. They then require third party libraries to get words, sentences, and line breaking.

I've been adding support for SQLite's full text search to APSW (my Python SQLite wrapper) and it became very clear very quickly that doing things properly was necessary.

SQLite has a unicode61 tokenizer, but it works on individual codepoints and their categories using a version of Unicode from 2012 (to remain stable, and Lite). It doesn't implement any rules for words other than whitespace and punctuation, which makes it unsuitable for languages used by billions of people. It also doesn't handle regional indicators, variation selectors, zero width joiners, some emoji, etc.

There is the standard International Components for Unicode (ICU) library, which deals with all the Unicode rules, text segmentation, locale specific customisation, message generation, and localization. It is available as a 5MB Python binding to the 5MB of code and 30MB of tables making up ICU. ICU isn't installed on every platform, and when it is you have no control over the version.

The ICU solution is also impractical to distribute for cross platform libraries like APSW, so I decided to implement the rules. My final self contained library was 500kb of code and data with no platform dependencies.

Lessons learned

There are three sets of rules covering grapheme clusters, words, and sentences, and another set for line breaking.

I ended up rewriting/refactoring my code six times, learning a lot as a result.

No errors

It isn't immediately apparent, but there is no sequence of codepoints that can be considered an error. No matter how absurd a sequence is supplied, the code has to handle it, even when multiple rules apply and contradict each other.

Test data

Unicode publishes test data of around 2,000 tests each for grapheme cluster, word, and sentence breaks. This isn't perfect as I had cases of bugs being present while still passing all the tests. Without those tests there is zero chance I could have written correct code. I was often puzzling over test failures and the list of rules to work out exactly which rules were intended to apply. I wish they would add even more tests with longer sequences and even more combinations that hit multiple rules.

Break points

Break points turned out to be an excellent way of determining boundary locations. You do need to do an analysis step on the resulting segments - eg does a segment contain letters or numerals if you are looking for words - as in the sketch below.
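A toy sketch of that shape, where word_breaks() is a naive stand-in rather than the real TR29 rules:

    import unicodedata

    def word_breaks(text):
        # Stand-in: candidate break offsets; the real code applies the TR29 rules
        breaks = [0]
        for i in range(1, len(text)):
            if text[i].isspace() != text[i - 1].isspace():
                breaks.append(i)
        breaks.append(len(text))
        return breaks

    def words(text):
        b = word_breaks(text)
        for start, end in zip(b, b[1:]):
            segment = text[start:end]
            # Analysis step: only keep segments containing letters or numerals
            if any(unicodedata.category(c)[0] in ("L", "N") for c in segment):
                yield start, end, segment

    print(list(words("Hello, world 123!")))
    # [(0, 6, 'Hello,'), (7, 12, 'world'), (13, 17, '123!')]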

State machines

It looks like you can use a state machine, and that is what you see in the Rust implementation as well as the Python grapheme package (abandoned?). It is possible to do so for grapheme clusters, but not for words and sentences. Adding some adaptive code could perhaps work. In the end I found that you end up with far too many states (word segmentation has 25 different rules), and that rules can contradict. For example you should break after LF but not before ZWJ, so what happens if they occur in sequence? Adaptive code ends up having to do look ahead and look behind (either reversing rules or adding more state).

The biggest problem with a state machine is it becomes very hard to map the code and states back to the series of rules. And as the rules get adjusted over time, it will become harder to merge the changes in.

What worked for me

  • I did my initial exploration in Python. This was far quicker to iterate, debug, update data structures etc. At the end I then converted to C for significantly improved performance, but starting in C would have taken considerably longer.

  • The code matches the rules in the same order and expressed the same way as the specification. This makes it easier to verify they match, and allows for easier changes as rules change over time. (Complex state machines make that a lot harder.)

  • Use look ahead for matching (I did briefly try look behind). You can mark your current spot, and then advance, trying to match a rule on succeeding codepoints. If a rule match fails, revert back to the mark and try the next rule. This is in theory less efficient, but in practice little real world text triggers much backtracking. (See the sketch after this list.)

  • I had to implement more Unicode tables. Some rules want information outside of that specified in the TR14/29 tables. For example the rules want to know if a codepoint is extended pictographic, or in a particular category. While some of the information may be available separately in other platform libraries, it may not be the same Unicode version.

  • The tables and rules look like they map onto enumerations. However they work far better as a bitset. Most tables fit within 32 bits, although some require 64.

    You can combine all the necessary data into a single bitset. For example the grapheme cluster rules also want to know the Indic syllabic categories, which is a separate table. I could combine that information with the grapheme cluster table in the same bitset, avoiding a separate lookup.

    Bitsets also allow code like the following:

    if char & (L | V | LV | LVT): ...
    

    and:

    MidNumLetQ  = (MidNumLet | Single_Quote)
    
    if char & MidNumLetQ: ...
    

    ... instead of:

    if char == L or char == V or char == LV or ...
    
  • The various tables end up as generated code. (Every project I looked at used a Python based tool to do that.) I put a comment block before each table giving statistics on how popular each value was, how many lines there were in the table, and other similar information. This made it a lot easier to get an overview - the tables have thousands of rows - and also helped verify that code changes had the expected effect.

  • It was easy to add other tables. Because I also have terminal output I could add width information, and even a table for which Unicode version a codepoint was added.
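Here is a toy sketch of the mark / look-ahead / revert approach described in the list above. It is not the APSW implementation; the rule names and properties are simplified for illustration:

    # Each rule looks ahead from a position over a list of codepoint properties.
    # It returns the new position on success, or None meaning "revert to the
    # mark and try the next rule".

    def rule_cr_lf(props, pos):
        # eg GB3: keep CR LF together as one unit (no break inside)
        if props[pos] == "CR" and pos + 1 < len(props) and props[pos + 1] == "LF":
            return pos + 2
        return None

    def rule_any(props, pos):
        # Fallback: consume a single codepoint
        return pos + 1

    RULES = [rule_cr_lf, rule_any]   # tried in specification order

    def segments(props):
        pos = 0
        while pos < len(props):
            mark = pos                       # remember where we started
            for rule in RULES:
                new_pos = rule(props, mark)  # look ahead from the mark
                if new_pos is not None:
                    pos = new_pos            # rule matched: accept its look ahead
                    break
                # rule failed: we are still at the mark, try the next rule
            yield (mark, pos)

    print(list(segments(["A", "CR", "LF", "B"])))   # [(0, 1), (1, 3), (3, 4)]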

Performance

Performance matters because these are used for full text indexing, which could be hundreds of megabytes if not gigabytes of text. There is no point in doing a C Python extension unless it performs well. So here is a benchmark. The source text is the UN Declaration of Human Rights which has been translated into 300 languages. All 300 are concatenated together producing a 5.5MB file. 60 codepoints used in many of the TR14/29 tests are combined and repeatedly shuffled with the source text, and appended until there are 50 million codepoints.

For each library, how long it takes to process that text and return each segment is measured, producing a codepoints per second result. Note that this is measuring both how long it takes to find each break location, as well as how long Python takes to return that substring/segment of text.

ie more codepoints per second is better.

Benchmarking apsw.unicode         unicode version 15.1
grapheme codepoints per second:    5,122,888    segments:  49,155,642
    word codepoints per second:   13,953,370    segments:   9,834,540
sentence codepoints per second:   93,699,093    segments:     909,603
    line codepoints per second:   16,840,718    segments:   9,548,987

Benchmarking uniseg               unicode version 15.0.0
grapheme codepoints per second:      567,514    segments:  49,149,753
    word codepoints per second:      570,943    segments:  23,617,670
sentence        EXCEPTION KeyError('OTHER')
    line codepoints per second:      528,677    segments:   9,545,722

Benchmarking grapheme             unicode version 13.0.0
grapheme codepoints per second:    1,777,689    segments:  49,163,151

Benchmarking pyicu                unicode version 15.1
grapheme codepoints per second:    3,778,235    segments:  49,155,642
    word codepoints per second:    8,375,000    segments:  20,050,729
sentence codepoints per second:   98,976,577    segments:     910,041
    line codepoints per second:   15,864,217    segments:   9,531,173

Here are the size measurements adding together Python bytecode and any stripped shared libraries.

 32kb    grapheme
144kb    uniseg
600kb    apsw.unicode (can be 200kb smaller)
 40mb    pyicu

Table unrolling

The tables end up with each row containing three integer fields:

  • Start codepoint
  • End codepoint
  • Category

Everybody stores them sorted, and then uses a binary search to find the row matching a codepoint. There are 2 to 4 thousand rows in each table meaning up to 12 comparisons are needed to find a particular row, which is negligible to a CPU.

The contents of the tables are known at compile time, and do not change. That led me to thinking that instead of generating tables, I could generate the comparisons. (This is similar to loop unrolling.)
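As a sketch of the idea (the names, toy data, and output format are illustrative, not the actual APSW generation tool), a small Python script can emit a binary-search-shaped chain of comparisons as C source from the sorted rows:

    ROWS = [  # (start codepoint, end codepoint, category) - toy data
        (0x0041, 0x005A, "LETTER"),
        (0x0061, 0x007A, "LETTER"),
        (0x0030, 0x0039, "DIGIT"),
    ]

    def emit(rows, indent="  "):
        # Recursively split the sorted rows, printing nested if/else comparisons
        rows = sorted(rows)
        if len(rows) == 1:
            start, end, cat = rows[0]
            print(f"{indent}if (0x{start:04X} <= cp && cp <= 0x{end:04X}) return {cat};")
            print(f"{indent}return OTHER;")
            return
        mid = len(rows) // 2
        print(f"{indent}if (cp < 0x{rows[mid][0]:04X}) {{")
        emit(rows[:mid], indent + "  ")
        print(f"{indent}}} else {{")
        emit(rows[mid:], indent + "  ")
        print(f"{indent}}}")

    print("int category(unsigned int cp) {")
    emit(ROWS)
    print("}")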

The resulting compiled code turned out to be smaller than the corresponding data tables, so I went with that. (It is even branch predictor friendly, which tables are not.) The densest compilation results in code about 60% the size of the tables, while 16 byte aligned code is around 95% (but with lots of padding nops).

msvc defaults to densest and shaves 200kb off the binary size, while clang and gcc both do alignment. Examining the assembly code showed that it matched the comparisons.

Unicode 16 update

Unicode 16 came out in September, and I eagerly updated to it. The updates to codepoint tables were all easily handled. There were no rule changes to TR29 segmentation, and all tests passed.

TR14 line breaking did have rule changes, which were fairly painful to implement in my C code. The compile-test-debug cycle is far slower and less ergonomic than almost anything else. The code is analogous to implementing regular expressions by hand while doing lookahead and backtracking, and you often need to figure out which rules did match but shouldn't have, or vice versa.

That leads me to one big lesson learned/suggestion for anyone else implementing the TR14 & TR29 rules:

Do not write code directly - always use an intermediate representation that you generate code from, or interpret directly at runtime.

It will make the development process a lot more pleasant, but also allows tailoring the rules for different locales.

I should have realised this originally, but my Python implementation was productive enough that I didn't. The Unicode consortium developers figured it out: one intermediate representation is used for their C/Java library, while another is used for their Rust library. I am amused that they use completely different syntaxes.

