Design for 0.4
FT3 is based around the following ideas:
- Handle both small and large text databases. By "large" we mean something which fits well on commodity hardware. FT3 is tested and runs nicely on a 10 GiB text database. Beyond that size, a divide and conquer strategy has been used. It is not a goal to tackle TiB databases.
- Two typical usage patterns are expected: either the number of searches is of the same order as the number of writes to the database, or the number of searches is a lot larger than the number of writes. We try to address both sets of constraints.
- Respect the CPU and disk space of the host database; the indexer can be quite CPU hungry.
- Use SQLite as much as possible: all data is stored inside one or more external sqlite3 files. Removing a full text index has to be straightforward.
- Be incremental, at least partly. In my experience a B-tree is not suited to maintaining inverted indexes, or at least it doesn't scale well. We use 2 indexes: a fresh one, always current, which is basically a SQLite table with a B-tree index, and a consolidated one, which is stored as a potentially large BLOB per word. Due to the lack of a streaming interface for BLOBs in the current version of SQLite, you need to have enough RAM to hold the BLOBs for the words you have requested. This is usually OK but ... Since we have 2 indexes, FT3 provides a way to merge them.
- FT3 uses triggers to maintain a journal of the CRUD operations on the tables for which it maintains text indexes. This is not perfect and some operations may be missed, such as ALTER TABLE, which is currently not handled.
- The user/administrator is responsible for launching the indexer/merger processes. I didn't want indexing to be synchronous with CRUD operations inside the database. If you want, you can put the indexer in a crontab and launch it every minute or so.
- In order to handle a lot of requests, you just have to copy the binary compressed indexes, which are stored in a file, to a hardware farm. It scales very well, at least to some extent; SQLite makes good use of all the memory you allocate to it, and remember that memory is cheap now.
- Provide the user with classical algorithms and an easy way to add new ones. Provide support for simple stemming, stopwords and classical web syntax.
- Be available both as a library with a stable API and as binaries
- Since version 3.3.7, SQLite provides virtual tables. FT3 will provide a module for creating and manipulating virtual tables. It seems easy to get:
  - a nice interface directly in SQL: the SELECT command will recognize FT3 tables
  - an interface to other services provided by FT3, such as similar words lookup
- Be database independent: currently this is a complete failure. FT3 is tied to SQLite for the storage backend. It will be easy to have it index other databases (this will be the focus of version 0.5; I expect to support PostgreSQL, MySQL and ODBC via the SOCI library).
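
To make the trigger-based journal idea more concrete, here is a minimal sketch using Python's standard sqlite3 module. It is not FT3's actual schema or code: the `ft_journal` table, the trigger names, and the helpers `install_journal` and `drain_journal` are hypothetical names chosen for illustration, and the journal is kept in the same database file for simplicity. It only shows how triggers can record inserts, updates and deletes so that an indexer launched later (from a crontab, for instance) can catch up asynchronously.

```python
import sqlite3


def install_journal(conn, table, text_column):
    """Create a journal table plus triggers that log changes made to `table`.

    All object names here (ft_journal, ft_<table>_ai, ...) are hypothetical.
    """
    conn.executescript(f"""
        CREATE TABLE IF NOT EXISTS ft_journal (
            seq    INTEGER PRIMARY KEY AUTOINCREMENT,
            tbl    TEXT NOT NULL,
            op     TEXT NOT NULL,      -- 'I', 'U' or 'D'
            row_id INTEGER NOT NULL
        );
        CREATE TRIGGER IF NOT EXISTS ft_{table}_ai AFTER INSERT ON {table}
        BEGIN
            INSERT INTO ft_journal (tbl, op, row_id)
            VALUES ('{table}', 'I', NEW.rowid);
        END;
        CREATE TRIGGER IF NOT EXISTS ft_{table}_au
        AFTER UPDATE OF {text_column} ON {table}
        BEGIN
            INSERT INTO ft_journal (tbl, op, row_id)
            VALUES ('{table}', 'U', NEW.rowid);
        END;
        CREATE TRIGGER IF NOT EXISTS ft_{table}_ad AFTER DELETE ON {table}
        BEGIN
            INSERT INTO ft_journal (tbl, op, row_id)
            VALUES ('{table}', 'D', OLD.rowid);
        END;
    """)


def drain_journal(conn):
    """Return pending (table, op, row_id) entries in order and clear them."""
    rows = conn.execute(
        "SELECT seq, tbl, op, row_id FROM ft_journal ORDER BY seq"
    ).fetchall()
    if rows:
        # Only delete what was read, so later entries are never lost.
        conn.execute("DELETE FROM ft_journal WHERE seq <= ?", (rows[-1][0],))
    conn.commit()
    return [(tbl, op, row_id) for _, tbl, op, row_id in rows]


# The application writes to its table as usual; the triggers do the logging.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
install_journal(conn, "docs", "body")
conn.execute("INSERT INTO docs (body) VALUES ('hello world')")
conn.execute("UPDATE docs SET body = 'hello sqlite' WHERE id = 1")
conn.execute("DELETE FROM docs WHERE id = 1")

# Later, the indexer process picks up the pending work.
pending = drain_journal(conn)
print(pending)
```

Schema changes such as ALTER TABLE do not fire row-level triggers, which is why a journal of this kind cannot see them.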
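
The two-index design (a fresh index in a plain table, a consolidated index with one BLOB per word, and a merge step between them) can be sketched as follows. This is an illustration under stated assumptions, not FT3's real storage format: the table names `fresh` and `consolidated`, the helper functions, the whitespace tokenizer, and the posting encoding (uncompressed fixed-width 64-bit integers, whereas FT3's indexes are compressed) are all made up for the example. Deletions are not handled.

```python
import sqlite3
import struct


def open_index(path=":memory:"):
    """Open an index file holding both the fresh and the consolidated index."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        -- Fresh index: one row per (word, document); the primary key
        -- gives SQLite a B-tree index to search on.
        CREATE TABLE IF NOT EXISTS fresh (
            word   TEXT NOT NULL,
            doc_id INTEGER NOT NULL,
            PRIMARY KEY (word, doc_id)
        );
        -- Consolidated index: one potentially large BLOB per word.
        CREATE TABLE IF NOT EXISTS consolidated (
            word     TEXT PRIMARY KEY,
            postings BLOB NOT NULL
        );
    """)
    return conn


def pack(doc_ids):
    """Encode document ids as little-endian 64-bit integers (no compression)."""
    return struct.pack(f"<{len(doc_ids)}q", *doc_ids)


def unpack(blob):
    """Decode a BLOB produced by pack(); the whole BLOB must fit in RAM."""
    return list(struct.unpack(f"<{len(blob) // 8}q", blob))


def add_document(conn, doc_id, text):
    """Index a document incrementally by adding rows to the fresh index."""
    words = set(text.lower().split())
    conn.executemany(
        "INSERT OR IGNORE INTO fresh (word, doc_id) VALUES (?, ?)",
        [(w, doc_id) for w in words],
    )


def lookup(conn, word):
    """Return sorted document ids for `word`, consulting both indexes."""
    row = conn.execute(
        "SELECT postings FROM consolidated WHERE word = ?", (word,)
    ).fetchone()
    ids = set(unpack(row[0])) if row else set()
    ids.update(
        r[0] for r in conn.execute(
            "SELECT doc_id FROM fresh WHERE word = ?", (word,))
    )
    return sorted(ids)


def merge(conn):
    """Fold the fresh index into the consolidated BLOBs, then empty it."""
    words = [r[0] for r in conn.execute("SELECT DISTINCT word FROM fresh")]
    for word in words:
        conn.execute(
            "INSERT OR REPLACE INTO consolidated (word, postings) VALUES (?, ?)",
            (word, pack(lookup(conn, word))),
        )
    conn.execute("DELETE FROM fresh")
    conn.commit()


index = open_index()
add_document(index, 1, "the quick brown fox")
add_document(index, 2, "the lazy dog")
merge(index)                       # documents 1 and 2 now live in BLOBs
add_document(index, 3, "a quick dog")
print(lookup(index, "quick"))      # [1, 3]: one hit from each index
```

Searches stay correct between merges because `lookup` reads both indexes; the merge only moves postings from the row-per-entry table into the per-word BLOBs.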