Design for 0.3

FT3 is based around the following ideas:

Handle both small and large text databases. By large we means something wich fits well on a commodity HW. FT3 is tested and runs nicely on 10GiB text database. After that, a divide and conquer strategy has been used. It is not a goal to tackle TiB databases
Two typical usage pattern are expected: either the number of search is on the same order as the write pattern in the database either the number of search if a lot larger than the write pattern. We try to adress both set of constraints.
Respect cpu and disk space of host database, indexer can be quite cpu hungry.
Use sqlite as much as possible, all datas stored inside one or more external sqlite3 files. Removing full text index have to be straitforward
Be incremental at least partly. In my experience a B-Tree is not adapted for maintaining inverted indexes or at least it doesn't scale well. We use 2 indexes: a fresh one always current which is basically a SQLite table with a btree index. A consolidated index which is a stored as a potentially large BLOB per word. Due to the lack of a streaming interface for BLOB in current version of sqlite, you need to have enough RAM to hold the BLOBs for the words you have requested. This is usuallly OK but ... Since we have 2 indexes, FT3 provides a way to merge them.
FT3 uses triggers to maintain a journal of the CRUD operations on the table where it maintains text indexes. This is not perfect and some operations may be missed, like a ALTER TABLE which are currently not handle.
The user/administrator is responsible for lunching indexer/merger processus. I didn't want it to be synchronous with CRUD operations inside the database. If you want you can put the indexer on a crontab and lunch it every minutes or so.
In order to handle a lot of requests, you just have to copy the binary compressed indexes stored into a file to a HW farm. It scales very well, at least to some extent; sqlite use nicely all memory you allocate to it and remember that memory is cheap now.
Provides to the user classical algorithms and an easy way to add new ones. Provides support for simple stemming, stopwords, classical web syntax
Be available both as a library with a stable API and as binaries
It would be nice to be intregrated at SQL level inside the database but it is currently not easy to do. I expect that the new proposal for text search inside sqlite will help in this regard. (I expect to create an interface via virtual tables introduced in sqlite 3.3.7
Be database independant: currently it is a complete failure. FT3 is tied to sqlite for the storage backend. It will be easily to have it index other databases (this will be the focus of 0.4 version, I expect to support PostgreSQL, MySQL and ODBC via the SOCI library).

Algorithm

TFIDF

Link to classical description of TFIDF (Aka Term Frequency over Inverse Document Frequency)

Proximity scores

Proximity scores are used to rank higher a document with all words from query in a close range.

This scores are computed and activated on ranking function if the a priori estimate that the words are correlated is high enough.

Links and hierarchy analysis

FT3 can takes into account a hierarchy of documents (like in a directory) and can favour documents near the root

In the same spirit, internal links inside the corpus are consolidated, this is not an automatic process, thus some coding is necessary to have it working nicely. An example is available on wikipedia database dump where inner links [[...]] and [...] are taken into account

Default ranking

After computing intersections between words index, you have to rank documents. We propose a simple default ranker. I guess you have to manually tunes the parameters for your corpus to get a better ranking. It is obviously corpus dependant.

Current score is computing like this:

Indexation score: for each word we store a tfidf score and a ranking score which is related links hierarchy analyses.

Search score: is a multiplication between indexation score, proximity scores and popularity score. Parameters are available to tune the function according to your corpus and personal taste.

Commented SQL schema

Every table start with a ft3_ prefix since sqlite3 does not support CREATE SCHEMA.
The schema version is used to detect old data format. Current version is 14.

CREATE TABLE ft3_version (
version INTEGER
);

INSERT INTO ft3_version VALUES ( 14 );

Words related tables

A stopword is a very frequent word which by default is not indexed stopwords of course depend of current language for example: words 'de' and 'la' in french or words 'a' or 'the' in english.

CREATE TABLE ft3_stopwords (
word TEXT NOT NULL
, lang CHAR(2) DEFAULT '??' NOT NULL
);

Frequentwords are stopwords or words like html, fr, de ... whatever is very frequent and do not gives information.

CREATE TABLE ft3_frequentwords (
word TEXT NOT NULL
, lang CHAR(2) DEFAULT '??' NOT NULL
);

Table of words found inside your corpus

word:: asciified version of word
density:: inverse document frequency for this word
wordcounter:: number of occurence of word
docscounter:: number of documents where the word is at least one time present CONSTRAINT docscounter <= wordcounter
metaphone:: double metaphone of word
stemmed:: stem of word (depending of language of word)

CREATE TABLE ft3_words (
      recid INTEGER PRIMARY KEY
,     word TEXT UNIQUE NOT NULL
,     density REAL DEFAULT 0.0 NOT NULL
,     wordcounter INTEGER DEFAULT 1 NOT NULL
,     docscounter INTEGER DEFAULT 1 NOT NULL
,     metaphone TEXT DEFAULT '+'
,     stemmed TEXT DEFAULT ''
);

Document related tables

lang:: language detected by the analyser, if it failed default is "e;??"e;
last:: last date of indexation
status:: if 200 then document has been indexed, if 404 document has been deleted
md5:: currently unused

CREATE TABLE ft3_docinfos (
      docid INTEGER UNIQUE
,     lang CHAR(2) DEFAULT '??' NOT NULL
,     last INTEGER DEFAULT 0 NOT NULL
,     status INTEGER DEFAULT 200 NOT NULL
,     md5 TEXT
);

Inverted index tables

A table holding an inverted index, for each words it gives a list of documents where the word is present. This is the main table of ft3. It can be very large and then it is splitted into a number of sub-tables according a hash on wordid

wordid:: recid of word into ft3_words
docid:: global id of document (global id is computed via a function from local oid inside your database table and a number which caracterise your table (call tablegroup) later)
scores:: scores of word into document, the higher the score, the better the match will be rank
tfidf:: term freq over inverse document freq classical mesure of relevance
window:: window number a window is a sequence of word into a text. usually we split the text into window according to punctuation or other typographic bundary
winsz:: size of current window
winoff:: position into current window
flag:: holds the columns type in each table where the word occured

CREATE TABLE ft3_scores0 (
      wordid INTEGER NOT NULL
,     docid INTEGER NOT NULL
,     scores REAL
,     tfidf REAL
,     window INTEGER
,     winsz INTEGER
,     winoff INTEGER
,     flag INTEGER
,     stemmed INTEGER
);

CREATE VIEW ft3_all_scores AS
      SELECT * FROM ft3_scores0
     UNION ALL
      SELECT * FROM ft3_scores1
     UNION ALL
      ...
     UNION ALL
      SELECT * FROM ft3_scores15
     ;

Co-occurence Matrix

A matrix of co-occurences if 2 words id1 and id2 appeared inside the same window, then cnt increase by 1 the matrix is NOT symmetric

word1id:: id of first word
word12d:: id of second word
cnt:: number of window where both words are present (1 before 2)
relation:: hold the association strentgh between 1 & 2, the higher the relation the higher the probability of finding 1 near 2 (both in queries and in answers)