Design for 0.3
FT3 is based around the following ideas:
- Handle both small and large text databases. By large we means something wich fits well on a commodity HW. FT3 is tested and runs nicely on 10GiB text database. After that, a divide and conquer strategy has been used. It is not a goal to tackle TiB databases
- Two typical usage pattern are expected: either the number of search is on the same order as the write pattern in the database either the number of search if a lot larger than the write pattern. We try to adress both set of constraints.
- Respect cpu and disk space of host database, indexer can be quite cpu hungry.
- Use sqlite as much as possible, all datas stored inside one or more external sqlite3 files. Removing full text index have to be straitforward
- Be incremental at least partly. In my experience a B-Tree is not adapted for maintaining inverted indexes or at least it doesn't scale well. We use 2 indexes: a fresh one always current which is basically a SQLite table with a btree index. A consolidated index which is a stored as a potentially large BLOB per word. Due to the lack of a streaming interface for BLOB in current version of sqlite, you need to have enough RAM to hold the BLOBs for the words you have requested. This is usuallly OK but ... Since we have 2 indexes, FT3 provides a way to merge them.
- FT3 uses triggers to maintain a journal of the CRUD operations on the table where it maintains text indexes. This is not perfect and some operations may be missed, like a ALTER TABLE which are currently not handle.
- The user/administrator is responsible for lunching indexer/merger processus. I didn't want it to be synchronous with CRUD operations inside the database. If you want you can put the indexer on a crontab and lunch it every minutes or so.
- In order to handle a lot of requests, you just have to copy the binary compressed indexes stored into a file to a HW farm. It scales very well, at least to some extent; sqlite use nicely all memory you allocate to it and remember that memory is cheap now.
- Provides to the user classical algorithms and an easy way to add new ones. Provides support for simple stemming, stopwords, classical web syntax
- Be available both as a library with a stable API and as binaries
- It would be nice to be intregrated at SQL level inside the database but it is currently not easy to do. I expect that the new proposal for text search inside sqlite will help in this regard. (I expect to create an interface via virtual tables introduced in sqlite 3.3.7
- Be database independant: currently it is a complete failure. FT3 is tied to sqlite for the storage backend. It will be easily to have it index other databases (this will be the focus of 0.4 version, I expect to support PostgreSQL, MySQL and ODBC via the SOCI library).
Algorithm
TFIDF
Link to classical description of TFIDF (Aka Term Frequency over Inverse Document Frequency)
Proximity scores
Proximity scores are used to rank higher a document with all words from query in a close range.
This scores are computed and activated on ranking function if the a priori estimate that the words are correlated is high enough.
Links and hierarchy analysis
FT3 can takes into account a hierarchy of documents (like in a directory) and can favour documents near the root
In the same spirit, internal links inside the corpus are consolidated, this is not an automatic process, thus some coding is necessary to have it working nicely. An example is available on wikipedia database dump where inner links [[...]] and [...] are taken into account
Default ranking
After computing intersections between words index, you have to rank documents. We propose a simple default ranker. I guess you have to manually tunes the parameters for your corpus to get a better ranking. It is obviously corpus dependant.
Current score is computing like this:
Indexation score: for each word we store a tfidf score and a ranking score which is related links hierarchy analyses.
Search score: is a multiplication between indexation score, proximity scores and popularity score. Parameters are available to tune the function according to your corpus and personal taste.
Commented SQL schema
Every table start with aft3_prefix since sqlite3 does not support
CREATE SCHEMA.
The schema version is used to detect old data format. Current version is 14.
CREATE TABLE ft3_version (
version INTEGER
);
INSERT INTO ft3_version VALUES ( 14 );
Words related tables
A stopword is a very frequent word which by default is not indexed stopwords of course depend of current language for example: words 'de' and 'la' in french or words 'a' or 'the' in english.CREATE TABLE ft3_stopwords (Frequentwords are stopwords or words like html, fr, de ... whatever is very frequent and do not gives information.
word TEXT NOT NULL
, lang CHAR(2) DEFAULT '??' NOT NULL
);
CREATE TABLE ft3_frequentwords (Table of words found inside your corpus
word TEXT NOT NULL
, lang CHAR(2) DEFAULT '??' NOT NULL
);
- word:
- asciified version of word
- density:
- inverse document frequency for this word
- wordcounter:
- number of occurence of word
- docscounter:
- number of documents where the word is at least one time present CONSTRAINT docscounter <= wordcounter
- metaphone:
- double metaphone of word
- stemmed:
- stem of word (depending of language of word)
CREATE TABLE ft3_words (
recid INTEGER PRIMARY KEY
, word TEXT UNIQUE NOT NULL
, density REAL DEFAULT 0.0 NOT NULL
, wordcounter INTEGER DEFAULT 1 NOT NULL
, docscounter INTEGER DEFAULT 1 NOT NULL
, metaphone TEXT DEFAULT '+'
, stemmed TEXT DEFAULT ''
);
Document related tables
- lang:
- language detected by the analyser, if it failed default is "e;??"e;
- last:
- last date of indexation
- status:
- if 200 then document has been indexed, if 404 document has been deleted
- md5:
- currently unused
CREATE TABLE ft3_docinfos (
docid INTEGER UNIQUE
, lang CHAR(2) DEFAULT '??' NOT NULL
, last INTEGER DEFAULT 0 NOT NULL
, status INTEGER DEFAULT 200 NOT NULL
, md5 TEXT
);
Inverted index tables
A table holding an inverted index, for each words it gives a list of documents where the word is present. This is the main table of ft3. It can be very large and then it is splitted into a number of sub-tables according a hash on wordid- wordid:
- recid of word into ft3_words
- docid:
- global id of document (global id is computed via a function from local oid inside your database table and a number which caracterise your table (call tablegroup) later)
- scores:
- scores of word into document, the higher the score, the better the match will be rank
- tfidf:
- term freq over inverse document freq classical mesure of relevance
- window:
- window number a window is a sequence of word into a text. usually we split the text into window according to punctuation or other typographic bundary
- winsz:
- size of current window
- winoff:
- position into current window
- flag:
- holds the columns type in each table where the word occured
CREATE TABLE ft3_scores0 (
wordid INTEGER NOT NULL
, docid INTEGER NOT NULL
, scores REAL
, tfidf REAL
, window INTEGER
, winsz INTEGER
, winoff INTEGER
, flag INTEGER
, stemmed INTEGER
);
CREATE VIEW ft3_all_scores AS
SELECT * FROM ft3_scores0
UNION ALL
SELECT * FROM ft3_scores1
UNION ALL
...
UNION ALL
SELECT * FROM ft3_scores15
;
Co-occurence Matrix
A matrix of co-occurences if 2 words id1 and id2 appeared inside the same window, then cnt increase by 1 the matrix is NOT symmetric- word1id:
- id of first word
- word12d:
- id of second word
- cnt:
- number of window where both words are present (1 before 2)
- relation:
- hold the association strentgh between 1 & 2, the higher the relation the higher the probability of finding 1 near 2 (both in queries and in answers)
- categ:
- name of the category (utf8 text)
- level:
- level in the hierarchy, top node as level 0
- databasealias:
- name of database open or attached into sqlite3
- tablename:
- name of the table to be indexed
- tablegroup:
- a tablegroup helps to group a list of tablename/columname together
- pkeyname:
- name of the primary key for your table, oid will do it for sqlite3 if you don't explicitly declare
- onecolumnname:
- name of the colum to be index, expected to be of TEXT affinity
- flag:
- describes some characteristics for helping the ranker currently we deal with text, url, catogories
- lang:
- do you want to enable language detection (current tool use trigrams and can be trained to recognize new languages if needed)
- stem:
- do you want to enable stemming for words? This depends on the language of the current document?
- mixed:
- can you afford to store both entries for each word and its stemmed form? Remember that ft3_scores is currently very large
- weight:
- a static weight to be applied in ranking procedure
CREATE TABLE ft3_proximity (
word1id INTEGER
, word2id INTEGER
, cnt INTEGER
, relation REAL
);
Topics related tables
this table holds a list of topics if you did declare some of your column of type FLAG_CATEGORIES. This is really usefull if you index datas where there is a natural hierarchy; for example in a wiki, in urls, in directories of files or documentsCREATE TABLE ft3_topics (
recid INTEGER PRIMARY KEY
, categ TEXT
, level INTEGER
);
CREATE INDEX topics__categ__key ON ft3_topics(categ);topics tree is stored via relation parent/child
CREATE TABLE ft3_topics_hierarchy (
recid INTEGER PRIMARY KEY
, parent INTEGER
, child INTEGER
);
FT3 configuration related tables
Where to store information about database configuration keep in sync with c++ codeCREATE TABLE ft3_column_type (-- url is either 1 or 2
id INT
, desc TEXT
);
INSERT INTO ft3_column_type VALUES ( 1, "host" );A config table which define behaviuor for ft3_indexer:
INSERT INTO ft3_column_type VALUES ( 2, "path" );
INSERT INTO ft3_column_type VALUES ( 3, "text" );
INSERT INTO ft3_column_type VALUES ( 4, "topic" );
INSERT INTO ft3_column_type VALUES ( 5, "domain" );
INSERT INTO ft3_column_type VALUES ( 6, "subdomain" );
INSERT INTO ft3_column_type VALUES ( 7, "title" );
INSERT INTO ft3_column_type VALUES ( 8, "description" );
INSERT INTO ft3_column_type VALUES ( 9, "meta" );
INSERT INTO ft3_column_type VALUES (10, "cluster" );
INSERT INTO ft3_column_type VALUES (11, "keywords" );
INSERT INTO ft3_column_type VALUES (12, "url" );
CREATE TABLE ft3_config (
recid INTEGER PRIMARY KEY
, databasealias TEXT
, tablename TEXT
, tablegroup INT
, pkeyname TEXT
, columnname TEXT
, flag INT
, lang INT
, stem INT
, mixted INT
, weigth INT
);