FT3: a full text indexer and search engine

Design

Algorithm

TFIDF

Proximity scores

Default ranking

Commented SQL schema

Every table start with a ft3_ prefix since sqlite3 does not support CREATE SCHEMA.
The schema version is used to detect old data format. Current version is 12.
CREATE TABLE ft3_version (
      version INTEGER
);
INSERT INTO ft3_version VALUES ( 12 );

Words related tables

A stopword is a very frequent word which by default is not indexed stopwords of course depend of current language for example: words 'de' and 'la' in french or words 'a' or 'the' in english.
CREATE TABLE ft3_stopwords (
      word TEXT NOT NULL
,     lang CHAR(2) DEFAULT '??' NOT NULL
);
Frequentwords are stopwords or words like html, fr, de ... whatever is very frequent and do not gives information.
CREATE TABLE ft3_frequentwords (
      word TEXT NOT NULL
,     lang CHAR(2) DEFAULT '??' NOT NULL
);
Table of words found inside your corpus
word:
asciified version of word
density:
inverse document frequency for this word
wordcounter:
number of occurence of word
docscounter:
number of documents where the word is at least one time present CONSTRAINT docscounter <= wordcounter
metaphone:
double metaphone of word
stemmed:
stem of word (depending of language of word)
CREATE TABLE ft3_words (
      recid INTEGER PRIMARY KEY
,     word TEXT UNIQUE NOT NULL
,     density REAL DEFAULT 0.0 NOT NULL
,     wordcounter INTEGER DEFAULT 1 NOT NULL
,     docscounter INTEGER DEFAULT 1 NOT NULL
,     metaphone TEXT DEFAULT '+'
,     stemmed TEXT DEFAULT ''
);

Document related tables

lang:
language detected by the analyser, if it failed default is "e;??"e;
last:
last date of indexation
status:
if 200 then document has been indexed, if 404 document has been deleted
md5:
currently unused
CREATE TABLE ft3_docinfos (
      docid INTEGER UNIQUE
,     lang CHAR(2) DEFAULT '??' NOT NULL
,     last INTEGER DEFAULT 0 NOT NULL
,     status INTEGER DEFAULT 200 NOT NULL
,     md5 TEXT
);

Inverted index tables

A table holding an inverted index, for each words it gives a list of documents where the word is present. This is the main table of ft3. It can be very large and then it is splitted into a number of sub-tables according a hash on wordid
wordid:
recid of word into ft3_words
docid:
global id of document (global id is computed via a function from local oid inside your database table and a number which caracterise your table (call tablegroup) later)
scores:
scores of word into document, the higher the score, the better the match will be rank
tfidf:
term freq over inverse document freq classical mesure of relevance
window:
window number a window is a sequence of word into a text. usually we split the text into window according to punctuation or other typographic bundary
winsz:
size of current window
winoff:
position into current window
flag:
holds the columns type in each table where the word occured
CREATE TABLE ft3_scores0 (
      wordid INTEGER NOT NULL
,     docid INTEGER NOT NULL
,     scores REAL
,     tfidf REAL
,     window INTEGER
,     winsz INTEGER
,     winoff INTEGER
,     flag INTEGER
,     stemmed INTEGER
);
CREATE VIEW ft3_all_scores AS
      SELECT * FROM ft3_scores0
     UNION ALL
      SELECT * FROM ft3_scores1
     UNION ALL
      ...
     UNION ALL
      SELECT * FROM ft3_scores15
     ;

Co-occurence Matrix

A matrix of co-occurences if 2 words id1 and id2 appeared inside the same window, then cnt increase by 1 the matrix is NOT symmetric
word1id:
id of first word
word12d:
id of second word
cnt:
number of window where both words are present (1 before 2)
relation:
hold the association strentgh between 1 & 2, the higher the relation the higher the probability of finding 1 near 2 (both in queries and in answers)
CREATE TABLE ft3_proximity (
      word1id INTEGER
,     word2id INTEGER
,     cnt INTEGER
,     relation REAL
);

Topics related tables

this table holds a list of topics if you did declare some of your column of type FLAG_CATEGORIES. This is really usefull if you index datas where there is a natural hierarchy; for example in a wiki, in urls, in directories of files or documents
categ:
name of the category (utf8 text)
level:
level in the hierarchy, top node as level 0
CREATE TABLE ft3_topics (
      recid INTEGER PRIMARY KEY
,     categ TEXT
,     level INTEGER
);
CREATE INDEX topics__categ__key ON ft3_topics(categ);
topics tree is stored via relation parent/child
CREATE TABLE ft3_topics_hierarchy (
      recid INTEGER PRIMARY KEY
,     parent INTEGER
,     child INTEGER
);

FT3 configuration related tables

Where to store information about database configuration keep in sync with c++ code
CREATE TABLE ft3_column_type (
      id INT
,     desc TEXT
);
-- url is either 1 or 2
INSERT INTO ft3_column_type VALUES ( 1, "host" );
      INSERT INTO ft3_column_type VALUES ( 2, "path" );
      INSERT INTO ft3_column_type VALUES ( 3, "text" );
      INSERT INTO ft3_column_type VALUES ( 4, "topic" );
      INSERT INTO ft3_column_type VALUES ( 5, "domain" );
      INSERT INTO ft3_column_type VALUES ( 6, "subdomain" );
      INSERT INTO ft3_column_type VALUES ( 7, "title" );
      INSERT INTO ft3_column_type VALUES ( 8, "description" );
      INSERT INTO ft3_column_type VALUES ( 9, "meta" );
      INSERT INTO ft3_column_type VALUES (10, "cluster" );
      INSERT INTO ft3_column_type VALUES (11, "keywords" );
      INSERT INTO ft3_column_type VALUES (12, "url" );
A config table which define behaviuor for ft3_indexer:
databasealias:
name of database open or attached into sqlite3
tablename:
name of the table to be indexed
tablegroup:
a tablegroup helps to group a list of tablename/columname together
pkeyname:
name of the primary key for your table, oid will do it for sqlite3 if you don't explicitly declare
onecolumnname:
name of the colum to be index, expected to be of TEXT affinity
flag:
describes some characteristics for helping the ranker currently we deal with text, url, catogories
lang:
do you want to enable language detection (current tool use trigrams and can be trained to recognize new languages if needed)
stem:
do you want to enable stemming for words? This depends on the language of the current document?
mixed:
can you afford to store both entries for each word and its stemmed form? Remember that ft3_scores is currently very large
weight:
a static weight to be applied in ranking procedure
CREATE TABLE ft3_config (
      recid INTEGER PRIMARY KEY
,     databasealias TEXT
,     tablename TEXT
,     tablegroup INT
,     pkeyname TEXT
,     columnname TEXT
,     flag INT
,     lang INT
,     stem INT
,     mixted INT
,     weigth INT
);
Home,   Sourceforge,   Download,   @2002-2006 P. Aubert