FT3: a full text indexer and search engine

Design for 0.3

FT3 is based around the following ideas:

Algorithm

TFIDF

Link to classical description of TFIDF (Aka Term Frequency over Inverse Document Frequency)

Proximity scores

Proximity scores are used to rank higher a document with all words from query in a close range.

This scores are computed and activated on ranking function if the a priori estimate that the words are correlated is high enough.

Links and hierarchy analysis

FT3 can takes into account a hierarchy of documents (like in a directory) and can favour documents near the root

In the same spirit, internal links inside the corpus are consolidated, this is not an automatic process, thus some coding is necessary to have it working nicely. An example is available on wikipedia database dump where inner links [[...]] and [...] are taken into account

Default ranking

After computing intersections between words index, you have to rank documents. We propose a simple default ranker. I guess you have to manually tunes the parameters for your corpus to get a better ranking. It is obviously corpus dependant.

Current score is computing like this:

Indexation score: for each word we store a tfidf score and a ranking score which is related links hierarchy analyses.

Search score: is a multiplication between indexation score, proximity scores and popularity score. Parameters are available to tune the function according to your corpus and personal taste.

Commented SQL schema

Every table start with a ft3_ prefix since sqlite3 does not support CREATE SCHEMA.
The schema version is used to detect old data format. Current version is 14.
CREATE TABLE ft3_version (
      version INTEGER
);
INSERT INTO ft3_version VALUES ( 14 );

Words related tables

A stopword is a very frequent word which by default is not indexed stopwords of course depend of current language for example: words 'de' and 'la' in french or words 'a' or 'the' in english.
CREATE TABLE ft3_stopwords (
      word TEXT NOT NULL
,     lang CHAR(2) DEFAULT '??' NOT NULL
);
Frequentwords are stopwords or words like html, fr, de ... whatever is very frequent and do not gives information.
CREATE TABLE ft3_frequentwords (
      word TEXT NOT NULL
,     lang CHAR(2) DEFAULT '??' NOT NULL
);
Table of words found inside your corpus
word:
asciified version of word
density:
inverse document frequency for this word
wordcounter:
number of occurence of word
docscounter:
number of documents where the word is at least one time present CONSTRAINT docscounter <= wordcounter
metaphone:
double metaphone of word
stemmed:
stem of word (depending of language of word)
CREATE TABLE ft3_words (
      recid INTEGER PRIMARY KEY
,     word TEXT UNIQUE NOT NULL
,     density REAL DEFAULT 0.0 NOT NULL
,     wordcounter INTEGER DEFAULT 1 NOT NULL
,     docscounter INTEGER DEFAULT 1 NOT NULL
,     metaphone TEXT DEFAULT '+'
,     stemmed TEXT DEFAULT ''
);

Document related tables

lang:
language detected by the analyser, if it failed default is "e;??"e;
last:
last date of indexation
status:
if 200 then document has been indexed, if 404 document has been deleted
md5:
currently unused
CREATE TABLE ft3_docinfos (
      docid INTEGER UNIQUE
,     lang CHAR(2) DEFAULT '??' NOT NULL
,     last INTEGER DEFAULT 0 NOT NULL
,     status INTEGER DEFAULT 200 NOT NULL
,     md5 TEXT
);

Inverted index tables

A table holding an inverted index, for each words it gives a list of documents where the word is present. This is the main table of ft3. It can be very large and then it is splitted into a number of sub-tables according a hash on wordid
wordid:
recid of word into ft3_words
docid:
global id of document (global id is computed via a function from local oid inside your database table and a number which caracterise your table (call tablegroup) later)
scores:
scores of word into document, the higher the score, the better the match will be rank
tfidf:
term freq over inverse document freq classical mesure of relevance
window:
window number a window is a sequence of word into a text. usually we split the text into window according to punctuation or other typographic bundary
winsz:
size of current window
winoff:
position into current window
flag:
holds the columns type in each table where the word occured
CREATE TABLE ft3_scores0 (
      wordid INTEGER NOT NULL
,     docid INTEGER NOT NULL
,     scores REAL
,     tfidf REAL
,     window INTEGER
,     winsz INTEGER
,     winoff INTEGER
,     flag INTEGER
,     stemmed INTEGER
);
CREATE VIEW ft3_all_scores AS
      SELECT * FROM ft3_scores0
     UNION ALL
      SELECT * FROM ft3_scores1
     UNION ALL
      ...
     UNION ALL
      SELECT * FROM ft3_scores15
     ;

Co-occurence Matrix

A matrix of co-occurences if 2 words id1 and id2 appeared inside the same window, then cnt increase by 1 the matrix is NOT symmetric
word1id:
id of first word
word12d:
id of second word
cnt:
number of window where both words are present (1 before 2)
relation:
hold the association strentgh between 1 & 2, the higher the relation the higher the probability of finding 1 near 2 (both in queries and in answers)
CREATE TABLE ft3_proximity (
      word1id INTEGER
,     word2id INTEGER
,     cnt INTEGER
,     relation REAL
);

Topics related tables

this table holds a list of topics if you did declare some of your column of type FLAG_CATEGORIES. This is really usefull if you index datas where there is a natural hierarchy; for example in a wiki, in urls, in directories of files or documents
categ:
name of the category (utf8 text)
level:
level in the hierarchy, top node as level 0
CREATE TABLE ft3_topics (
      recid INTEGER PRIMARY KEY
,     categ TEXT
,     level INTEGER
);
CREATE INDEX topics__categ__key ON ft3_topics(categ);
topics tree is stored via relation parent/child
CREATE TABLE ft3_topics_hierarchy (
      recid INTEGER PRIMARY KEY
,     parent INTEGER
,     child INTEGER
);

FT3 configuration related tables

Where to store information about database configuration keep in sync with c++ code
CREATE TABLE ft3_column_type (
      id INT
,     desc TEXT
);
-- url is either 1 or 2
INSERT INTO ft3_column_type VALUES ( 1, "host" );
      INSERT INTO ft3_column_type VALUES ( 2, "path" );
      INSERT INTO ft3_column_type VALUES ( 3, "text" );
      INSERT INTO ft3_column_type VALUES ( 4, "topic" );
      INSERT INTO ft3_column_type VALUES ( 5, "domain" );
      INSERT INTO ft3_column_type VALUES ( 6, "subdomain" );
      INSERT INTO ft3_column_type VALUES ( 7, "title" );
      INSERT INTO ft3_column_type VALUES ( 8, "description" );
      INSERT INTO ft3_column_type VALUES ( 9, "meta" );
      INSERT INTO ft3_column_type VALUES (10, "cluster" );
      INSERT INTO ft3_column_type VALUES (11, "keywords" );
      INSERT INTO ft3_column_type VALUES (12, "url" );
A config table which define behaviuor for ft3_indexer:
databasealias:
name of database open or attached into sqlite3
tablename:
name of the table to be indexed
tablegroup:
a tablegroup helps to group a list of tablename/columname together
pkeyname:
name of the primary key for your table, oid will do it for sqlite3 if you don't explicitly declare
onecolumnname:
name of the colum to be index, expected to be of TEXT affinity
flag:
describes some characteristics for helping the ranker currently we deal with text, url, catogories
lang:
do you want to enable language detection (current tool use trigrams and can be trained to recognize new languages if needed)
stem:
do you want to enable stemming for words? This depends on the language of the current document?
mixed:
can you afford to store both entries for each word and its stemmed form? Remember that ft3_scores is currently very large
weight:
a static weight to be applied in ranking procedure
CREATE TABLE ft3_config (
      recid INTEGER PRIMARY KEY
,     databasealias TEXT
,     tablename TEXT
,     tablegroup INT
,     pkeyname TEXT
,     columnname TEXT
,     flag INT
,     lang INT
,     stem INT
,     mixted INT
,     weigth INT
);
Home,   Sourceforge,   Download,   @2002-2006 P. Aubert