Design
Algorithm
TFIDF
Proximity scores
Default ranking
Commented SQL schema
Every table start with aft3_prefix since sqlite3 does not support
CREATE SCHEMA.
The schema version is used to detect old data format. Current version is 12.
CREATE TABLE ft3_version (
version INTEGER
);
INSERT INTO ft3_version VALUES ( 12 );
Words related tables
A stopword is a very frequent word which by default is not indexed stopwords of course depend of current language for example: words 'de' and 'la' in french or words 'a' or 'the' in english.CREATE TABLE ft3_stopwords (Frequentwords are stopwords or words like html, fr, de ... whatever is very frequent and do not gives information.
word TEXT NOT NULL
, lang CHAR(2) DEFAULT '??' NOT NULL
);
CREATE TABLE ft3_frequentwords (Table of words found inside your corpus
word TEXT NOT NULL
, lang CHAR(2) DEFAULT '??' NOT NULL
);
- word:
- asciified version of word
- density:
- inverse document frequency for this word
- wordcounter:
- number of occurence of word
- docscounter:
- number of documents where the word is at least one time present CONSTRAINT docscounter <= wordcounter
- metaphone:
- double metaphone of word
- stemmed:
- stem of word (depending of language of word)
CREATE TABLE ft3_words (
recid INTEGER PRIMARY KEY
, word TEXT UNIQUE NOT NULL
, density REAL DEFAULT 0.0 NOT NULL
, wordcounter INTEGER DEFAULT 1 NOT NULL
, docscounter INTEGER DEFAULT 1 NOT NULL
, metaphone TEXT DEFAULT '+'
, stemmed TEXT DEFAULT ''
);
Document related tables
- lang:
- language detected by the analyser, if it failed default is "e;??"e;
- last:
- last date of indexation
- status:
- if 200 then document has been indexed, if 404 document has been deleted
- md5:
- currently unused
CREATE TABLE ft3_docinfos (
docid INTEGER UNIQUE
, lang CHAR(2) DEFAULT '??' NOT NULL
, last INTEGER DEFAULT 0 NOT NULL
, status INTEGER DEFAULT 200 NOT NULL
, md5 TEXT
);
Inverted index tables
A table holding an inverted index, for each words it gives a list of documents where the word is present. This is the main table of ft3. It can be very large and then it is splitted into a number of sub-tables according a hash on wordid- wordid:
- recid of word into ft3_words
- docid:
- global id of document (global id is computed via a function from local oid inside your database table and a number which caracterise your table (call tablegroup) later)
- scores:
- scores of word into document, the higher the score, the better the match will be rank
- tfidf:
- term freq over inverse document freq classical mesure of relevance
- window:
- window number a window is a sequence of word into a text. usually we split the text into window according to punctuation or other typographic bundary
- winsz:
- size of current window
- winoff:
- position into current window
- flag:
- holds the columns type in each table where the word occured
CREATE TABLE ft3_scores0 (
wordid INTEGER NOT NULL
, docid INTEGER NOT NULL
, scores REAL
, tfidf REAL
, window INTEGER
, winsz INTEGER
, winoff INTEGER
, flag INTEGER
, stemmed INTEGER
);
CREATE VIEW ft3_all_scores AS
SELECT * FROM ft3_scores0
UNION ALL
SELECT * FROM ft3_scores1
UNION ALL
...
UNION ALL
SELECT * FROM ft3_scores15
;
Co-occurence Matrix
A matrix of co-occurences if 2 words id1 and id2 appeared inside the same window, then cnt increase by 1 the matrix is NOT symmetric- word1id:
- id of first word
- word12d:
- id of second word
- cnt:
- number of window where both words are present (1 before 2)
- relation:
- hold the association strentgh between 1 & 2, the higher the relation the higher the probability of finding 1 near 2 (both in queries and in answers)
- categ:
- name of the category (utf8 text)
- level:
- level in the hierarchy, top node as level 0
- databasealias:
- name of database open or attached into sqlite3
- tablename:
- name of the table to be indexed
- tablegroup:
- a tablegroup helps to group a list of tablename/columname together
- pkeyname:
- name of the primary key for your table, oid will do it for sqlite3 if you don't explicitly declare
- onecolumnname:
- name of the colum to be index, expected to be of TEXT affinity
- flag:
- describes some characteristics for helping the ranker currently we deal with text, url, catogories
- lang:
- do you want to enable language detection (current tool use trigrams and can be trained to recognize new languages if needed)
- stem:
- do you want to enable stemming for words? This depends on the language of the current document?
- mixed:
- can you afford to store both entries for each word and its stemmed form? Remember that ft3_scores is currently very large
- weight:
- a static weight to be applied in ranking procedure
CREATE TABLE ft3_proximity (
word1id INTEGER
, word2id INTEGER
, cnt INTEGER
, relation REAL
);
Topics related tables
this table holds a list of topics if you did declare some of your column of type FLAG_CATEGORIES. This is really usefull if you index datas where there is a natural hierarchy; for example in a wiki, in urls, in directories of files or documentsCREATE TABLE ft3_topics (
recid INTEGER PRIMARY KEY
, categ TEXT
, level INTEGER
);
CREATE INDEX topics__categ__key ON ft3_topics(categ);topics tree is stored via relation parent/child
CREATE TABLE ft3_topics_hierarchy (
recid INTEGER PRIMARY KEY
, parent INTEGER
, child INTEGER
);
FT3 configuration related tables
Where to store information about database configuration keep in sync with c++ codeCREATE TABLE ft3_column_type (-- url is either 1 or 2
id INT
, desc TEXT
);
INSERT INTO ft3_column_type VALUES ( 1, "host" );A config table which define behaviuor for ft3_indexer:
INSERT INTO ft3_column_type VALUES ( 2, "path" );
INSERT INTO ft3_column_type VALUES ( 3, "text" );
INSERT INTO ft3_column_type VALUES ( 4, "topic" );
INSERT INTO ft3_column_type VALUES ( 5, "domain" );
INSERT INTO ft3_column_type VALUES ( 6, "subdomain" );
INSERT INTO ft3_column_type VALUES ( 7, "title" );
INSERT INTO ft3_column_type VALUES ( 8, "description" );
INSERT INTO ft3_column_type VALUES ( 9, "meta" );
INSERT INTO ft3_column_type VALUES (10, "cluster" );
INSERT INTO ft3_column_type VALUES (11, "keywords" );
INSERT INTO ft3_column_type VALUES (12, "url" );
CREATE TABLE ft3_config (
recid INTEGER PRIMARY KEY
, databasealias TEXT
, tablename TEXT
, tablegroup INT
, pkeyname TEXT
, columnname TEXT
, flag INT
, lang INT
, stem INT
, mixted INT
, weigth INT
);