FT3: a full text indexer and search engine

How it works

A two documents example

Let 2 documents A and B containing respectively string

I love New-York.
and
I love San Francisco.

We will split documents A and B word by word and get both a list of words and a list of positions in each document:

List of words

For each word we store a numeric id:
word wordid
i 1
love 2
new 3
york 4
san 5
francisco 6

List of scores

This is an inverted index, for each word, it gives the list of documents which contain the word. We also store a position of the word inside each document. We keep it sorted by wordid and then by docid.

wordiddocidposition
111
121
212
222
313
414
523
624

Get all documents containing word love

Write an SQL request like:
SELECT docid
      FROM ft3_scores
      WHERE wordid = ( SELECT oid FROM ft3_words where word = 'love' );

Get all documents containing words I and love

Write an SQL request like:
SELECT s1.docid
      FROM ft3_scores AS s1, ft3_scores AS s2
      WHERE s1.wordid = (SELECT oid FROM ft3_words where word = 'i')
      AND s2.wordid = (SELECT oid FROM ft3_words where word = 'love')
      AND s1.docid = s2.docid;

Get all documents containing word new followed by word york

Write an SQL request like:
SELECT s1.docid
      FROM ft3_scores AS s1, ft3_scores AS s2
      WHERE s1.wordid = (SELECT oid FROM words where word = 'new')
      AND s2.wordid = (SELECT oid FROM words where word = 'york')
      AND s1.docid = s2.docid
      AND s1.position+1 = s2.position;

More details: words handling

You have of course remark that the matching as been done on unfolded letters, that is word 'New' match word 'New' and word 'new'. This is a classical design which works well for english, but you can decide otherwise. Similary the city name 'New-York' has been split in two. You may agree or not.

In language with diacritics, you have to decide how you want the matching done:

In french, word 'œuvre' with the ligature oe may match 'oeuvre' for example. In german a 'ß' may match 'ss'.

FT3 provides an API for defining your own matching. This is also dependant of the underlying encoding. Currently we only support UTF-8 which is used by default inside sqlite3.

More details: stemmed words

A stem helps in providing a match for single/plural words or conjugated forms. It is expected that the stemmed form of a word will be the same for a noun plural or singular and that most forms of a verb will give the same stem. The algorithm which computes the stem is of course language dependant. In english, it use the well known Porter algorithm, and variants of it in other indo-europeen languages. The language supported are those of the library in contrib/stemmed directory.

For example in english, words 'cars' and 'car' will match. For exemple, in french words 'saisons' and 'saison' will have the same stem.

The algorithm is not perfect in particular for language with a lot of exceptions like french.

More details: similar words

In some cases, it is useful to have words related by one way or one other. Similar words can be words with the same phonetic for example. That is usefull with words like 'schwartzenegger' or 'jonasz'.

More details: enhance similar words

In some cases, we can do better than a one for one matching. For example:

triangle carre => triangle carré
and
mariah carre => mariah carey

The disambiguisation between the two forms of 'carre' can be done only within a context. Knowledge of your corpus will also help. FT3 proposes two algorithms for guessing which corrections is better in your request for the current context.

More details: stopwords

A stopword is a very frequent word which usually is meaningless in queries and in documents because it appeared into too much documents. They are usually skipped when documents are processed by FT3 (this is the default behaviour, it can be change on a table basis in configuration tables).

More details: ft3_scores

Ft3_Scores is a table containing for each word in each document, position information, and scoring information. There is currently 3 scores at your disposal:

  1. proximity:

    gives information relative to the distance between words. Usually words in queries which are near (in the same phrase) inside the document are more relevant.

  2. tfidf:

    is a classical scoring system. Basic idea is the following: a word is important in a document if it appears more than average (taken on document containing the word). For example, if word 'gargamelle' is not frequent in your corpus, probably the most interesting documents are those containing 'gargamelle' more than average in those documents. Second idea is that if you have several words in your queries, the most discriminant words are the least frequent.

  3. rank:

    a rank can be computed if there is some natural hierarchy between your documents. FT3 provides a way to handle hierarchical topics, datas like urls or directories.

The ranking function combines the 3 above scores to provide a ranking. This can be optimize if you have beforehand knowledge of your corpus. For example, looking for 'George Washingthon' should favor proximity because your are almost certainly intrested in the famous president. If you ask for 'California Washington', FT3 can suppose it is not the case because the association strength between this two words is weaker than in the previous query.

Home,   Sourceforge,   Download,   @2002-2006 P. Aubert