FT3: a full text indexer and search engine

User guide

In this section we first gives a basic example, then a more detailed description of the indexation process, and then a description on how to search into your data with the default searcher. An illustration of the process is given with a real word example: set up a search engine for the english part of dmoz.org

A basic example

Create and Fill the configuration table

Let suppose we have a test database call 'test.db3'. We have a table call "text" which contains 2 columns, "title" and "content". We want to index both columns.

First create an empty sqlite3 database which holds all datas produces by ft3; my current convention is to name it with a '.ft3' as a suffix. ft3.sqlite is a text file holding ft3 SQL schema. On unix platform, it is install in $prefix/share/ft3/ft3.sqlite

sqlite3 test.ft3 < ft3.sqlite

We just need to fill table ft3_config, insert the datas with sqlite3 or another frontend:

INSERT INTO ft3_config VALUES ( 1,
      "test", "text", 1, "oid", "title",
      7, 1, 0, 0, 0
);
      INSERT INTO ft3_config VALUES ( 2,
      "test", "text", 1, "oid", "content",
      3, 1, 0, 0, 0
);

First line explain to ft3 that the database is named "test", the table "text", the column "title". "oid" is the way under sqlite to get internal row id.

In the first insert, 7 denotes a title and in the second insert, 3 denotes a classical text. A complete list of possible values is describe in the SQL schema which is commented below

Run the indexer

Since the indexer is pretty fast, it is a oneshot process. It will be incremental in the next release with sql triggers.

ft3_indexer test.db test.ft3

Every 10000 words/documents, the process will print some informations, how fast it goes, estimated memory consumption, ... to give you an estimate on when it will be finished.

Testing

First test with the basic searcher:
ft3_searcher test.db test.ft3
or simply
ft3_searcher test

if you follow the convention that the databases are postfixed .db3 and .ft3.

This is a very simple interpreter. It understand queries, one per line. The syntax is classical web syntax, that is + force a word, - remove a word, "e; is use for exact match. The program will print a relevant part of each documents matching your query order by rank.

ft3_searcher

This program gives you an interface to all ft3 functions. You start it with the name of your database and the name of its companion:

ft3_searcher my.db my.ft3
A simple
ft3> .h

gives you a list of commands.

By default, it performs a full text search with the words you enter. The syntax is classical on the web:

ft3> word1 word2

looks for all indexed documents containing word1 and word2.

ft3> word1 word2 -word3

looks for all indexed documents containing word1 and word2 but without word3.

ft3> "word1 word2 word3"

looks for all indexed documents containing the phrase word1 followed by word2 followed by word3.

If you switch to SQL mode by
ft3> .x
the prompt change and you can issue SQL command and call directly ft3 functions.
sql>
Some examples:
  • Count the number of words:
    sql> select count(*) from ft3_words;
  • Identify the lang of a text:
    sql> select ft3_language('This is an english phrase, it should be
          identified as such. The phrase is short and it may failed.)
    will return
    1 'en'
  • Interface to words stemmer:
    sql> select ft3_stemmer('fr','francais);
    returns
    1 mariah
          2 maria
          3 marie
  • Interface to double metaphone algorithm:
    sql> select ft3_dmetaphone('francais');
    Find all words with the same metaphone:
    sql> select word from ft3_words where metaphone=ft3_dmetaphone('maria');
  • Interface to edit distance between two strings:
    sql> select ft3_levenhstein('test','teste');

Indexer guide

The goal of the indexer is to fill the 2 tables ft3_words and ft3_scores. The first one is a dictionnary for words found into your documents. The second one is a inverted index which gives a convenient way to know all documents containing a word. By doing intersection on ft3_scores, you can easily find all documents containing many words.

Both tables contain other fields which will help both at query time and at sorting time.

The indexation process will be followed by steps:

  1. Create the companion database and fill it with default schema
  2. Adjust parameters for your language
  3. Fill the configuration table with your requirements
  4. Lunch the indexer
  5. Do some basic checks

Create the companion database and fill it with ft3 schema

Adjust parameters for your language

Fill the configuration table with your requirements

Lunch the indexer

Do some basic checks

Have a look at the 10 more frequent words.
SELECT * FROM ft3_words ORDER BY wordscounter DESC LIMIT 10;

Search guide

The dmoz.org search engine demonstration

Dmoz is famous directory on the internet. They provide a dump of there database in RDF (XML) version. Dmoz contains currently more than 4 millions entries. It is an intresting test for ft3 because the data is near the maximum I want to support whithout a distributed solution.

Getting datas

Find 200MO of space on your disk!

Dmoz is large (20MO of text)!

Download datas from the internet:

Download ODP datas with a browser or a tool like wget ou curl.

wget http://rdf.dmoz.org/rdf/content.u8.gz

and don't forget to decompress the file:

gunzip content.u8.gz

Filling a sqlite3 databases with dmoz.org content

In the demo subdirectory, a dmoz2sqlite tool has been build. Just try:

dmoz2sqlite content.u8 dmoz.db3

It takes a couple of hours of my laptop ...

Configuring and Indexing the database

Create the companion database and inject ft3 schema:

sqlite3 dmoz.ft3 < ${prefix}/share/ft3/ft3.sqlite

and setup the ft3 configuration:

sqlite3 dmoz.ft3 < ${prefix}/share/ft3/demo/demo_dmoz.config

You can look at the config: basically we index all documents and 4 fields url, title, description and category. Each field is declare with a different property.

Search with the basic searcher

Search from Apache/PHP

Home,   Sourceforge,   Download,   @2002-2006 P. Aubert