User guide

In this section we first give a basic example, then a more detailed description of the indexing process, and finally a description of how to search your data with the default searcher.

Each program is then described in detail:

ft3_setup: A small wizard for setting up the required tables, triggers and parameters.
ft3_indexer: The indexer creates, updates and maintains the actual text indexes.
ft3_searcher: A basic text interface to query your database.
ft3d: A Unix daemon or Windows service which acts as a web server. It supports HTTP requests and replies in XML. The current schema is ft3-0.1.xsd.
dmoz2sqlite: This simple C++ program takes a dmoz.org RDF dump and puts it in a sqlite3 database. It is small, very fast and doesn't use much memory. It is used to support the dmoz.org search engine demonstration.
wiki2sqlite: This simple C++ program takes a Wikipedia XML dump and puts it in a sqlite3 database. It is small, very fast and doesn't use much memory.

The process is illustrated with a real-world example: setting up a search engine for the English part of dmoz.org. A very similar example is also available in the demo directory: your own private search engine for a Wikipedia database :)

A basic example

Create and fill the configuration table

Let us suppose we have a test database called 'test.db3'. It has a table called "text" which contains 2 columns, "title" and "content". We want to index both columns.
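
If you do not have such a database at hand, the following minimal sketch creates one with Python's standard sqlite3 module. The function name create_test_db and the two sample rows are made up for illustration; only the file name, the table name and the column names come from the example above.

```python
import sqlite3


def create_test_db(path="test.db3"):
    """Create the example database with a two-column "text" table.

    The sample rows are invented; replace them with your own documents.
    """
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE IF NOT EXISTS "text" (title TEXT, content TEXT)')
    conn.executemany(
        'INSERT INTO "text" (title, content) VALUES (?, ?)',
        [
            ("First note", "SQLite stores the whole database in a single file."),
            ("Second note", "Full text search needs an index over the words."),
        ],
    )
    conn.commit()
    return conn
```

Calling create_test_db().close() writes test.db3 in the current directory; passing ":memory:" instead keeps everything in RAM for a quick experiment.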

FT3 comes with a small curses wizard which helps you set up the required tables.

First of all, if something goes wrong:

ft3_setup -c test.db3

will clean your ft3 configuration.

Launch the wizard in a shell or in a console (cmd on Windows):

ft3_setup test.db3

You will be prompted for numbers to select which tables and then which columns you want to index. You can also set optional parameters. 'q' goes up one level.

If it is not straightforward enough, please drop me a note and I will copy/paste a complete session.

Run the indexer

The indexer supports 4 modes:

The first mode is the v0.2 one. It is a one-shot process which scans all documents from scratch and rebuilds the index. This mode is deprecated and will be removed in 0.4.
The second mode is also a one-shot process, but it scans the journal generated by the triggers rather than the documents directly. It consumes less disk space but is slower for small databases.
The third mode is incremental. Each time you launch the indexer, it scans the journal to find modified, new or deleted entries and re-indexes them.
The fourth mode merges the index generated in incremental mode into a large binary index.

For example:

ft3_indexer -u 100 --db test.db3 --ft3 test_ft3

will index all the text described in ft3_journal, in chunks of 100 documents, in database test.db3 and create a directory test_ft3 in which you will find 4 databases: one for words, one for scores, one for binary scores and one for proximity data.

It is the recommended way to go for small databases.

Testing

First test with the basic searcher:

ft3_searcher --db3 test.db3 --ft3 test_ft3

or simply

ft3_searcher -r test

if you follow the convention that the databases are suffixed with .db3 and _ft3.

This is a very simple interpreter. It understands queries, one per line. The syntax is the classical web syntax: + forces a word, - removes a word, and double quotes (") are used for exact (i.e. phrase) matches. The program prints a relevant part of each document matching your query, ordered by rank. Note that OR between expressions is not supported.

ft3_searcher

This program gives you an interface to all ft3 functions. You start it with the name of your database and the name of its companion:

ft3_searcher --db3 test.db3 --ft3 test_ft3

A simple

ft3> .h

gives you a list of commands.

By default, it performs a full text search with the words you enter. The syntax is the classical web syntax:

ft3> word1 word2

looks for all indexed documents containing word1 and word2.

ft3> word1 word2 -word3

looks for all indexed documents containing word1 and word2 but without word3.

ft3> "word1 word2 word3"

looks for all indexed documents containing the phrase, i.e. word1 followed by word2 followed by word3.

ft3> title:word1

looks for all indexed documents containing the word word1 in the column title of your table.
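
To make the query syntax above concrete, here is a small Python sketch of a parser for it. It is not the parser used by ft3_searcher: the Term class, the parse_query function and the regular expression are assumptions written only to illustrate how +, -, quoted phrases and the column: prefix can be told apart.

```python
import re
from dataclasses import dataclass
from typing import Optional

# One token: optional +/- sign, optional "column:" prefix,
# then either a double-quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:(\w+):)?(?:"([^"]*)"|(\S+))')


@dataclass
class Term:
    text: str                     # a single word, or several words for a phrase
    phrase: bool = False          # True when the term was written in double quotes
    excluded: bool = False        # True when the term was prefixed with -
    column: Optional[str] = None  # e.g. "title" for title:word1


def parse_query(query):
    """Split a query line into Term objects (illustrative only).

    Assumption: a leading + is accepted but changes nothing here,
    because every non-excluded term is already required.
    """
    terms = []
    for sign, column, phrase, word in TOKEN.findall(query):
        terms.append(
            Term(
                text=phrase if phrase else word,
                phrase=bool(phrase),
                excluded=(sign == "-"),
                column=column or None,
            )
        )
    return terms
```

For instance, parse_query('word1 word2 -word3') yields three terms, the last of which has excluded set to True.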

If you switch to SQL mode with

ft3> .x

the prompt changes and you can issue SQL commands and call ft3 functions directly.

sql>

Some examples:

Count the number of words:

sql> select count(*) from ft3_words;

Identify the language of a text:

sql> select ft3_language('This is an english phrase, it should be
identified as such. The phrase is short and it may failed.');

will return

1 'en'

Interface to the word stemmer:

sql> select ft3_stemmer('fr','mangera');

returns

1 mang

Interface to the double metaphone algorithm:

sql> select ft3_dmetaphone('francais');

Find all words with the same metaphone:

sql> select word from ft3_words where metaphone=ft3_dmetaphone('maria');

returns

1 mariah
2 maria
3 marie

Interface to the edit distance between two strings:

sql> select ft3_levenhstein('test','teste');
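
For readers unfamiliar with edit distance, here is a minimal Python sketch of the classical Levenshtein distance with unit costs. It is not the code behind the SQL function above, which may use different costs or optimisations; the function name levenshtein is chosen only for this example.

```python
def levenshtein(a, b):
    """Classical edit distance: each insertion, deletion or substitution costs 1."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute ca by cb (or keep it)
                )
            )
        previous = current
    return previous[-1]
```

With this definition, 'test' and 'teste' are at distance 1, since a single insertion turns one into the other.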

Indexer guide

The goal of the indexer is to fill 3 tables: ft3_words, ft3_scores and ft3_bscores. The first one is a dictionary of the words found in your documents, with some statistics associated with each word and some linguistic information such as stems or metaphones. The second one is an inverted index which gives a convenient way to find all documents containing a word. By intersecting the ft3_scores entries of several words, you can easily find all documents containing all of those words. This table is a pure SQL table, and performance goes down as the table reaches a million rows or so (I guess it depends on your hardware and on how much of the index is cached in RAM). The third table, ft3_bscores, contains the same data as the second one but stored in one BLOB per word. This table takes less space on disk since the BLOB is compressed, and because the patterns in the inverted index are highly repetitive the compression ratio is good. A call to ft3_indexer in merge mode transforms the ft3_scores table into ft3_bscores and cleans up the former.
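
The idea of intersecting an inverted index can be sketched in a few lines of Python with the standard sqlite3 module. The table demo_scores and its two columns are hypothetical and much simpler than the real ft3_scores schema, which is not reproduced here; the sketch only shows the principle.

```python
import sqlite3


def build_demo_index(docs):
    """Build a toy inverted index: one (word, doc_id) row per distinct word per document."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demo_scores (word TEXT, doc_id INTEGER)")
    conn.execute("CREATE INDEX demo_scores_word ON demo_scores (word)")
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            conn.execute("INSERT INTO demo_scores VALUES (?, ?)", (word, doc_id))
    return conn


def docs_with_all(conn, words):
    """Return the ids of the documents containing every word, via SQL INTERSECT."""
    words = sorted({w.lower() for w in words})
    if not words:
        return []
    one_word = "SELECT doc_id FROM demo_scores WHERE word = ?"
    sql = " INTERSECT ".join([one_word] * len(words)) + " ORDER BY doc_id"
    return [row[0] for row in conn.execute(sql, words)]
```

Each word contributes one SELECT over the index, and INTERSECT keeps only the documents present in all of them.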

Both tables contain other fields which will help both at query time and at sorting time.
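
To see why storing the inverted index as one compressed BLOB per word saves space, consider the following Python sketch. The actual BLOB layout of ft3_bscores is not described in this guide, so the format below (delta-encoded 32-bit document ids compressed with zlib) and the names pack_postings and unpack_postings are assumptions made for illustration.

```python
import struct
import zlib


def pack_postings(doc_ids):
    """Pack a list of document ids into a compressed blob (hypothetical format)."""
    ids = sorted(set(doc_ids))
    # Store the gap to the previous id instead of the id itself:
    # gaps are small and repetitive, which compresses well.
    deltas = [b - a for a, b in zip([0] + ids, ids)]
    raw = struct.pack(f"<{len(deltas)}I", *deltas)
    return zlib.compress(raw)


def unpack_postings(blob):
    """Inverse of pack_postings: return the sorted list of document ids."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 4}I", raw)
    ids, total = [], 0
    for delta in deltas:
        total += delta
        ids.append(total)
    return ids
```

For a word that appears in many documents with regular gaps, the blob is far smaller than the 4 bytes per document id of the uncompressed form.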

The indexing process goes through the following steps; the first 3 can be managed by ft3_setup:

Create the companion databases and fill them with default schema
Adjust parameters for your language
Fill the configuration table with your requirements
Launch the indexer
Do some basic checks

Create the companion database and fill it with ft3 schema

ft3_setup --db3 test.db3

will launch an interactive session. You will be prompted for tables, columns and how you want the indexing to proceed.

You can also opt for a non-interactive session:

Edit a config file, for example test_config.sqlite, containing the description of the ft3 indexes, i.e. SQL insert statements.
Create the required ft3 schema in your database:
sqlite3 test.db3 < $prefix/share/ft3/ft3_local.sqlite
Fill the ft3_config table:
sqlite3 test.db3 < test_config.sqlite
Launch the setup:
ft3_setup -s --db3 test.db3

It will create all required tables and triggers for you and do some basic checks.

Launch the indexer

You have different possibilities here. Let us suppose your data is small, say 10000 news articles. Launch the indexer:

ft3_indexer --multipass 20000 -r test

will create the index in one pass (documents are handled in slices of 20000). You are now ready to search the index.

Insert some 100 new articles into the database. The table ft3_journal will be filled, but the documents are not parsed yet and the index is not up to date. Index again in incremental mode:

ft3_indexer --incremental -r test

We now have 2 ft3 indexes, one with the fresh documents and one with the documents indexed in the first pass. The decision to merge both indexes is yours, depending on your CPU, disk and usage requirements. A typical setup is to index incrementally every 10 minutes and to merge every night. By the way, you merge with:

ft3_indexer --merge -r test
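
The "incremental every 10 minutes, merge every night" routine can be sketched as a small Python helper meant to be called from a scheduler every 10 minutes. The function name indexer_commands, the merge_hour parameter and the idea of remembering the day of the last merge are assumptions made for this example; only the two ft3_indexer command lines come from this guide.

```python
import datetime


def indexer_commands(now, last_merge_day, prefix="test", merge_hour=3):
    """Return the ft3_indexer command lines to run at time `now`.

    An incremental pass is always scheduled; a merge is added once per day,
    on the first call at or after `merge_hour` (a hypothetical setting).
    """
    commands = [["ft3_indexer", "--incremental", "-r", prefix]]
    if now.hour >= merge_hour and last_merge_day != now.date():
        commands.append(["ft3_indexer", "--merge", "-r", prefix])
    return commands
```

Each returned list can be passed to subprocess.run(command, check=True); the caller is expected to record today's date as last_merge_day once a merge has succeeded.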

DELETE and UPDATE statements are handled. DROP TABLE is currently not handled, and you have to remove the index by hand.

UPDATE is costly: the document is deleted and inserted again.
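
The way triggers can feed a journal, and why an UPDATE costs a delete plus an insert, can be illustrated with the following Python sketch. The tables articles and demo_journal and the three triggers are hypothetical; they are not the ones created by ft3_setup and only mimic the behaviour described above.

```python
import sqlite3

# Hypothetical schema: every change to "articles" is recorded in "demo_journal".
SCHEMA = """
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, content TEXT);
CREATE TABLE demo_journal (
    seq    INTEGER PRIMARY KEY AUTOINCREMENT,
    doc_id INTEGER,
    action TEXT
);
CREATE TRIGGER articles_ai AFTER INSERT ON articles BEGIN
    INSERT INTO demo_journal (doc_id, action) VALUES (NEW.id, 'insert');
END;
CREATE TRIGGER articles_ad AFTER DELETE ON articles BEGIN
    INSERT INTO demo_journal (doc_id, action) VALUES (OLD.id, 'delete');
END;
CREATE TRIGGER articles_au AFTER UPDATE ON articles BEGIN
    -- An update is journaled as a delete followed by an insert.
    INSERT INTO demo_journal (doc_id, action) VALUES (OLD.id, 'delete');
    INSERT INTO demo_journal (doc_id, action) VALUES (NEW.id, 'insert');
END;
"""


def open_demo_db():
    """Open an in-memory database with the demo tables and triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def pending(conn):
    """Return the journal entries in the order they were recorded."""
    return conn.execute(
        "SELECT doc_id, action FROM demo_journal ORDER BY seq"
    ).fetchall()
```

An indexer working in incremental mode would read such a journal, re-index the listed documents and then clear the processed entries.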

Do some basic checks

Have a look at the 10 most frequent words:

SELECT * FROM ft3_words ORDER BY wordscounter DESC LIMIT 10;

Search guide

ft3d: web services search

ft3d is a web server which answers your requests and outputs results in XML. The XML schema is currently ft3-0.1.xsd. It is pretty straightforward: a first block describes the query, a second one gives information on the process itself and the last one contains the answers.

The dmoz.org search engine demonstration

Dmoz is a famous directory on the internet. They provide a dump of their database in RDF (XML) format. Dmoz currently contains more than 4 million entries. It is an interesting test for ft3 because the data size is near the maximum I want to support without a distributed solution.

Getting the data

Find 10 GB of space on your disk!

Dmoz is large (2GiB of text)!

Download the data from the internet:

Download the ODP data with a browser or a tool like wget or curl.

wget http://rdf.dmoz.org/rdf/content.u8.gz

and don't forget to decompress the file:

gunzip content.u8.gz

Filling a sqlite3 database with dmoz.org content

In the demo subdirectory, a dmoz2sqlite tool has been built. Just try:

dmoz2sqlite content.u8 dmoz.db3

It takes a couple of hours on my laptop ...

Configuring and Indexing the database

Create the companion database and inject ft3 schema:

sqlite3 dmoz.ft3 < ${prefix}/share/ft3/ft3.sqlite

and set up the ft3 configuration:

sqlite3 dmoz.ft3 < ${prefix}/share/ft3/demo/demo_dmoz.config

You can look at the config: basically we index all documents and 4 fields: url, title, description and category. Each field is declared with a different property.

Search with the basic searcher

Search from Apache/PHP

A demo script has been written in PHP 5.1 and is available in the demo directory. Setting up an Apache+PHP web server is outside the scope of this document, but it is really easy and you will find plenty of help on the internet.

Don't forget to start the daemon; choose an open port (here 8080):

ft3d -p 8080 -d dmoz.db3 -i dmoz.ft3

Configure the top of demo/dmoz.php: just declare where your database is located and on which port the daemon ft3d is running.

Configure the Apache or lighttpd daemon and restart it as root; usually something like the following should do it:

/etc/init.d/apache restart