User guide
In this section we first give a basic example, then a more detailed description of the indexing process, and then a description of how to search your data with the default searcher.
Each program is then described in detail:
- ft3_setup
- A small wizard for setting up the required tables, triggers and parameters
- ft3_indexer
- The indexer will create, update and maintain the actual text indexes
- ft3_searcher
- A basic text interface to query your database
- ft3d
- A Unix daemon or Windows service which acts as a web server. It supports HTTP requests and replies in XML. The current schema is ft3-0.1.xsd.
- dmoz2sqlite
- This simple C++ program takes a dmoz.org RDF dump and puts it in a sqlite3 database. It is small, very fast and doesn't use much memory. It is used to support the dmoz.org search engine demonstration.
- wiki2sqlite
- This simple C++ program takes a Wikipedia XML dump and puts it in a sqlite3 database. It is small, very fast and doesn't use much memory.
An illustration of the process is given with a real-world example: setting up a search engine for the English part of dmoz.org. A very similar example is also available in the demo directory: your own private search engine for a Wikipedia database :)
A basic example
Create and Fill the configuration table
Let us suppose we have a test database called 'test.db3'. It has a table called "text" which contains 2 columns, "title" and "content". We want to index both columns.
FT3 comes with a small curses wizard which helps you to set up the required tables.
First of all, if something goes wrong:
ft3_setup -c test.db3
will clean your ft3 configuration.
Launch the wizard in a shell or in a console (cmd on Windows):
ft3_setup test.db3
You will be prompted for numbers to select which tables and then which columns you want to index. You can also set optional parameters. 'q' goes up one level.
If it is not straightforward enough, please drop me a note and I will copy/paste a complete session.
Run the indexer
The indexer supports 4 modes:
- The first mode is the v0.2 one. It is a one-shot process which scans all documents from scratch and rebuilds the index. This mode is deprecated and will be removed in 0.4.
- The second mode is also a one-shot process, but it scans the journal generated by the triggers rather than the documents directly. It consumes less disk space but is slower for small databases.
- The third mode is incremental. Each time you launch the indexer, it scans the journal to find modified, new or deleted entries and re-indexes them.
- The fourth mode is used to merge the index generated in incremental mode into a large binary index.
Run the indexer:
ft3_indexer -u 100 --db test.db3 --ft3 test_ft3
will index all the text described in ft3_journal in chunks of 100 documents from the database test.db3 and create a directory test_ft3 in which you will find 4 databases: one for words, one for scores, one for binary scores and one for proximity data.
This is the recommended way to go for small databases.
Testing
First test with the basic searcher:
ft3_searcher --db3 test.db3 --ft3 test_ft3
or simply
ft3_searcher -r test
if you follow the convention that the databases are suffixed with .db3 and _ft3.
This is a very simple interpreter. It understands queries, one per line. The syntax is the classical web syntax: + forces a word, - removes a word, and double quotes (") are used for an exact (i.e. phrase) match. The program prints a relevant part of each document matching your query, ordered by rank. Note that OR between expressions is not supported.
ft3_searcher
This program gives you an interface to all ft3 functions. You start it with the name of your database and the name of its companion:
ft3_searcher --db3 test.db3 --ft3 test_ft3
A simple
ft3> .h
gives you a list of commands.
By default, it performs a full text search with the words you enter. The syntax is the classical web syntax:
ft3> word1 word2
looks for all indexed documents containing word1 and word2.
ft3> word1 word2 -word3
looks for all indexed documents containing word1 and word2 but without word3.
ft3> "word1 word2 word3"
looks for all indexed documents containing the phrase, i.e. word1 followed by word2 followed by word3.
ft3> title:word1
looks for all indexed documents containing the word word1 in the column title of your table.
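The query syntax above can be illustrated with a few lines of Python. This is only a sketch of how such a query line might be split into required words, excluded words, phrases and column-restricted words. The function name parse_query and the dictionary it returns are hypothetical and are not part of ft3, whose own parser may behave differently on edge cases.

```python
import re

# Hypothetical helper, not part of ft3: split a web-style query into its parts.
# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')

def parse_query(line):
    parsed = {"required": [], "excluded": [], "phrases": [], "fields": []}
    for phrase, word in TOKEN.findall(line):
        if not phrase and not word:
            continue  # an empty pair of quotes carries no information
        if phrase:
            # "word1 word2" -> exact phrase match
            parsed["phrases"].append(phrase.split())
        elif word.startswith("-") and len(word) > 1:
            # -word -> the word must not appear
            parsed["excluded"].append(word[1:])
        elif ":" in word and not word.startswith(":"):
            # column:word -> the word must appear in that column
            column, term = word.split(":", 1)
            parsed["fields"].append((column, term))
        else:
            # a leading + forces the word; plain words are required as well
            parsed["required"].append(word.lstrip("+"))
    return parsed
```

For instance, parse_query('word1 word2 -word3') puts word1 and word2 in "required" and word3 in "excluded". There is no branch for OR, in line with the searcher described here.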
If you switch to SQL mode with
ft3> .x
the prompt changes and you can issue SQL commands and call ft3 functions directly.
sql>
Some examples:
- Count the number of words:
sql> select count(*) from ft3_words;
- Identify the language of a text:
sql> select ft3_language('This is an english phrase, it should be identified as such. The phrase is short and it may failed.');
will return
1 'en'
- Interface to the word stemmer:
sql> select ft3_stemmer('fr','mangera');
returns
1 mang
- Interface to double metaphone algorithm:
sql> select ft3_dmetaphone('francais');
Find all words with the same metaphone:
sql> select word from ft3_words where metaphone=ft3_dmetaphone('maria');
returns
1 mariah
2 maria
3 marie
- Interface to edit distance between two strings:
sql> select ft3_levenhstein('test','teste');
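For readers unfamiliar with edit distance, here is a minimal Python sketch of the classic Levenshtein algorithm (insertions, deletions and substitutions each cost 1). It is an illustration only: the function below is not ft3 code, and the ft3 function may differ in details such as how it treats multi-byte characters.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

print(levenshtein("test", "teste"))  # prints 1
```

Only two rows of the table are kept at any time, so memory use grows with the length of the second string rather than with the product of both lengths.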
Indexer guide
The goal of the indexer is to fill 3 tables: ft3_words, ft3_scores and ft3_bscores.
The first one is a dictionary of the words found in your documents, with some statistics associated with each word and some linguistic information like stems or metaphones. The second one is an inverted index which gives a convenient way to find all documents containing a word. By intersecting on ft3_scores, you can easily find all documents containing several words. This table is a pure SQL table and performance goes down as the table reaches a million rows or so (I guess it depends on your hardware and how much of the index is cached in RAM). The third table, ft3_bscores, contains the same data as the second one but in a BLOB for each word. This table takes less space on disk since the BLOB is compressed, and due to the highly repetitive patterns in the inverted index the compression ratio is good. A call to ft3_indexer in merge mode will transform the ft3_scores table into ft3_bscores and clean up the former.
Both tables contain other fields which will help both at query time and at sorting time.
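The idea of a word dictionary plus an inverted index, and of intersecting posting lists to find documents containing several words, can be sketched with Python's sqlite3 module. The tables demo_words and demo_scores and their columns are hypothetical stand-ins chosen for this example; the real ft3_words and ft3_scores schemas contain more fields and may be organised differently.

```python
import sqlite3

def build_toy_index(docs):
    """Build a tiny inverted index in memory. Table names are hypothetical."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE demo_words (id INTEGER PRIMARY KEY, word TEXT UNIQUE)")
    con.execute("CREATE TABLE demo_scores (word_id INTEGER, doc_id INTEGER)")
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            # dictionary entry: one row per distinct word
            con.execute("INSERT OR IGNORE INTO demo_words (word) VALUES (?)", (word,))
            # posting: this word occurs in this document
            con.execute(
                "INSERT INTO demo_scores (word_id, doc_id) "
                "SELECT id, ? FROM demo_words WHERE word = ?", (doc_id, word))
    return con

def docs_with_all(con, words):
    """Intersect the posting lists of every word in `words` (must be non-empty)."""
    query = " INTERSECT ".join(
        "SELECT s.doc_id FROM demo_scores s "
        "JOIN demo_words w ON w.id = s.word_id WHERE w.word = ?"
        for _ in words)
    return sorted(row[0] for row in con.execute(query, words))

con = build_toy_index({1: "sqlite full text search",
                       2: "full text indexing",
                       3: "sqlite database"})
print(docs_with_all(con, ["full", "text"]))  # prints [1, 2]
```

With one row per (word, document) pair, a table of this kind grows quickly, which is consistent with the slowdown described above for large pure SQL indexes.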
The indexing process goes step by step; the first 3 steps can be managed by ft3_setup:
- Create the companion databases and fill them with default schema
- Adjust parameters for your language
- Fill the configuration table with your requirements
- Launch the indexer
- Do some basic checks
Create the companion database and fill it with ft3 schema
ft3_setup --db3 test.db3
will launch an interactive session. You will be prompted for tables, columns and how you want the indexing to proceed.
You can also opt for a non-interactive session:
- Edit a config file, for example test_config.sqlite, with the ft3 index descriptions, i.e. SQL insert statements
- Create the ft3 schema required in your database:
sqlite3 test.db3 < $prefix/share/ft3/ft3_local.sqlite
- Fill the ft3_config table:
sqlite3 test.db3 < test_config.sqlite
- Launch the setup:
ft3_setup -s --db3 test.db3
It will create all required tables and triggers for you and do some basic checks.
Launch the indexer
You have different possibilities here. Let us suppose your data is small, say 10000 news articles. Launch the indexer:
ft3_indexer --multipass 20000 -r test
will create the index in one pass (documents are handled in slices of 20000). You are now ready to search the index.
Insert some 100 new articles into the database. The table ft3_journal will be filled, but the documents are not parsed yet and the index is not up to date. Index again in incremental mode:
ft3_indexer --incremental -r test
We now have 2 ft3 indexes, one with the fresh documents and one with the documents indexed in the first pass. The decision to merge both indexes is yours, depending on your CPU, disk and usage requirements. A typical usage is to index incrementally every 10 minutes and to merge every night. By the way, you merge with:
ft3_indexer --merge -r test
DELETE and UPDATE statements are handled. DROP TABLE is currently not handled, so you have to remove the index by hand.
UPDATE is costly, the document is deleted and inserted again.
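The way triggers can feed a journal that an incremental indexer later replays may be sketched as follows. Everything here is an assumption made for illustration: the articles table, the demo_journal table, its columns and the trigger names are hypothetical, and the real ft3_journal and the triggers created by ft3_setup may look quite different. The sketch only shows the principle, including an update being recorded as a delete followed by an insert.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, content TEXT);
-- Hypothetical journal: one row per change that still has to be indexed.
CREATE TABLE demo_journal (doc_id INTEGER, action TEXT);

CREATE TRIGGER articles_ai AFTER INSERT ON articles BEGIN
    INSERT INTO demo_journal VALUES (new.id, 'insert');
END;
CREATE TRIGGER articles_ad AFTER DELETE ON articles BEGIN
    INSERT INTO demo_journal VALUES (old.id, 'delete');
END;
-- An update is journaled as a delete followed by an insert.
CREATE TRIGGER articles_au AFTER UPDATE ON articles BEGIN
    INSERT INTO demo_journal VALUES (old.id, 'delete');
    INSERT INTO demo_journal VALUES (new.id, 'insert');
END;
""")

con.execute("INSERT INTO articles (title, content) VALUES ('a', 'first')")
con.execute("UPDATE articles SET content = 'second' WHERE id = 1")
con.execute("DELETE FROM articles WHERE id = 1")

journal = con.execute(
    "SELECT doc_id, action FROM demo_journal ORDER BY rowid").fetchall()
print(journal)
```

In this toy journal a single UPDATE produces two entries, which gives an intuition for why updates cost more than plain inserts or deletes.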
Do some basic checks
Have a look at the 10 most frequent words:
SELECT * FROM ft3_words ORDER BY wordscounter DESC LIMIT 10;
Search guide
ft3d: web services search
ft3d is a web server which answers your requests and outputs results in XML. The XML schema is currently ft3-0.1.xsd. It is pretty straightforward: a first block describes the query, a second one gives information on the process itself and the last one contains the answers.
The dmoz.org search engine demonstration
Dmoz is a famous directory on the internet. They provide a dump of their database in RDF (XML) format. Dmoz currently contains more than 4 million entries. It is an interesting test for ft3 because the data is near the maximum I want to support without a distributed solution.
Getting the data
Find 10 GB of free space on your disk!
Dmoz is large (2 GiB of text)!
Download the ODP data from the internet with a browser or a tool like wget or curl:
wget http://rdf.dmoz.org/rdf/content.u8.gz
and don't forget to decompress the file:
gunzip content.u8.gz
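If gunzip is not available (for example on Windows), the decompression step can be done with Python's standard library. This is a minimal sketch; the helper name gunzip and its chunk_size parameter are choices made for this example.

```python
import gzip
import shutil

def gunzip(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress src_path into dst_path without loading it all in memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copy in fixed-size chunks so a multi-gigabyte dump stays manageable
        shutil.copyfileobj(src, dst, chunk_size)
```

It would be called as gunzip("content.u8.gz", "content.u8"). Unlike the gunzip command, this helper keeps the compressed file, so both files occupy disk space until you delete the .gz yourself.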
Filling a sqlite3 database with dmoz.org content
In the demo subdirectory, a dmoz2sqlite tool has been built. Just try:
dmoz2sqlite content.u8 dmoz.db3
It takes a couple of hours on my laptop ...
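To give an idea of what such a conversion involves, here is a rough Python sketch that streams page entries from an XML dump into a sqlite3 table. It is not the dmoz2sqlite program and makes several assumptions: the pages table and its columns are hypothetical (they mirror the url, title, description and category fields indexed in this demo), the sample document is a simplified imitation of the dump with placeholder namespace URIs, and the element names (ExternalPage, Title, Description, topic) and the about attribute are assumed rather than taken from the real schema.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

def local(tag):
    """Strip an XML namespace: '{uri}Title' -> 'Title'."""
    return tag.rsplit("}", 1)[-1]

def load_pages(xml_source, con):
    """Stream ExternalPage elements into a hypothetical `pages` table."""
    con.execute("CREATE TABLE IF NOT EXISTS pages "
                "(url TEXT, title TEXT, description TEXT, category TEXT)")
    count = 0
    # iterparse reads the file incrementally instead of building the whole tree
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if local(elem.tag) != "ExternalPage":
            continue
        fields = {local(child.tag): (child.text or "") for child in elem}
        con.execute("INSERT INTO pages VALUES (?, ?, ?, ?)",
                    (elem.get("about", ""), fields.get("Title", ""),
                     fields.get("Description", ""), fields.get("topic", "")))
        elem.clear()  # drop the element's children and text to limit memory use
        count += 1
    con.commit()
    return count

# Simplified sample with placeholder namespaces (assumption, not the real dump).
SAMPLE = b"""<RDF xmlns:d="http://example.org/d" xmlns="http://example.org/rdf">
  <ExternalPage about="http://www.sqlite.org/">
    <d:Title>SQLite</d:Title>
    <d:Description>Embedded SQL database engine.</d:Description>
    <topic>Top/Computers/Software/Databases</topic>
  </ExternalPage>
</RDF>"""

con = sqlite3.connect(":memory:")
print(load_pages(io.BytesIO(SAMPLE), con))  # prints 1
```

Matching on local tag names keeps the sketch independent of the exact namespace URIs used by the dump.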
Configuring and Indexing the database
Create the companion database and inject ft3 schema:
sqlite3 dmoz.ft3 < ${prefix}/share/ft3/ft3.sqlite
and set up the ft3 configuration:
sqlite3 dmoz.ft3 < ${prefix}/share/ft3/demo/demo_dmoz.config
You can look at the config: basically we index all documents and 4 fields: url, title, description and category. Each field is declared with a different property.
Search with the basic searcher
Search from Apache/PHP
A demo script has been written in PHP 5.1 and is available in the demo directory. Setting up an Apache+PHP web server is outside the scope of this document, but it is really easy and you will find plenty of help on the internet.
Don't forget to start the daemon; choose an open port (here 8080):
ft3d -p 8080 -d dmoz.db3 -i dmoz.ft3
Configure the top of demo/dmoz.php: just declare where your database is located and on which port the ft3d daemon is running.
Configure the Apache or lighttpd daemon and restart it as root; usually something like this should do it:
/etc/init.d/apache restart