User guide
This section first gives a basic example, then a more detailed description of the indexing process, and finally explains how to search your data with the default searcher. The process is illustrated with a real-world example: setting up a search engine for the English part of dmoz.org.
A basic example
Create and Fill the configuration table
Let us suppose we have a test database called 'test.db3'. It has a table called "text" which contains 2 columns, "title" and "content". We want to index both columns.
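If you do not have such a database at hand, the following Python sketch builds one with the standard sqlite3 module. It is only an illustration: the function name create_test_database and the two sample rows are assumptions made for this example, not part of ft3.

```python
import sqlite3


def create_test_database(path):
    """Create the example database with a "text" table holding two text columns."""
    conn = sqlite3.connect(path)
    with conn:  # commits the transaction on success
        conn.execute('CREATE TABLE IF NOT EXISTS "text" (title TEXT, content TEXT)')
        conn.executemany(
            'INSERT INTO "text" (title, content) VALUES (?, ?)',
            [
                # Sample rows invented for this example.
                ("First note", "A short document about full text search."),
                ("Second note", "Another document, stored in the same table."),
            ],
        )
    return conn
```

Calling create_test_database("test.db3") would produce the file used in the rest of this example. Each row can then be addressed through sqlite's built-in "oid" column.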
First create an empty sqlite3 database which holds all the data produced by ft3; my current convention is to name it with '.ft3' as a suffix. ft3.sqlite is a text file holding the ft3 SQL schema. On Unix platforms, it is installed in $prefix/share/ft3/ft3.sqlite
sqlite3 test.ft3 < ft3.sqlite
We just need to fill the table ft3_config. Insert the data with sqlite3 or another frontend:
INSERT INTO ft3_config VALUES ( 1,
'test', 'text', 1, 'oid', 'title',
7, 1, 0, 0, 0
);
INSERT INTO ft3_config VALUES ( 2,
'test', 'text', 1, 'oid', 'content',
3, 1, 0, 0, 0
);
The first insert tells ft3 that the database is named "test", the table "text" and the column "title". "oid" is the way to get the internal row id under sqlite.
In the first insert, 7 denotes a title, and in the second insert, 3 denotes ordinary text. A complete list of possible values is described in the SQL schema, which is commented below.
Run the indexer
Since the indexer is pretty fast, it is a one-shot process. It will become incremental in the next release, using SQL triggers.
ft3_indexer test.db3 test.ft3
Every 10000 words/documents, the process prints some information (how fast it goes, estimated memory consumption, ...) to give you an estimate of when it will finish.
Testing
First test with the basic searcher:
ft3_searcher test.db3 test.ft3
or simply
ft3_searcher test
if you follow the convention that the databases are postfixed .db3 and .ft3.
This is a very simple interpreter. It understands queries, one per line. The syntax is the classical web syntax: + forces a word, - removes a word, and " is used for an exact match. The program prints a relevant part of each document matching your query, ordered by rank.
ft3_searcher
This program gives you an interface to all ft3 functions. You start it with the name of your database and the name of its companion:
ft3_searcher my.db my.ft3
A simple
ft3> .h
gives you a list of commands.
By default, it performs a full text search with the words you enter. The syntax is classical on the web:
ft3> word1 word2
looks for all indexed documents containing word1 and word2.
ft3> word1 word2 -word3
looks for all indexed documents containing word1 and word2 but without word3.
ft3> "word1 word2 word3"
looks for all indexed documents containing the phrase word1 followed by word2 followed by word3.
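The query syntax above can be sketched as a small parser. This is only an illustration in Python, not ft3's actual implementation: the function name parse_query and the layout of the returned dictionary are assumptions made for the example.

```python
import re

# A token is an optional +/- prefix followed by a quoted phrase or a bare word.
_TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a web-style query into required words, excluded words and phrases."""
    parsed = {"required": [], "excluded": [], "phrases": []}
    for sign, phrase, word in _TOKEN.findall(query):
        if phrase:
            # Quoted text asks for consecutive words; a sign before it is ignored here.
            parsed["phrases"].append([w.lower() for w in phrase.split()])
        elif word:
            # Plain words and +words must be present, -words must be absent.
            target = "excluded" if sign == "-" else "required"
            parsed[target].append(word.lower())
    return parsed
```

For instance, parse_query("word1 word2 -word3") puts word1 and word2 in the required list and word3 in the excluded list.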
If you switch to SQL mode by
ft3> .x
the prompt changes and you can issue SQL commands and call ft3 functions directly.
sql>
Some examples:
- Count the number of words:
sql> select count(*) from ft3_words;
- Identify the language of a text:
sql> select ft3_language('This is an english phrase, it should be identified as such. The phrase is short and it may fail.');
will return
1 'en'
- Interface to the word stemmer:
sql> select ft3_stemmer('fr','francais');
- Interface to the double metaphone algorithm:
sql> select ft3_dmetaphone('francais');
Find all words with the same metaphone:
sql> select word from ft3_words where metaphone=ft3_dmetaphone('maria');
returns
1 mariah
2 maria
3 marie
- Interface to edit distance between two strings:
sql> select ft3_levenhstein('test','teste');
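To make the notion of edit distance concrete, here is a minimal Python sketch of the textbook Levenshtein distance (insertions, deletions and substitutions, each with cost 1). Whether ft3 uses exactly this cost model is an assumption; the function below is an illustration, not ft3's code.

```python
def levenshtein(a, b):
    """Return the number of single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep when equal
                )
            )
        previous = current
    return previous[-1]
```

With this definition, 'test' and 'teste' are at distance 1, since a single insertion turns one into the other.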
Indexer guide
The goal of the indexer is to fill the 2 tables ft3_words and ft3_scores. The first one is a dictionary of the words found in your documents. The second one is an inverted index which gives a convenient way to know all documents containing a word. By intersecting on ft3_scores, you can easily find all documents containing several words.
Both tables contain other fields which will help both at query time and at sorting time.
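The idea of a word dictionary plus an inverted index can be sketched in a few lines of Python with an in-memory sqlite database. The tables demo_words and demo_scores and both function names are hypothetical and much simpler than the real ft3_words and ft3_scores, whose exact columns are defined in the ft3 schema.

```python
import sqlite3


def build_index(documents):
    """Index {doc_id: text} into a word dictionary and an inverted index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE demo_words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE demo_scores (word_id INTEGER, doc_id INTEGER,
                                  PRIMARY KEY (word_id, doc_id));
        """
    )
    for doc_id, text in documents.items():
        for word in set(text.lower().split()):
            conn.execute("INSERT OR IGNORE INTO demo_words (word) VALUES (?)", (word,))
            conn.execute(
                "INSERT INTO demo_scores (word_id, doc_id) "
                "SELECT id, ? FROM demo_words WHERE word = ?",
                (doc_id, word),
            )
    return conn


def documents_with_all(conn, words):
    """Return the sorted ids of the documents containing every word in `words`."""
    if not words:
        return []
    # One SELECT per word; INTERSECT keeps only the documents common to all.
    one_word = (
        "SELECT s.doc_id FROM demo_scores s "
        "JOIN demo_words w ON w.id = s.word_id WHERE w.word = ?"
    )
    sql = " INTERSECT ".join([one_word] * len(words))
    return sorted(row[0] for row in conn.execute(sql, [w.lower() for w in words]))
```

Each extra query word adds one more INTERSECT branch, so the result can only shrink as words are added.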
The indexing process consists of the following steps:
- Create the companion database and fill it with default schema
- Adjust parameters for your language
- Fill the configuration table with your requirements
- Launch the indexer
- Do some basic checks
Create the companion database and fill it with ft3 schema
Adjust parameters for your language
Fill the configuration table with your requirements
Launch the indexer
Do some basic checks
Have a look at the 10 most frequent words.
SELECT * FROM ft3_words ORDER BY wordscounter DESC LIMIT 10;
Search guide
The dmoz.org search engine demonstration
Dmoz is a famous directory on the internet. It provides a dump of its database in RDF (XML) format. Dmoz currently contains more than 4 million entries. It is an interesting test for ft3 because the data size is near the maximum I want to support without a distributed solution.
Getting the data
Find 200 MB of space on your disk!
Dmoz is large (20 MB of text)!
Download the ODP data from the internet with a browser or a tool like wget or curl.
wget http://rdf.dmoz.org/rdf/content.u8.gz
and don't forget to decompress the file:
gunzip content.u8.gz
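If gunzip is not available on your platform, the same step can be done with Python's standard library. This is a sketch only; the function name gunzip is chosen for the example, and unlike the command line tool it leaves the compressed file in place.

```python
import gzip
import shutil


def gunzip(source, destination):
    """Decompress a .gz file in chunks so the whole dump never sits in memory."""
    with gzip.open(source, "rb") as compressed, open(destination, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
```

For the dmoz dump, the call would be gunzip("content.u8.gz", "content.u8").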
Filling a sqlite3 database with dmoz.org content
In the demo subdirectory, a dmoz2sqlite tool has been built. Just try:
dmoz2sqlite content.u8 dmoz.db3
It takes a couple of hours on my laptop ...
Configuring and Indexing the database
Create the companion database and inject ft3 schema:
sqlite3 dmoz.ft3 < ${prefix}/share/ft3/ft3.sqlite
and setup the ft3 configuration:
sqlite3 dmoz.ft3 < ${prefix}/share/ft3/demo/demo_dmoz.config
You can look at the config: basically we index all documents and 4 fields: url, title, description and category. Each field is declared with a different property.