[SOLVED] How to get an /exact/ word match in SQLite3's FTS5?
ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
How to get an /exact/ word match in SQLite3's FTS5?
I'm looking at the FTS5 full text searching module for SQLite3 and have a question about how to limit the search to an exact string.
If I have the following quick database:
Code:
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE metadata(recno integer, term varchar(25) not null, value varchar(256) not null);
INSERT INTO metadata VALUES(1,'dc.title','a foo/bar baz b');
INSERT INTO metadata VALUES(2,'dc.title','a bar foo/baz b');
INSERT INTO metadata VALUES(3,'dc.title','a barbar foo b');
INSERT INTO metadata VALUES(4,'dc.title','a foofo bar b');
PRAGMA writable_schema=ON;
INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','text','text',0,'CREATE VIRTUAL TABLE text USING FTS5( m, tokenize="trigram case_sensitive 1")');
CREATE TABLE IF NOT EXISTS 'text_data'(id INTEGER PRIMARY KEY, block BLOB);
INSERT INTO text_data VALUES(1,X'0431');
INSERT INTO text_data VALUES(10,X'000000000101010001010101');
INSERT INTO text_data VALUES(137438953473,X'000000ee043020626101020b0102030102030102090202666f01020301020701020a01020301032f626101020701020b01036120620202020102020301660102020302020202722001020901020501020801020b03016203020502027a2001020d01020d01036261720102080102040104040501020a03017a01020c01020c0103666f2004020703016f01020401020801020b01020401036f206203020d01020802022f6201020601020a0202666f04020602026f2003020c03012f010205010209030166040205010372206201020a03020c0301660202060102090202626103020601037a206201020e01020e0411100b0b0910060a1209080f0b0a070709060b0907');
CREATE TABLE IF NOT EXISTS 'text_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
INSERT INTO text_idx VALUES(1,X'',2);
CREATE TABLE IF NOT EXISTS 'text_content'(id INTEGER PRIMARY KEY, c0);
INSERT INTO text_content VALUES(1,'a foo/bar baz b');
INSERT INTO text_content VALUES(2,'a bar foo/baz b');
INSERT INTO text_content VALUES(3,'a barbar foo b');
INSERT INTO text_content VALUES(4,'a foofo bar b');
CREATE TABLE IF NOT EXISTS 'text_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
INSERT INTO text_docsize VALUES(1,X'0d');
INSERT INTO text_docsize VALUES(2,X'0d');
INSERT INTO text_docsize VALUES(3,X'0c');
INSERT INTO text_docsize VALUES(4,X'0b');
CREATE TABLE IF NOT EXISTS 'text_config'(k PRIMARY KEY, v) WITHOUT ROWID;
INSERT INTO text_config VALUES('version',4);
PRAGMA writable_schema=OFF;
COMMIT;
It create two tables. One has four rows, the other is the FTS representation of those four rows. If I I provide the following query,
Code:
sqlite> SELECT rowid,m FROM text WHERE m MATCH ('foof');
4|a foofo bar b
the database returns partial matches instead of an empty set. The string 'foof', as a whole word, does not occur anywhere. What do I need to set so I can search for exact word matches and not have it glob either end automatially?
INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','text','text',0,'CREATE VIRTUAL TABLE text USING FTS5( m, tokenize="trigram case_sensitive 1")');
So don't use the trigram tokenizer if you don't want partial matches.
Thanks. Cloudflare is blocking my new SQL example. I'll keep trying. However, I can say that without the trigram tokenizer, it now looks for the whole sequence, if all letters, but still works with a fragment rather than requiring a whole sequence if non-letters are used (or at least I think that is the case):
What I'm aiming to do is find a way to search for exact URLs which have been mixed into a lot of text. It is thus important to differentiate between the following:
such that if I search for a non-existing URL such as "http:/localhost/one/" it will return an empty set, but if I search for "http:/localhost/one/three/" then it will return one record.
Again, I'll keep fighting Cloudflare here to see if I can get a new example through.
Here is the try without the trigram tokenizer. I have added a fifth record.
Code:
sqlite> SELECT rowid,m FROM text WHERE m MATCH '"foo/ba"';
sqlite> SELECT rowid,m FROM text WHERE m MATCH '"foo/bar"';
1|a foo/bar baz b
5|b foo/bar/bar baz b
The first query above returns an empty set, as expected. However, the second search should also return an empty set, but doesn't.
What I'm aiming to do is find a way to search for exact URLs which have been mixed into a lot of text. It is thus important to differentiate between the following:
such that if I search for a non-existing URL such as "http:/localhost/one/" it will return an empty set, but if I search for "http:/localhost/one/three/" then it will return one record.
Code:
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE metadata(recno integer, term varchar(25) not null, value varchar(256) not null);
INSERT INTO metadata VALUES(1,'dc.title','a foo/bar baz b');
INSERT INTO metadata VALUES(2,'dc.title','a bar foo/baz b');
INSERT INTO metadata VALUES(3,'dc.title','a barbar foo b');
INSERT INTO metadata VALUES(4,'dc.title','a foofo bar b');
PRAGMA writable_schema=ON;
INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','text','text',0,'CREATE VIRTUAL TABLE text USING FTS5( m )');
CREATE TABLE IF NOT EXISTS 'text_data'(id INTEGER PRIMARY KEY, block BLOB);
INSERT INTO text_data VALUES(1,X'0518');
INSERT INTO text_data VALUES(10,X'000000000102020002010101020101');
INSERT INTO text_data VALUES(137438953473,X'0000005502306101020201020201020201020201016201020601020601020501020502026172010204010203020204040362617203020303017a0102050102050103666f6f0102030102040102040402666f040203040f0f0d08090e');
INSERT INTO text_data VALUES(274877906945,X'0000002102306205040207020261720504040303017a0502060103666f6f05020304070806');
CREATE TABLE IF NOT EXISTS 'text_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
INSERT INTO text_idx VALUES(1,X'',2);
INSERT INTO text_idx VALUES(2,X'',2);
CREATE TABLE IF NOT EXISTS 'text_content'(id INTEGER PRIMARY KEY, c0);
INSERT INTO text_content VALUES(1,'a foo/bar baz b');
INSERT INTO text_content VALUES(2,'a bar foo/baz b');
INSERT INTO text_content VALUES(3,'a barbar foo b');
INSERT INTO text_content VALUES(4,'a foofo bar b');
INSERT INTO text_content VALUES(5,'b foo/bar/bar baz b');
CREATE TABLE IF NOT EXISTS 'text_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
INSERT INTO text_docsize VALUES(1,X'05');
INSERT INTO text_docsize VALUES(2,X'05');
INSERT INTO text_docsize VALUES(3,X'04');
INSERT INTO text_docsize VALUES(4,X'04');
INSERT INTO text_docsize VALUES(5,X'06');
CREATE TABLE IF NOT EXISTS 'text_config'(k PRIMARY KEY, v) WITHOUT ROWID;
INSERT INTO text_config VALUES('version',4);
PRAGMA writable_schema=OFF;
COMMIT;
CREATE VIRTUAL TABLE text USING fts5(a, b,
tokenize = "ascii tokenchars '/:%'"
);
(I'm not sure how to escape the quotes so they will be passed correctly to the INSERT INTO sqlite_schema... construct you are using, so I took the example from the docs as the base instead.)
CREATE VIRTUAL TABLE text USING fts5(a, b,
tokenize = "ascii tokenchars '/:%'"
);
(I'm not sure how to escape the quotes so they will be passed correctly to the INSERT INTO sqlite_schema... construct you are using, so I took the example from the docs as the base instead.)
Yes! Thank you! (And thanks everyone else!) Here is the final formula:
Code:
CREATE VIRTUAL TABLE text USING FTS5( m, tokenize = "unicode61 tokenchars '/:%'" );
I had misread the tokenchars option multiple times.
Perl-compatible regular expression support for the SQLite is not necessarily the kind of code which needs frequent updates, but on the other hand not having been updated in over 15 years raises questions.
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.