Full Text Search is an extension to DuckDB that allows for search through strings, similar to SQLite's FTS5 extension.
API
The extension adds two PRAGMA
statements to DuckDB: one to create, and one to drop an index. Additionally, a scalar macro stem
is added, which is used internally by the extension.
PRAGMA create_fts_index
create_fts_index(input_table, input_id, *input_values, stemmer='porter', stopwords='english',
ignore='(\\.|[^a-z])+', strip_accents=1, lower=1, overwrite=0)
PRAGMA
that creates a FTS index for the specified table.
Name | Type | Description |
---|---|---|
input_table |
VARCHAR |
Qualified name of specified table e.g. 'table_name' or 'main.table_name' |
input_id |
VARCHAR |
Column name of document identifier e.g. 'document_identifier' |
\*input_values |
VARCHAR |
Column names of the text fields to be indexed (vararg) e.g. 'text_field_1' , 'text_field_2' , …, 'text_field_N' , or '\*' for all columns in input_table of type VARCHAR |
stemmer |
VARCHAR |
The type of stemmer to be used. One of 'arabic' , 'basque' , 'catalan' , 'danish' , 'dutch' , 'english' , 'finnish' , 'french' , 'german' , 'greek' , 'hindi' , 'hungarian' , 'indonesian' , 'irish' , 'italian' , 'lithuanian' , 'nepali' , 'norwegian' , 'porter' , 'portuguese' , 'romanian' , 'russian' , 'serbian' , 'spanish' , 'swedish' , 'tamil' , 'turkish' , or 'none' if no stemming is to be used. Defaults to 'porter' |
stopwords |
VARCHAR |
Qualified name of table containing a single VARCHAR column containing the desired stopwords, or 'none' if no stopwords are to be used. Defaults to 'english' for a pre-defined list of 571 English stopwords |
ignore |
VARCHAR |
Regular expression of patterns to be ignored. Defaults to '(\\.|[^a-z])+' , ignoring all escaped and non-alphabetic lowercase characters |
strip_accents |
BOOLEAN |
Whether to remove accents (e.g. convert á to a ). Defaults to 1 |
lower |
BOOLEAN |
Whether to convert all text to lowercase. Defaults to 1 |
overwrite |
BOOLEAN |
Whether to overwrite an existing index on a table. Defaults to 0 |
This PRAGMA
builds the index under a newly created schema. The schema will be named after the input table: if an index is created on table 'main.table_name'
, then the schema will be named 'fts_main_table_name'
.
PRAGMA drop_fts_index
drop_fts_index(input_table)
Drops a FTS index for the specified table.
Name | Type | Description |
---|---|---|
input_table | VARCHAR |
Qualified name of input table e.g. 'table_name' or 'main.table_name' |
match_bm25
match_bm25(input_id, query_string, fields := NULL, k := 1.2, b:= 0.75, conjunctive := 0)
When an index is built, this retrieval macro is created that can be used to search the index.
Name | Type | Description |
---|---|---|
input_id | VARCHAR |
Column name of document identifier e.g. 'document_identifier' |
query_string | VARCHAR |
The string to search the index for |
fields | VARCHAR |
Comma-separarated list of fields to search in e.g. 'text_field_2,text_field_N' . Defaults to NULL to search all indexed fields |
k | DOUBLE |
Parameter k1 in the Okapi BM25 retrieval model. Defaults to 1.2 |
b | DOUBLE |
Parameter b in the Okapi BM25 retrieval model. Defaults to 0.75 |
conjunctive | BOOLEAN |
Whether to make the query conjunctive i.e. all terms in the query string must be present in order for a document to be retrieved |
stem
stem(input_string, stemmer)
Reduces words to their base. Used internally by the extension.
Name | Type | Description |
---|---|---|
input_string | VARCHAR |
The column or constant to be stemmed |
stemmer | VARCHAR |
The type of stemmer to be used. One of 'arabic' , 'basque' , 'catalan' , 'danish' , 'dutch' , 'english' , 'finnish' , 'french' , 'german' , 'greek' , 'hindi' , 'hungarian' , 'indonesian' , 'irish' , 'italian' , 'lithuanian' , 'nepali' , 'norwegian' , 'porter' , 'portuguese' , 'romanian' , 'russian' , 'serbian' , 'spanish' , 'swedish' , 'tamil' , 'turkish' , or 'none' if no stemming is to be used. |
Example Usage
-- create a table and fill it with text data
CREATE TABLE documents(document_identifier VARCHAR, text_content VARCHAR, author VARCHAR, doc_version INTEGER);
INSERT INTO documents
VALUES ('doc1', 'The mallard is a dabbling duck that breeds throughout the temperate.','Hannes Mühleisen', 3),
('doc2', 'The cat is a domestic species of small carnivorous mammal.', 'Laurens Kuiper', 2);
-- build the index (make both the 'text_content' and 'author' columns searchable)
PRAGMA create_fts_index('documents', 'document_identifier', 'text_content', 'author');
-- search the 'author' field index for documents that are written by Hannes - this retrieves 'doc1'
SELECT text_content, score
FROM (SELECT *, fts_main_documents.match_bm25(document_identifier, 'Muhleisen', fields := 'author') AS score
FROM documents) sq
WHERE score IS NOT NULL
AND doc_version > 2
ORDER BY score DESC;
-- search for documents about 'small cats' - this retrieves 'doc2'
SELECT text_content, score
FROM (SELECT *, fts_main_documents.match_bm25(document_identifier, 'small cats') AS score
FROM documents) sq
WHERE score IS NOT NULL
ORDER BY score DESC;
Caveats
Note that the FTS index will not update automatically when input table changes. A workaround of this limitation can be recreating the index to refresh.