- 05 Jul 2023
- 3 minutes to read
Full text search in PostgreSQL
- Updated on 05 Jul 2023
- 3 minutes to read
Full text search (FTS) allows a sophisticated approach to text search in Totara, particularly within the catalogue. Details of how to set this up using PostgresSQL can be found on this page.
By default, PostgreSQL database supports full text search in phrases, words, and partial searches from the beginning of the word (e.g. "Edu*" will return items that contain the word "education" in them).
Language
PostgreSQL uses the 'English' configuration for full text search (FTS) by default. For a list of available options, see the results of "SELECT cfgname FROM pg_ts_config;". For example:
$CFG->dboptions['ftslanguage'] = 'english';
$CFG->dboptions['ftslanguage'] = 'simple';
$CFG->dboptions['ftslanguage'] = 'german';
PostgreSQL does not support Japanese and other languages with very short words without spaces in between. Enable the following setting to get a basic experimental support of these languages:
$CFG->dboptions['fts3bworkaround'] = true;
php admin/cli/fts_repopulate_tables.php
Minimum search term length
PostgreSQL doesn't have specific default length requirements and may find all of the words, provided they are words commonly found in a dictionary.
Stop words
Stop words are a set of words that will be excluded from index and search queries. They depend on the language being used during indexing and search, as well as dictionary presence for database installation.
For example, if a user searches for "Hotels in Wellington" then the system will exclude "in" from the search.
More details on PostgreSQL stop words can be found in the PostgreSQL help documentation.
Morphological search
An important thing to note is that PostgreSQL does not support advanced language functions out of the box, so you will need to install a specific Hunspell dictionary for the language that you intend to use. There are a few places you can download the languages from, such as LibreOffice Extensions and Mozilla Addons.
Below you can find an example of how to enable morphological search in PostgreSQL on Ubuntu 18.04 for the Swedish language. This is not a definitive guide, rather it is an example of how it can be done in one particular case. Please, refer to PostgreSQL documentation to get more information.
For example, if we take the Swedish language, the following will not work out of the box:
-- This will return {äppelträd} as lexemes instead of {äppelträd,äppel,träd} as we would expect
select * from ts_debug('pg_catalog.swedish', 'äppelträd');
In order for PostgreSQL to interpret the above correctly, you need to install the Hunspell Swedish dictionary by following these steps:
- Go to Mozilla Addons and download the Swedish dictionary (right-click on the Add to Firefox button and click on Save link as).
- Extract the file that was downloaded (sometimes the downloaded file does not have an extension so just add .oxt)
- Copy the *.aff, *.dic files to the tsearch_data directory of PostgreSQL and rename to sv_se.affix and sv_se.dict.
- Confirm that the files are UTF-8 encoded. The below will print out the current encoding of the files:
file -i sv_se.*
If the files are not UTF-8 encoded you can execute the following commands (just replace ISO-8859-1 with the encoding output from above):
iconv -f ISO-8859-1 -t UTF-8//TRANSLIT sv_se.affix -o sv_se.affix
iconv -f ISO-8859-1 -t UTF-8//TRANSLIT sv_se.dict -o sv_se.dict
- Execute the following commands in PostgreSQL:
CREATE TEXT SEARCH DICTIONARY sv_hunspell(
Template = ispell,
DictFile = sv_se,
AffFile = sv_se,
Stopwords = swedish
);
CREATE TEXT SEARCH CONFIGURATION sv_hunspell(parser = default);
ALTER TEXT SEARCH CONFIGURATION sv_hunspell ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,word, hword, hword_part WITH sv_hunspell;
-- This will return the expected {äppelträd,äppel,träd} lexemes:
select * from ts_debug('sv_hunspell', 'äppelträd');
© Copyright 2024 Totara Learning Solutions. All rights reserved. Some content originally obtained via GPLv3 license and continues to be available under GPLv3. All other content is the sole copyright of Totara Learning Solutions.