- 26 Sep 2024
- 2 minutes to read
Full text search in MySQL
- Updated on 26 Sep 2024
- 2 minutes to read
Full text search (FTS) allows a sophisticated approach to text search in Totara, particularly within the catalogue. Details of how to set this up using MySQL can be found on this page.
Language support
MySQL uses case- and accent-insensitive collation for full text search (FTS) by default. However, you can specify a different collation, for example:
$CFG->dboptions['ftslanguage'] = 'utf8_unicode_ci';
$CFG->dboptions['ftslanguage'] = 'utf8mb4_0900_as_ci';
$CFG->dboptions['ftslanguage'] = 'utf8mb4_de_pb_0900_ai_ci';
MySQL does not support Japanese and other languages with very short words without spaces in between. Enable the following setting to get a basic experimental support of these languages:
$CFG->dboptions['fts3bworkaround'] = true;
After any of these changes, repopulate FTS tables and rebuild indexes by running:
php admin/cli/fts_repopulate_tables.php
php admin/cli/fts_rebuild_indexes.php
Minimum search term length
The MySQL default search character limit can be changed by editing the MySQL configuration file:
[mysqld]
innodb_ft_min_token_size=2
ft_min_word_len=2
Stop words
Stop words are a set of words that will be excluded from index and search queries. They depend on language being used during indexing and search, as well as dictionary presence for database installation. For example, if a user searches for 'Hotels in Wellington' then the system excludes 'in' from the search.
More details on MySQL stop words can be found in the MySQL help documentation.
Morphological search
Ngram is a built-in MySQL full-text parser. It determines the beginning and end of words using white space and particular letter sequences. It is usually enabled by default, but this can depend on distribution. It can parse two or more words stems from compound words. For example, for the German word 'Fußballweltmeisterschaft', it will search by words like 'meister', 'schaft' and so on. A full-text index made without using this plugin will not able to split the word into smaller pieces, and the SQL that is trying to look for a keyword such as 'meister' will not be able to find the record.
For some ideographic languages (like the example above), the normal full-text index will be limited to what it can search for, since there are no delimiters for those words. This is where Ngram can be useful. It is a contiguous sequence of a number of characters from a sequence of text. The main function of the Ngram full-text parser is tokenising a sequence of text into a contiguous sequence of N characters.
In some cases, Ngram can return false positive results (although they are low rated and appear towards the end of the list), so it is advised that you confirm that search behaviour suits your needs after enabling Ngram support.
To enable Ngram, change your config.php setting as follows:
$CFG->dboptions['ftsngram'] = true;
Next, run:
php admin/cli/fts_rebuild_indexes.php
This setting has effect only in MySQL.
Unaccented search
By default, MySQL supports diacritics, but it depends on collation and language defined in $CFG->dboptions['ftslanguage']. For example, when collation is utf8_general_ci, then the search query containing either word 'första' or 'forsta' will return a record containing word 'första'.
© Copyright 2024 Totara Learning Solutions. All rights reserved. Some content originally obtained via GPLv3 license and continues to be available under GPLv3. All other content is the sole copyright of Totara Learning Solutions.