inc/fulltext.php
Provides fulltext search using the FTS3 module of SQLite (https://www.sqlite.org/fts3.html). FTS3 and FTS4 are SQLite virtual table modules that allow users to perform full-text searches on a set of documents. The most common (and effective) way to describe full-text searches is "what Google, Yahoo, and Bing do with documents placed on the World Wide Web".
FTS3 allows searches for words in a text with Boolean logic (AND, OR). By default, a search term must match an entire word: "sofa" finds all documents with the word "sofa", but not "sofawiki". FTS3 supports a trailing wildcard, "sofa*", which also finds "sofawiki". It does not support a leading wildcard such as "*wiki" to find "sofawiki". This limitation exists because the search is based on a b-tree structure, which finds matches by prefix in O(log(n)) time. To find all words containing "wiki", all records would have to be scanned in O(n) time.
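The difference between prefix and substring lookups can be illustrated with a sorted word list, which behaves like a b-tree index for this purpose. The following Python sketch is only an illustration, not SofaWiki or SQLite code; the word list and the function names are made up for the example.

```python
from bisect import bisect_left

# Hypothetical vocabulary, kept sorted like the keys of a b-tree index.
words = sorted(["sofa", "sofabed", "sofawiki", "table", "wiki", "wikipedia"])

def prefix_matches(prefix):
    """Find words starting with prefix: binary search, then read the neighbours."""
    start = bisect_left(words, prefix)   # O(log n) jump to the first candidate
    result = []
    for word in words[start:]:
        if not word.startswith(prefix):
            break                        # sorted order: no later word can match
        result.append(word)
    return result

def substring_matches(fragment):
    """Find words containing fragment anywhere: every word must be inspected."""
    return [word for word in words if fragment in word]   # O(n) scan

print(prefix_matches("sofa"))     # ['sofa', 'sofabed', 'sofawiki']
print(substring_matches("wiki"))  # ['sofawiki', 'wiki', 'wikipedia']
```

The prefix lookup can stop as soon as the sorted order rules out further matches, while the substring lookup has no such shortcut.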
The FTS5 module of SQLite provides an experimental trigram tokenizer, which supports partial-word searches. Unfortunately, FTS5 is rarely installed, so we cannot rely on it. For the implementation in SofaWiki, trigram search is therefore emulated on top of FTS3. The resulting performance is much better than that of the existing search, so we are releasing it.
Installation
Full text search is only possible if PHP is installed with SQLite and if that SQLite build supports FTS3. This seems to be the case with current cPanel installations, but not with MAMP.
To use it, you must set $swUseFulltext=true in configuration.php.
How does it work?
A page is only indexed once someone has viewed it. On public websites, crawlers will eventually trigger this for all pages, as long as the pages are linked.
index.php indexes every page it shows if
- the namespace is a searchable namespace ($swSearchNamespaces)
- the action is "view"
- the page does not have the content <!-- nofulltext -->
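The three conditions above can be read as a single predicate. The Python sketch below is a hedged illustration only: the actual check lives in index.php (PHP), and the names SEARCH_NAMESPACES and should_index, as well as the namespace values, are hypothetical stand-ins.

```python
# Hypothetical stand-in for $swSearchNamespaces; real values come from the site configuration.
SEARCH_NAMESPACES = {"main", "help"}
NO_FULLTEXT_MARKER = "<!-- nofulltext -->"

def should_index(namespace, action, content):
    """Return True only if all three indexing conditions hold."""
    return (
        namespace in SEARCH_NAMESPACES          # the namespace is searchable
        and action == "view"                    # the page was viewed
        and NO_FULLTEXT_MARKER not in content   # the page did not opt out
    )

print(should_index("main", "view", "<p>Hello world</p>"))                # True
print(should_index("main", "edit", "<p>Hello world</p>"))                # False
print(should_index("main", "view", "<!-- nofulltext --><p>Secret</p>"))  # False
```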
index.php captures the content ($swParsedContent) and the title ($swParsedTitle) for a given URL, language and revision. The system assumes that these pages are stable. If a page is dynamic, its index entry will be updated, but the search result may not reflect the latest version.
All HTML tags are removed from the content.
The content is then trigramized before it is added to the index: each character position is expanded to a trigram, and the trigrams are separated by spaces, so that the result is a long sequence of trigram words. The resulting text is about four times as long as the original text.
Example: "Hello world" becomes "Hel ell llo lo o w wo wor orl rld"
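A minimal Python sketch that reproduces the example above is shown here. It is not the PHP code of swTrigramize; the function name trigramize and the whitespace handling are assumptions chosen so that the output matches the documented example.

```python
def trigramize(text):
    """Turn text into a space-separated sequence of overlapping trigrams."""
    # One three-character window per starting position.
    windows = [text[i:i + 3] for i in range(len(text) - 2)]
    # Windows that contain a space fall apart into shorter tokens once
    # runs of whitespace are collapsed (assumption made to match the example).
    return " ".join(" ".join(windows).split())

print(trigramize("Hello world"))  # Hel ell llo lo o w wo wor orl rld
print(trigramize("world"))        # wor orl rld
```

Each character contributes one three-character window plus a separator, which is where the roughly fourfold growth of the indexed text comes from.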
On a full text search, the search terms are trigramized as well: "world" becomes "wor orl rld". In FTS3, a quoted sequence must match the words in exactly the same order, so the search looks for "wor" followed by "orl" followed by "rld".
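The phrase matching can be sketched in Python as follows. This only illustrates the idea of turning a search term into a quoted trigram phrase and checking for consecutive tokens; it is not the code of swTrigramizeQuery, the function names are hypothetical, and the real matching is done by FTS3 itself.

```python
def trigram_tokens(text):
    """Split text into trigram tokens (whitespace inside windows is dropped)."""
    windows = [text[i:i + 3] for i in range(len(text) - 2)]
    return " ".join(windows).split()

def phrase_query(term):
    """Wrap the trigram sequence in double quotes, the FTS3 phrase syntax."""
    return '"' + " ".join(trigram_tokens(term)) + '"'

def phrase_matches(document, term):
    """Emulate a phrase match: the term's trigrams must appear consecutively."""
    doc = trigram_tokens(document)
    phrase = trigram_tokens(term)
    n = len(phrase)
    if n == 0:
        return False  # terms shorter than three characters yield no trigram here
    return any(doc[i:i + n] == phrase for i in range(len(doc) - n + 1))

print(phrase_query("world"))                   # "wor orl rld"
print(phrase_matches("Hello world", "world"))  # True
print(phrase_matches("Hello world", "orl"))    # True (partial word)
print(phrase_matches("Hello world", "dlrow"))  # False
```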
The search results are then scored with the formula score = 1000*SUM(k/ln(offset)), where k is 1 for matches in the content and 10 for matches in the name.
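The effect of the formula can be tried out with a small Python sketch. It is an illustration under assumptions, not the code of swFulltextScore: the input format (a list of offset and in-name pairs) is hypothetical, and the sketch requires every offset to be greater than 1 because ln(1) is 0; how the actual helper treats smaller offsets is not covered here.

```python
import math

def score(matches):
    """Compute 1000 * SUM(k / ln(offset)).

    matches: list of (offset, in_name) pairs, one per matched term.
    k is 10 for a match in the name and 1 for a match in the content.
    """
    total = 0.0
    for offset, in_name in matches:
        if offset <= 1:
            # Assumption of this sketch: ln(offset) must be positive.
            raise ValueError("offset must be greater than 1")
        k = 10 if in_name else 1
        total += k / math.log(offset)
    return 1000 * total

early = score([(5, False)])    # match near the start of the content
late = score([(500, False)])   # match far into the content
named = score([(5, True)])     # same offset, but in the name
print(early > late)            # True
print(named > early)           # True
```

Matches close to the beginning of a page and matches in the name therefore rank higher.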
Usage
The full text search is exposed in the relation instruction fulltext
Example: fulltext "world"
Boolean logic is possible: use a space for AND and a pipe for OR
Example: fulltext "berlin paris|london" means (berlin AND paris) OR london
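The precedence in this example (AND binds more tightly than OR) can be sketched in Python as follows. The functions parse_query and query_matches are hypothetical and only illustrate how such a query string could be read; they are not part of SofaWiki.

```python
def parse_query(query):
    """Split a query into OR-alternatives, each a list of AND-terms."""
    groups = []
    for alternative in query.split("|"):   # a pipe separates OR alternatives
        terms = alternative.split()        # spaces separate AND terms
        if terms:
            groups.append(terms)
    return groups

def query_matches(query, document_words):
    """True if at least one alternative has all of its terms in the document."""
    return any(
        all(term in document_words for term in terms)
        for terms in parse_query(query)
    )

print(parse_query("berlin paris|london"))                # [['berlin', 'paris'], ['london']]
print(query_matches("berlin paris|london", {"london"}))  # True
print(query_matches("berlin paris|london", {"berlin"}))  # False
```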
fulltext returns a relation with the columns score, url, lang, revision, title, body
Full text search is the default for the search page, which uses the following relation code
fulltext "{{query}}"
update body = _lt."nowiki"._gt.body._lt."/nowiki"._gt
extend t = _leftsquare._leftsquare.url._pipe.title._rightsquare._rightsquare._lt."br"._gt.body
project t
label t ""
print linegrid 50
If you want to adapt the search for your site, define a relation in a template and set the $swCustomSearch variable to the name of the template.
Functions
swOpenFulltext()
Opens the database in site/indexes/fulltext.db
swClearFulltext()
Deletes the database in site/indexes/fulltext.db
swIndexFulltext($url,$lang,$revision,$name,$html)
Indexes one viewed page (called from index.php)
swQueryFulltext($query)
Executes a query and returns a relation with the columns
- score: the higher, the more the page is relevant for the search term
- url: urlname of the page, without language
- lang
- revision
- title: displayname of the page
- body: snippet of 360 characters of the search result
title and body have the search terms set in bold.
swQueryFulltextURL($query)
Faster version of swQueryFulltext returning only the url field
swFulltextScore($s)
SQLite helper function using the offsets of FTS3 to calculate the score of the page.
score = 1000*SUM(k/ln(offset)) where k is 1 for matches in the content and 10 for matches in the name
swFulltextByteOffsets($s)
SQLite helper function using the offsets of FTS3 to calculate the offsets of the search terms in the content.
swTrigramize($s)
Helper function to convert text to trigram sequence.
swTrigramizeQuery($s)
Helper function to convert a text query to trigram sequence query.
swDetrigramize($s)
Helper function to convert a trigram sequence back to text.
swFulltextSnippet($s,$os,$querylines)
Extracts a snippet from the content with a target length of 360 characters and sets the search terms in bold
Limitations
The language set is that of the current user. The index algorithm does not know whether a sublanguage page or the general page was called. Therefore, a search returns all language versions of a matching page. An alternative algorithm could return only the current language when more than one language version of a page is found (TBD).