Full Text Searching (or just _text search_) provides the capability to identify natural-language _documents_ that satisfy a _query_, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given _query terms_ and return them in order of their _similarity_ to the query. Notions of `query` and `similarity` are very flexible and depend on the specific application. The simplest search considers `query` as a set of words and `similarity` as the frequency of query words in the document.
Textual search operators have existed in databases for years. PostgreSQL has `~`, `~*`, `LIKE`, and `ILIKE` operators for textual data types, but they lack many essential properties required by modern information systems:
There is no linguistic support, even for English. Regular expressions are not sufficient because they cannot easily handle derived words, e.g., `satisfies` and `satisfy`. You might miss documents that contain `satisfies`, although you probably would like to find them when searching for `satisfy`. It is possible to use `OR` to search for multiple derived forms, but this is tedious and error-prone (some words can have several thousand derivatives).
They provide no ordering (ranking) of search results, which makes them ineffective when thousands of matching documents are found.
They tend to be slow because there is no index support, so they must process all documents for every search.
Full text indexing allows documents to be _preprocessed_ and an index saved for later rapid searching. Preprocessing includes:
Parsing documents into tokens. It is useful to identify various classes of tokens, e.g., numbers, words, complex words, email addresses, so that they can be processed differently. In principle token classes depend on the specific application, but for most purposes it is adequate to use a predefined set of classes. PostgreSQL uses a _parser_ to perform this step. A standard parser is provided, and custom parsers can be created for specific needs.
Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been _normalized_ so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as `s` or `es` in English). This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates _stop words_, which are words that are so common that they are useless for searching. (In short, then, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.) PostgreSQL uses _dictionaries_ to perform this step. Various standard dictionaries are provided, and custom ones can be created for specific needs.
Storing preprocessed documents optimized for searching. For example, each document can be represented as a sorted array of normalized lexemes. Along with the lexemes it is often desirable to store positional information to use for _proximity ranking_, so that a document that contains a more "dense" region of query words is assigned a higher rank than one with scattered query words.
Dictionaries allow fine-grained control over how tokens are normalized. With appropriate dictionaries, you can:
Define stop words that should not be indexed.
Map synonyms to a single word using Ispell.
Map phrases to a single word using a thesaurus.
Map different variations of a word to a canonical form using an Ispell dictionary.
Map different variations of a word to a canonical form using Snowball stemmer rules.
A data type `tsvector` is provided for storing preprocessed documents, along with a type `tsquery` for representing processed queries (Section 8.11). There are many functions and operators available for these data types (Section 9.13), the most important of which is the match operator `@@`, which we introduce in Section 12.1.2. Full text searches can be accelerated using indexes (Section 12.9).
A _document_ is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.
For searches within PostgreSQL, a document is normally a textual field within a row of a database table, or possibly a combination (concatenation) of such fields, perhaps stored in several tables or obtained dynamically. In other words, a document can be constructed from different parts for indexing and it might not be stored anywhere as a whole. For example:
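A minimal sketch of two such document-building queries follows. The tables `messages` and `docs` and all of their column names are hypothetical, chosen only to illustrate concatenating fields from one table and from two joined tables.

```sql
-- Hypothetical schema: messages(mid, title, author, abstract, body), docs(did, body).

-- Document assembled from several columns of a single row.
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;

-- Document assembled from columns stored in two different tables.
SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m
JOIN docs d ON m.mid = d.did
WHERE m.mid = 12;
```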
Actually, in these example queries, `coalesce` should be used to prevent a single `NULL` attribute from causing a `NULL` result for the whole document.
Another possibility is to store the documents as simple text files in the file system. In this case, the database can be used to store the full text index and to execute searches, and some unique identifier can be used to retrieve the document from the file system. However, retrieving files from outside the database requires superuser permissions or special function support, so this is usually less convenient than keeping all the data inside PostgreSQL. Also, keeping everything inside the database allows easy access to document metadata to assist in indexing and display.
For text search purposes, each document must be reduced to the preprocessed `tsvector` format. Searching and ranking are performed entirely on the `tsvector` representation of a document; the original text need only be retrieved when the document has been selected for display to a user. We therefore often speak of the `tsvector` as being the document, but of course it is only a compact representation of the full document.
Full text searching in PostgreSQL is based on the match operator `@@`, which returns `true` if a `tsvector` (document) matches a `tsquery` (query). It doesn't matter which data type is written first:
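As an illustration, the sketch below writes the operands in both orders. The sample sentence and query terms are assumptions picked for the demonstration; both sides are plain literals, so no normalization takes place.

```sql
-- tsvector on the left, tsquery on the right: both 'cat' and 'rat' are present.
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;  -- t

-- tsquery on the left, tsvector on the right: 'cow' is absent.
SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;  -- f
```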
As the above example suggests, a `tsquery` is not just raw text, any more than a `tsvector` is. A `tsquery` contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. (For syntax details see Section 8.11.2.) There are functions `to_tsquery`, `plainto_tsquery`, and `phraseto_tsquery` that are helpful in converting user-written text into a proper `tsquery`, primarily by normalizing words appearing in the text. Similarly, `to_tsvector` is used to parse and normalize a document string. So in practice a text search match would look more like this:
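A sketch of such a match is shown here. The sample text is an assumption, and the `english` configuration is named explicitly so the result does not depend on the session's default configuration.

```sql
-- Both sides are normalized: 'rats' in the document becomes the lexeme 'rat'.
SELECT to_tsvector('english', 'fat cats ate fat rats')
       @@ to_tsquery('english', 'fat & rat');  -- t
```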
Observe that this match would not succeed if the document string were cast directly to `tsvector` instead of being passed through `to_tsvector`, since then no normalization of the word `rats` will occur. The elements of a `tsvector` are lexemes, which are assumed already normalized, so `rats` does not match `rat`.
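The failing form can be sketched as follows, again with an assumed sample sentence. The cast stores the words verbatim, so the stored lexeme is `rats` while the query asks for `rat`.

```sql
-- The document is cast, not normalized, so it contains 'rats' rather than 'rat'.
SELECT 'fat cats ate fat rats'::tsvector
       @@ to_tsquery('english', 'fat & rat');  -- f
```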
The `@@` operator also supports `text` input, allowing explicit conversion of a text string to `tsvector` or `tsquery` to be skipped in simple cases. The variants available are:
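The sketch below exercises four operand combinations, listed in the order the following paragraph refers to them. The literal values are assumptions, and the two `text` forms rely on the session's default text search configuration, so no results are shown.

```sql
SELECT 'fat cat'::tsvector @@ 'cat'::tsquery;            -- tsvector @@ tsquery
SELECT 'cat'::tsquery @@ 'fat cat'::tsvector;            -- tsquery  @@ tsvector
SELECT 'fat cats'::text @@ to_tsquery('english', 'cat'); -- text     @@ tsquery
SELECT 'fat cats'::text @@ 'cats'::text;                 -- text     @@ text
```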
The first two of these we saw already. The form `text @@ tsquery` is equivalent to `to_tsvector(x) @@ y`. The form `text @@ text` is equivalent to `to_tsvector(x) @@ plainto_tsquery(y)`.
Within a `tsquery`, the `&` (AND) operator specifies that both its arguments must appear in the document to have a match. Similarly, the `|` (OR) operator specifies that at least one of its arguments must appear, while the `!` (NOT) operator specifies that its argument must _not_ appear in order to have a match. For example, the query `fat & ! rat` matches documents that contain `fat` but not `rat`.
Searching for phrases is possible with the help of the `<->` (FOLLOWED BY) `tsquery` operator, which matches only if its arguments have matches that are adjacent and in the given order. For example:
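A short sketch with assumed sample phrases: the first document has the two lexemes adjacent and in order, the second has them in the opposite order.

```sql
-- 'fatal' is immediately followed by 'error'.
SELECT to_tsvector('english', 'fatal error')
       @@ to_tsquery('english', 'fatal <-> error');  -- t

-- Both lexemes occur, but 'error' comes before 'fatal'.
SELECT to_tsvector('english', 'error is not fatal')
       @@ to_tsquery('english', 'fatal <-> error');  -- f
```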
There is a more general version of the FOLLOWED BY operator having the form `<N>`, where _N_ is an integer standing for the difference between the positions of the matching lexemes. `<1>` is the same as `<->`, while `<2>` allows exactly one other lexeme to appear between the matches, and so on. The `phraseto_tsquery` function makes use of this operator to construct a `tsquery` that can match a multi-word phrase when some of the words are stop words. For example:
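The following sketch uses assumed phrases with the `english` configuration. In the second call the stop word `the` between `ate` and `rats` is dropped, but its position is preserved as a distance of 2.

```sql
SELECT phraseto_tsquery('english', 'cats ate rats');
-- 'cat' <-> 'ate' <-> 'rat'

SELECT phraseto_tsquery('english', 'the cats ate the rats');
-- 'cat' <-> 'ate' <2> 'rat'
```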
A special case that's sometimes useful is that `<0>` can be used to require that two patterns match the same word.
Parentheses can be used to control nesting of the `tsquery` operators. Without parentheses, `|` binds least tightly, then `&`, then `<->`, and `!` most tightly.
It's worth noticing that the AND/OR/NOT operators mean something subtly different when they are within the arguments of a FOLLOWED BY operator than when they are not, because within FOLLOWED BY the exact position of the match is significant. For example, normally `!x` matches only documents that do not contain `x` anywhere. But `!x <-> y` matches `y` if it is not immediately after an `x`; an occurrence of `x` elsewhere in the document does not prevent a match. Another example is that `x & y` normally only requires that `x` and `y` both appear somewhere in the document, but `(x & y) <-> z` requires `x` and `y` to match at the same place, immediately before a `z`. Thus this query behaves differently from `x <-> z & y <-> z`, which will match a document containing two separate sequences `x z` and `y z`. (This specific query is useless as written, since `x` and `y` could not match at the same place; but with more complex situations such as prefix-match patterns, a query of this form could be useful.)
The above are all simple text search examples. As mentioned before, full text search functionality includes the ability to do many more things: skip indexing certain words (stop words), process synonyms, and use sophisticated parsing, e.g., parse based on more than just white space. This functionality is controlled by _text search configurations_. PostgreSQL comes with predefined configurations for many languages, and you can easily create your own configurations. (psql's `\dF` command shows all available configurations.)
During installation an appropriate configuration is selected and `default_text_search_config` is set accordingly in `postgresql.conf`. If you are using the same text search configuration for the entire cluster you can use the value in `postgresql.conf`. To use different configurations throughout the cluster but the same configuration within any one database, use `ALTER DATABASE ... SET`. Otherwise, you can set `default_text_search_config` in each session.
Each text search function that depends on a configuration has an optional `regconfig` argument, so that the configuration to use can be specified explicitly. `default_text_search_config` is used only when this argument is omitted.
To make it easier to build custom text search configurations, a configuration is built up from simpler database objects. PostgreSQL's text search facility provides four types of configuration-related database objects:
_Text search parsers_ break documents into tokens and classify each token (for example, as words or numbers).
_Text search dictionaries_ convert tokens to normalized form and reject stop words.
_Text search templates_ provide the functions underlying dictionaries. (A dictionary simply specifies a template and a set of parameters for the template.)
_Text search configurations_ select a parser and a set of dictionaries to use to normalize the tokens produced by the parser.
Text search parsers and templates are built from low-level C functions; therefore it requires C programming ability to develop new ones, and superuser privileges to install one into a database. (There are examples of add-on parsers and templates in the `contrib/` area of the PostgreSQL distribution.) Since dictionaries and configurations just parameterize and connect together some underlying parsers and templates, no special privilege is needed to create a new dictionary or configuration. Examples of creating custom dictionaries and configurations appear later in this chapter.
The examples in the previous section illustrated full text matching using simple constant strings. This section shows how to search table data, optionally using indexes.
It is possible to do a full text search without an index. A simple query to print the `title` of each row that contains the word `friend` in its `body` field is:
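A sketch of such a query, assuming the table is the `pgweb` table that later paragraphs refer to, with `title` and `body` text columns:

```sql
-- Sequential scan: every row's body is parsed and normalized at query time.
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
```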
This will also find related words such as `friends` and `friendly`, since all these are reduced to the same normalized lexeme.
The query above specifies that the `english` configuration is to be used to parse and normalize the strings. Alternatively we could omit the configuration parameters:
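The same search without explicit configuration arguments might look like this (again assuming a `pgweb` table with `title` and `body` columns):

```sql
-- Both functions fall back to the session's default configuration.
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
```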
This query will use the configuration set by `default_text_search_config`.
A more complex example is to select the ten most recent documents that contain `create` and `table` in the `title` or `body`:
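One way to write this is sketched below. The `pgweb` table and the timestamp column `last_mod_date` used for ordering are assumptions for illustration.

```sql
-- Title and body are concatenated into one document before normalization.
SELECT title
FROM pgweb
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC   -- hypothetical "last modified" column
LIMIT 10;
```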
For clarity we omitted the `coalesce` function calls which would be needed to find rows that contain `NULL` in one of the two fields.
Although these queries will work without an index, most applications will find this approach too slow, except perhaps for occasional ad-hoc searches. Practical use of text searching usually requires creating an index.
We can create a GIN index (Section 12.9) to speed up text searches:
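A minimal sketch, assuming the `pgweb` table with a `body` column; the index name is arbitrary.

```sql
-- Expression index over the normalized form of body, using an explicit configuration.
CREATE INDEX pgweb_body_idx ON pgweb
    USING GIN (to_tsvector('english', body));
```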
Notice that the 2-argument version of `to_tsvector` is used. Only text search functions that specify a configuration name can be used in expression indexes (Section 11.7). This is because the index contents must be unaffected by `default_text_search_config`. If they were affected, the index contents might be inconsistent because different entries could contain `tsvector`s that were created with different text search configurations, and there would be no way to guess which was which. It would be impossible to dump and restore such an index correctly.
Because the two-argument version of `to_tsvector` was used in the index above, only a query reference that uses the 2-argument version of `to_tsvector` with the same configuration name will use that index. That is, `WHERE to_tsvector('english', body) @@ 'a & b'` can use the index, but `WHERE to_tsvector(body) @@ 'a & b'` cannot. This ensures that an index will be used only with the same configuration used to create the index entries.
It is possible to set up more complex expression indexes wherein the configuration name is specified by another column, e.g.:
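A sketch of such an index follows; it assumes `pgweb` has a column `config_name` of type `regconfig` alongside `body`, and the index name is arbitrary.

```sql
-- The configuration is taken from each row, so entries may use different languages.
CREATE INDEX pgweb_config_idx ON pgweb
    USING GIN (to_tsvector(config_name, body));
```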
where `config_name` is a column in the `pgweb` table. This allows mixed configurations in the same index while recording which configuration was used for each index entry. This would be useful, for example, if the document collection contained documents in different languages. Again, queries that are meant to use the index must be phrased to match, e.g., `WHERE to_tsvector(config_name, body) @@ 'a & b'`.
Indexes can even concatenate columns:
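For instance, assuming `pgweb` has `title` and `body` text columns (index name arbitrary):

```sql
-- One index entry per row, built from both columns joined by a space.
CREATE INDEX pgweb_title_body_idx ON pgweb
    USING GIN (to_tsvector('english', title || ' ' || body));
```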
Another approach is to create a separate `tsvector` column to hold the output of `to_tsvector`. This example is a concatenation of `title` and `body`, using `coalesce` to ensure that one field will still be indexed when the other is `NULL`:
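A sketch of this setup, where the column name `textsearchable_index_col` is a hypothetical choice and `pgweb` is assumed to have `title` and `body` columns:

```sql
-- Add the storage column, then populate it for existing rows.
ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;

UPDATE pgweb
SET textsearchable_index_col =
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''));
```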
Then we create a GIN index to speed up the search:
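Assuming the `tsvector` column on `pgweb` is named `textsearchable_index_col` (a hypothetical name), the index could be created like this:

```sql
-- Plain column index; no expression or configuration name is involved.
CREATE INDEX pgweb_textsearch_idx ON pgweb
    USING GIN (textsearchable_index_col);
```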
Now we are ready to perform a fast full text search:
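A sketch of the search, assuming the hypothetical `textsearchable_index_col` column and a `last_mod_date` timestamp column on `pgweb`:

```sql
-- The stored tsvector is matched directly; to_tsvector is not called per row.
SELECT title
FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
```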
When using a separate column to store the `tsvector` representation, it is necessary to create a trigger to keep the `tsvector` column current anytime `title` or `body` changes. Section 12.4.3 explains how to do that.
One advantage of the separate-column approach over an expression index is that it is not necessary to explicitly specify the text search configuration in queries in order to make use of the index. As shown in the example above, the query can depend on `default_text_search_config`. Another advantage is that searches will be faster, since it will not be necessary to redo the `to_tsvector` calls to verify index matches. (This is more important when using a GiST index than a GIN index; see Section 12.9.) The expression-index approach is simpler to set up, however, and it requires less disk space since the `tsvector` representation is not stored explicitly.
This section describes additional functions and operators that are useful in connection with text search.
Section 12.3.1 showed how raw textual documents can be converted into `tsvector` values. PostgreSQL also provides functions and operators that can be used to manipulate documents that are already in `tsvector` form.
`tsvector || tsvector`
The `tsvector` concatenation operator returns a vector which combines the lexemes and positional information of the two vectors given as arguments. Positions and weight labels are retained during the concatenation. Positions appearing in the right-hand vector are offset by the largest position mentioned in the left-hand vector, so that the result is nearly equivalent to the result of performing `to_tsvector` on the concatenation of the two original document strings. (The equivalence is not exact, because any stop-words removed from the end of the left-hand argument will not affect the result, whereas they would have affected the positions of the lexemes in the right-hand argument if textual concatenation were used.)
One advantage of using concatenation in the vector form, rather than concatenating text before applying `to_tsvector`, is that you can use different configurations to parse different sections of the document. Also, because the `setweight` function marks all lexemes of the given vector the same way, it is necessary to parse the text and do `setweight` before concatenating if you want to label different parts of the document with different weights.
`setweight(vector tsvector, weight "char") returns tsvector`
`setweight` returns a copy of the input vector in which every position has been labeled with the given _weight_, either `A`, `B`, `C`, or `D`. (`D` is the default for new vectors and as such is not displayed on output.) These labels are retained when vectors are concatenated, allowing words from different parts of a document to be weighted differently by ranking functions.
Note that weight labels apply to _positions_, not _lexemes_. If the input vector has been stripped of positions then `setweight` does nothing.
`length(vector tsvector) returns integer`
Returns the number of lexemes stored in the vector.
`strip(vector tsvector) returns tsvector`
Returns a vector that lists the same lexemes as the given vector, but lacks any position or weight information. The result is usually much smaller than an unstripped vector, but it is also less useful. Relevance ranking does not work as well on stripped vectors as unstripped ones. Also, the `<->` (FOLLOWED BY) `tsquery` operator will never match stripped input, since it cannot determine the distance between lexeme occurrences.
A full list of `tsvector`-related functions is available in Table 9.41.
Section 12.3.2 showed how raw textual queries can be converted into `tsquery` values. PostgreSQL also provides functions and operators that can be used to manipulate queries that are already in `tsquery` form.
`tsquery && tsquery`
Returns the AND-combination of the two given queries.
`tsquery || tsquery`
Returns the OR-combination of the two given queries.
`!! tsquery`
Returns the negation (NOT) of the given query.
`tsquery <-> tsquery`
Returns a query that searches for a match to the first given query immediately followed by a match to the second given query, using the `<->` (FOLLOWED BY) `tsquery` operator. For example:
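A sketch with assumed operands follows. The exact textual form of the resulting query can differ between server versions, so no output is shown.

```sql
-- Combine two already-built queries so the second must directly follow the first.
SELECT to_tsquery('english', 'fat') <-> to_tsquery('english', 'cat | rat');
```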
`tsquery_phrase(query1 tsquery, query2 tsquery [, distance integer ]) returns tsquery`
Returns a query that searches for a match to the first given query followed by a match to the second given query at a distance of exactly _distance_ lexemes, using the `<N>` `tsquery` operator. For example:
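A sketch with assumed operands and an arbitrary distance of 10:

```sql
SELECT tsquery_phrase(to_tsquery('english', 'fat'),
                      to_tsquery('english', 'cat'),
                      10);
-- 'fat' <10> 'cat'
```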
`numnode(query tsquery) returns integer`
Returns the number of nodes (lexemes plus operators) in a `tsquery`. This function is useful to determine if the _query_ is meaningful (returns > 0), or contains only stop words (returns 0). Examples:
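Two sketches with assumed inputs. In the first, both words are English stop words, so the query is empty and the server also emits a NOTICE saying so.

```sql
-- Only stop words: nothing survives normalization.
SELECT numnode(plainto_tsquery('english', 'the any'));  -- 0

-- Two lexemes plus one AND operator.
SELECT numnode('foo & bar'::tsquery);                   -- 3
```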
`querytree(query tsquery) returns text`
Returns the portion of a `tsquery` that can be used for searching an index. This function is useful for detecting unindexable queries, for example those containing only stop words or only negated terms. For example:
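A sketch contrasting an indexable query with a purely negated one. The input word is an assumption, and the text returned for the unindexable case has varied across releases, so outputs are omitted.

```sql
-- A positive term: the indexable portion is the normalized lexeme itself.
SELECT querytree(to_tsquery('english', 'defined'));

-- Only a negated term: there is nothing an index can be searched for.
SELECT querytree(to_tsquery('english', '!defined'));
```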
The `ts_rewrite` family of functions search a given `tsquery` for occurrences of a target subquery, and replace each occurrence with a substitute subquery. In essence this operation is a `tsquery`-specific version of substring replacement. A target and substitute combination can be thought of as a _query rewrite rule_. A collection of such rewrite rules can be a powerful search aid. For example, you can expand the search using synonyms (e.g., `new york`, `big apple`, `nyc`, `gotham`) or narrow the search to direct the user to some hot topic. There is some overlap in functionality between this feature and thesaurus dictionaries (Section 12.6.4). However, you can modify a set of rewrite rules on-the-fly without reindexing, whereas updating a thesaurus requires reindexing to be effective.
`ts_rewrite (query tsquery, target tsquery, substitute tsquery) returns tsquery`
This form of `ts_rewrite` simply applies a single rewrite rule: _target_ is replaced by _substitute_ wherever it appears in _query_. For example:
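A sketch with placeholder lexemes `a`, `b`, and `c`:

```sql
-- Replace the subquery 'a' with 'c' inside 'a & b'.
SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
-- 'b' & 'c'
```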
`ts_rewrite (query tsquery, select text) returns tsquery`
This form of `ts_rewrite` accepts a starting _query_ and a SQL _select_ command, which is given as a text string. The _select_ must yield two columns of `tsquery` type. For each row of the _select_ result, occurrences of the first column value (the target) are replaced by the second column value (the substitute) within the current _query_ value. For example:
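A sketch of table-driven rewriting follows. The rule table `aliases` and its columns `t` (target) and `s` (substitute) are hypothetical names.

```sql
-- Hypothetical rule table: one row per rewrite rule.
CREATE TABLE IF NOT EXISTS aliases (t tsquery PRIMARY KEY, s tsquery);
INSERT INTO aliases VALUES ('a', 'c');

-- Each (t, s) row returned by the inner SELECT is applied to the query.
SELECT ts_rewrite('a & b'::tsquery, 'SELECT t, s FROM aliases');
```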
Note that when multiple rewrite rules are applied in this way, the order of application can be important; so in practice you will want the source query to `ORDER BY` some ordering key.
Let's consider a real-life astronomical example. We'll expand query `supernovae` using table-driven rewriting rules:
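A sketch under the assumption of a hypothetical rule table `aliases(t, s)`; the abbreviation `sn` and the extra search term `crab` are illustrative choices.

```sql
CREATE TABLE IF NOT EXISTS aliases (t tsquery PRIMARY KEY, s tsquery);

-- Rule: wherever 'supernovae' appears, also accept 'sn'.
INSERT INTO aliases
VALUES (to_tsquery('english', 'supernovae'),
        to_tsquery('english', 'supernovae | sn'));

SELECT ts_rewrite(to_tsquery('english', 'supernovae & crab'),
                  'SELECT t, s FROM aliases');
```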
We can change the rewriting rules just by updating the table:
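Continuing with a hypothetical `aliases(t, s)` rule table, an update might look like the sketch below; the added exclusion of `nebulae` is only an illustration.

```sql
-- Tighten the existing rule; no reindexing of documents is required.
UPDATE aliases
SET s = to_tsquery('english', 'supernovae | sn & !nebulae')
WHERE t = to_tsquery('english', 'supernovae');

SELECT ts_rewrite(to_tsquery('english', 'supernovae & crab'),
                  'SELECT t, s FROM aliases');
```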
Rewriting can be slow when there are many rewriting rules, since it checks every rule for a possible match. To filter out obvious non-candidate rules we can use the containment operators for the `tsquery` type. In the example below, we select only those rules which might match the original query:
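A sketch of such a filter, assuming a hypothetical rule table `aliases(t, s)` and placeholder lexemes. The `@>` operator keeps only rules whose target is contained in the original query.

```sql
-- Quotes are doubled because the SELECT is passed as a string literal.
SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t, s FROM aliases WHERE ''a & b''::tsquery @> t');
```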
When using a separate column to store the `tsvector` representation of your documents, it is necessary to create a trigger to update the `tsvector` column when the document content columns change. Two built-in trigger functions are available for this, or you can write your own.
These trigger functions automatically compute a `tsvector` column from one or more textual columns, under the control of parameters specified in the `CREATE TRIGGER` command. An example of their use is:
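The sketch below uses a hypothetical table `messages` with the `title`, `body`, and `tsv` columns that the following paragraphs mention; the trigger name and sample row are also assumptions.

```sql
CREATE TABLE messages (
    title text,
    body  text,
    tsv   tsvector
);

-- Arguments: target tsvector column, schema-qualified configuration, source columns.
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
    ON messages FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);

INSERT INTO messages VALUES ('title here', 'the body text is here');

-- tsv was filled in by the trigger, so it can be searched immediately.
SELECT title, body FROM messages WHERE tsv @@ to_tsquery('english', 'title & body');
```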
Having created this trigger, any change in `title` or `body` will automatically be reflected into `tsv`, without the application having to worry about it.
The first trigger argument must be the name of the `tsvector` column to be updated. The second argument specifies the text search configuration to be used to perform the conversion. For `tsvector_update_trigger`, the configuration name is simply given as the second trigger argument. It must be schema-qualified as shown above, so that the trigger behavior will not change with changes in `search_path`. For `tsvector_update_trigger_column`, the second trigger argument is the name of another table column, which must be of type `regconfig`. This allows a per-row selection of configuration to be made. The remaining argument(s) are the names of textual columns (of type `text`, `varchar`, or `char`). These will be included in the document in the order given. NULL values will be skipped (but the other columns will still be indexed).
A limitation of these built-in triggers is that they treat all the input columns alike. To process columns differently (for example, to weight title differently from body) it is necessary to write a custom trigger. Here is an example using PL/pgSQL as the trigger language:
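A sketch of such a trigger, assuming a hypothetical table `messages` with `title` and `body` text columns and a `tsv` column of type `tsvector`. The function and trigger names are arbitrary, and this trigger is meant to be used instead of a built-in one on the same column.

```sql
CREATE FUNCTION messages_weighted_tsv() RETURNS trigger AS $$
BEGIN
    -- Title lexemes get weight A, body lexemes weight D.
    NEW.tsv :=
        setweight(to_tsvector('pg_catalog.english', coalesce(NEW.title, '')), 'A') ||
        setweight(to_tsvector('pg_catalog.english', coalesce(NEW.body, '')), 'D');
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate_weighted BEFORE INSERT OR UPDATE
    ON messages FOR EACH ROW EXECUTE PROCEDURE messages_weighted_tsv();
```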
Keep in mind that it is important to specify the configuration name explicitly when creating `tsvector` values inside triggers, so that the column's contents will not be affected by changes to `default_text_search_config`. Failure to do this is likely to lead to problems such as search results changing after a dump and reload.
The function `ts_stat` is useful for checking your configuration and for finding stop-word candidates.
_sqlquery_ is a text value containing an SQL query which must return a single `tsvector` column. `ts_stat` executes the query and returns statistics about each distinct lexeme (word) contained in the `tsvector` data. The columns returned are
`word` `text`: the value of a lexeme
`ndoc` `integer`: number of documents (`tsvector`s) the word occurred in
`nentry` `integer`: total number of occurrences of the word
If _weights_ is supplied, only occurrences having one of those weights are counted.
For example, to find the ten most frequent words in a document collection:
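A sketch, assuming a hypothetical table `apod` whose column `vector` holds the documents' `tsvector` values:

```sql
-- The inner query is passed as text and must return a single tsvector column.
SELECT *
FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
```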
The same, but counting only word occurrences with weight `A` or `B`:
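A sketch using the same hypothetical `apod` table with a `vector` column; the second argument lists the weights to count.

```sql
-- Only positions labeled A or B contribute to ndoc and nentry.
SELECT *
FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
```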
To implement full text searching there must be a function to create a `tsvector` from a document and a `tsquery` from a user query. Also, we need to return results in a useful order, so we need a function that compares documents with respect to their relevance to the query. It's also important to be able to display the results nicely. PostgreSQL provides support for all of these functions.
PostgreSQL provides the function `to_tsvector` for converting a document to the `tsvector` data type.
`to_tsvector` parses a textual document into tokens, reduces the tokens to lexemes, and returns a `tsvector` which lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration. Here is a simple example:
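The sketch below uses a sample sentence chosen to contain the stop words, the plural `rats`, and the `-` sign that the following paragraphs discuss.

```sql
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
-- 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
```

Stop words still consume positions, which is why `fat` appears at positions 2 and 11 rather than 1 and 6.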
In the example above we see that the resulting `tsvector` does not contain the words `a`, `on`, or `it`, the word `rats` became `rat`, and the punctuation sign `-` was ignored.
The `to_tsvector` function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries (Section 12.6) is consulted, where the list can vary depending on the token type. The first dictionary that _recognizes_ the token emits one or more normalized _lexemes_ to represent the token. For example, `rats` became `rat` because one of the dictionaries recognized that the word `rats` is a plural form of `rat`. Some words are recognized as _stop words_ (Section 12.6.1), which causes them to be ignored since they occur too frequently to be useful in searching. In our example these are `a`, `on`, and `it`. If no dictionary in the list recognizes the token then it is also ignored. In this example that happened to the punctuation sign `-` because there are in fact no dictionaries assigned for its token type (`Space symbols`), meaning space tokens will never be indexed. The choices of parser, dictionaries and which types of tokens to index are determined by the selected text search configuration (Section 12.7). It is possible to have many different configurations in the same database, and predefined configurations are available for various languages. In our example we used the default configuration `english` for the English language.
The function `setweight` can be used to label the entries of a `tsvector` with a given _weight_, where a weight is one of the letters `A`, `B`, `C`, or `D`. This is typically used to mark entries coming from different parts of a document, such as title versus body. Later, this information can be used for ranking of search results.
Because `to_tsvector(NULL)` will return `NULL`, it is recommended to use `coalesce` whenever a field might be null. Here is the recommended method for creating a `tsvector` from a structured document:
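A sketch of the pattern follows. The table `tt`, its `tsvector` column `ti`, and the four text columns are hypothetical names for the parts of a structured document.

```sql
-- Each part is normalized separately, labeled with its own weight, then concatenated.
UPDATE tt SET ti =
    setweight(to_tsvector('english', coalesce(title, '')),    'A') ||
    setweight(to_tsvector('english', coalesce(keyword, '')),  'B') ||
    setweight(to_tsvector('english', coalesce(abstract, '')), 'C') ||
    setweight(to_tsvector('english', coalesce(body, '')),     'D');
```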
Here we have used `setweight` to label the source of each lexeme in the finished `tsvector`, and then merged the labeled `tsvector` values using the `tsvector` concatenation operator `||`. (Section 12.4.1 gives details about these operations.)
PostgreSQL provides the functions `to_tsquery`, `plainto_tsquery`, and `phraseto_tsquery` for converting a query to the `tsquery` data type. `to_tsquery` offers access to more features than either `plainto_tsquery` or `phraseto_tsquery`, but it is less forgiving about its input.
`to_tsquery` creates a `tsquery` value from _querytext_, which must consist of single tokens separated by the `tsquery` operators `&` (AND), `|` (OR), `!` (NOT), and `<->` (FOLLOWED BY), possibly grouped using parentheses. In other words, the input to `to_tsquery` must already follow the general rules for `tsquery` input, as described in Section 8.11.2. The difference is that while basic `tsquery` input takes the tokens at face value, `to_tsquery` normalizes each token into a lexeme using the specified or default configuration, and discards any tokens that are stop words according to the configuration. For example:
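A sketch with an assumed input string: `The` is a stop word and is dropped, while `Fat` and `Rats` are lower-cased and stemmed.

```sql
SELECT to_tsquery('english', 'The & Fat & Rats');
-- 'fat' & 'rat'
```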
As in basic `tsquery` input, weight(s) can be attached to each lexeme to restrict it to match only `tsvector` lexemes of those weight(s). For example:
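A sketch with assumed terms, restricting the second lexeme to positions labeled `A` or `B`:

```sql
SELECT to_tsquery('english', 'Fat | Rats:AB');
-- 'fat' | 'rat':AB
```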
Also, `*` can be attached to a lexeme to specify prefix matching:
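A sketch with assumed prefixes, combining prefix matching with weight restrictions:

```sql
SELECT to_tsquery('english', 'supern:*A & star:A*B');
-- 'supern':*A & 'star':*AB
```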
Such a lexeme will match any word in a `tsvector` that begins with the given string.
`to_tsquery` can also accept single-quoted phrases. This is primarily useful when the configuration includes a thesaurus dictionary that may trigger on such phrases. In the example below, a thesaurus contains the rule `supernovae stars : sn`:
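A sketch of a query containing a quoted phrase. It assumes the default configuration has been set up with a thesaurus dictionary carrying the rule above, which is not the case in a stock installation; the extra term `crab` is illustrative, and no output is shown because it depends on that setup.

```sql
-- Inner single quotes are doubled inside the SQL string literal.
SELECT to_tsquery('''supernovae stars'' & !crab');
```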
Without quotes, `to_tsquery` will generate a syntax error for tokens that are not separated by an AND, OR, or FOLLOWED BY operator.
`plainto_tsquery` transforms the unformatted text _querytext_ to a `tsquery` value. The text is parsed and normalized much as for `to_tsvector`, then the `&` (AND) `tsquery` operator is inserted between surviving words.
Example:
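A sketch with an assumed input phrase:

```sql
SELECT plainto_tsquery('english', 'The Fat Rats');
-- 'fat' & 'rat'
```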
Note that `plainto_tsquery` will not recognize `tsquery` operators, weight labels, or prefix-match labels in its input:
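A sketch with an assumed input that contains an operator and a weight label; both are treated as ordinary text, so `C` survives as a lexeme of its own.

```sql
SELECT plainto_tsquery('english', 'The Fat & Rats:C');
-- 'fat' & 'rat' & 'c'
```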
Here, all the input punctuation was discarded as being space symbols.
`phraseto_tsquery` behaves much like `plainto_tsquery`, except that it inserts the `<->` (FOLLOWED BY) operator between surviving words instead of the `&` (AND) operator. Also, stop words are not simply discarded, but are accounted for by inserting `<N>` operators rather than `<->` operators. This function is useful when searching for exact lexeme sequences, since the FOLLOWED BY operators check lexeme order, not just the presence of all the lexemes.
Example:
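A sketch with an assumed input phrase:

```sql
SELECT phraseto_tsquery('english', 'The Fat Rats');
-- 'fat' <-> 'rat'
```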
Like `plainto_tsquery`, the `phraseto_tsquery` function will not recognize `tsquery` operators, weight labels, or prefix-match labels in its input:
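A sketch with an assumed input containing an operator and a weight label, both of which are read as plain text:

```sql
SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
-- 'fat' <-> 'rat' <-> 'c'
```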
Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document where they occur is. However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.
The two ranking functions currently available are:
`ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4`
Ranks vectors based on the frequency of their matching lexemes.
`ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4`
This function computes the _cover density_ ranking for the given document vector and query, as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries" in the journal "Information Processing and Management", 1999. Cover density is similar to `ts_rank` ranking except that the proximity of matching lexemes to each other is taken into consideration.
This function requires lexeme positional information to perform its calculation. Therefore, it ignores any "stripped" lexemes in the `tsvector`. If there are no unstripped lexemes in the input, the result will be zero. (See Section 12.4.1 for more information about the `strip` function and positional information in `tsvector`s.)
For both these functions, the optional _weights_ argument offers the ability to weigh word instances more or less heavily depending on how they are labeled. The weight arrays specify how heavily to weigh each category of word, in the order:
{D-weight, C-weight, B-weight, A-weight}
If no _weights_ are provided, then these defaults are used:
{0.1, 0.2, 0.4, 1.0}
Typically weights are used to mark words from special areas of the document, like the title or an initial abstract, so they can be treated with more or less importance than words in the document body.
Since a longer document has a greater chance of containing a query term, it is reasonable to take document size into account; e.g., a hundred-word document with five instances of a search word is probably more relevant than a thousand-word document with five instances. Both ranking functions take an integer _normalization_ option that specifies whether and how a document's length should impact its rank. The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using | (for example, 2|4).
0 (the default) ignores the document length
1 divides the rank by 1 + the logarithm of the document length
2 divides the rank by the document length
4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
8 divides the rank by the number of unique words in document
16 divides the rank by 1 + the logarithm of the number of unique words in document
32 divides the rank by itself + 1
If more than one flag bit is specified, the transformations are applied in the order listed.
It is important to note that the ranking functions do not use any global information, so it is impossible to produce a fair normalization to 1% or 100% as sometimes desired. Normalization option 32 (rank/(rank+1)) can be applied to scale all ranks into the range zero to one, but of course this is just a cosmetic change; it will not affect the ordering of the search results.
Here is an example that selects only the ten highest-ranked matches:
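A sketch of such a query, assuming a table apod with a text column title and a tsvector column textsearch (these names are illustrative and not defined in this section):

```sql
-- Rank every matching row with cover density and keep the ten best.
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
```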
This is the same example using normalized ranking:
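Under the same assumptions (a hypothetical apod table with title and textsearch columns), normalization option 32 can be passed as the last argument:

```sql
-- 32 maps each rank to rank/(rank+1); the ordering of results is unchanged.
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
```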
Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.
To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL provides a function ts_headline that implements this functionality.
ts_headline accepts a document along with a query, and returns an excerpt from the document in which terms from the query are highlighted. The configuration to be used to parse the document can be specified by _config_; if _config_ is omitted, the default_text_search_config configuration is used.
If an _options_ string is specified it must consist of a comma-separated list of one or more _option=value_ pairs. The available options are:
StartSel, StopSel: the strings with which to delimit query words appearing in the document, to distinguish them from other excerpted words. You must double-quote these strings if they contain spaces or commas.
MaxWords, MinWords: these numbers determine the longest and shortest headlines to output.
ShortWord: words of this length or less will be dropped at the start and end of a headline. The default value of three eliminates common English articles.
HighlightAll: Boolean flag; if true the whole document will be used as the headline, ignoring the preceding three parameters.
MaxFragments: maximum number of text excerpts or fragments to display. The default value of zero selects a non-fragment-oriented headline generation method. A value greater than zero selects fragment-based headline generation. This method finds text fragments with as many query words as possible and stretches those fragments around the query words. As a result query words are close to the middle of each fragment and have words on each side. Each fragment will be of at most MaxWords, and words of length ShortWord or less are dropped at the start and end of each fragment. If not all query words are found in the document, then a single fragment of the first MinWords in the document will be displayed.
FragmentDelimiter: when more than one fragment is displayed, the fragments will be separated by this string.
Any unspecified options receive these defaults:
For example:
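The calls below are a sketch using the built-in english configuration; the sample document text is arbitrary. The first call relies on the default options, the second overrides the delimiters.

```sql
-- Default options: matched terms are wrapped in the default StartSel/StopSel.
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('english', 'query & similarity'));

-- Custom delimiters supplied through the options string.
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('english', 'query & similarity'),
  'StartSel = <, StopSel = >');
```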
ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care.
A text search configuration specifies all options necessary to transform a document into a tsvector: the parser to use to break text into tokens, and the dictionaries to use to transform each token into a lexeme. Every call of to_tsvector or to_tsquery needs a text search configuration to perform its processing. The configuration parameter default_text_search_config specifies the name of the default configuration, which is the one used by text search functions if an explicit configuration parameter is omitted. It can be set in postgresql.conf, or set for an individual session using the SET command.
Several predefined text search configurations are available, and you can create custom configurations easily. To facilitate management of text search objects, a set of SQL commands is available, and there are several psql commands that display information about text search objects.
As an example we will create a configuration pg, starting by duplicating the built-in english configuration:
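A minimal sketch of this step, placing the new configuration in the public schema:

```sql
-- Copy the parser and all token mappings from the built-in configuration.
CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );
```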
We will use a PostgreSQL-specific synonym list and store it in $SHAREDIR/tsearch_data/pg_dict.syn. The file contents look like:
We define the synonym dictionary like this:
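One way this might look, assuming pg_dict.syn holds one "word synonym" pair per line (the sample entries in the comment are illustrative):

```sql
-- Illustrative contents of $SHAREDIR/tsearch_data/pg_dict.syn:
--   postgres    pg
--   pgsql       pg
--   postgresql  pg
CREATE TEXT SEARCH DICTIONARY pg_dict (
    TEMPLATE = synonym,
    SYNONYMS = pg_dict   -- base name of the .syn file
);
```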
Next we register the Ispell dictionary english_ispell, which has its own configuration files:
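A sketch of the registration, assuming english.dict, english.affix, and english.stop are present in $SHAREDIR/tsearch_data (the Ispell files are not shipped with PostgreSQL):

```sql
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,   -- english.dict
    AffFile = english,    -- english.affix
    StopWords = english   -- english.stop
);
```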
Now we can set up the mappings for words in configuration pg:
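A possible mapping, assuming the pg_dict and english_ispell dictionaries from the previous steps exist; dictionaries are listed from most specific to most general:

```sql
ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH pg_dict, english_ispell, english_stem;
```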
We choose not to index or search some token types that the built-in configuration does handle:
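For instance, the following sketch drops the mappings for a handful of token types; which types to drop is an application decision, and the selection here is only an example:

```sql
-- Tokens of these types will no longer be indexed or searched.
ALTER TEXT SEARCH CONFIGURATION pg
    DROP MAPPING FOR email, url, url_path, sfloat, float;
```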
Now we can test our configuration:
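A quick check might use ts_debug with the new configuration; the sample sentence is arbitrary:

```sql
-- Shows, per token, which dictionary recognized it and the resulting lexemes.
SELECT * FROM ts_debug('public.pg',
    'PostgreSQL, the highly scalable, SQL compliant, open source database');
```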
The next step is to set the session to use the new configuration, which was created in the public schema:
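A minimal sketch for the current session:

```sql
SET default_text_search_config = 'public.pg';

-- Confirm the setting took effect.
SHOW default_text_search_config;
```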
Text search parsers are responsible for splitting raw document text into _tokens_ and identifying each token's type, where the set of possible types is defined by the parser itself. Note that a parser does not modify the text at all; it simply identifies plausible word boundaries. Because of this limited scope, there is less need for application-specific custom parsers than there is for custom dictionaries. At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide range of applications.
The built-in parser is named pg_catalog.default. It recognizes 23 token types, shown in Table 12.1.
Table 12.1. Default Parser's Token Types
Alias | Description | Example |
---|
The parser's notion of a “letter” is determined by the database's locale setting, specifically lc_ctype. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them. In most European languages, token types word and asciiword should be treated alike.
email does not support all valid email characters as defined by RFC 5322. Specifically, the only non-alphanumeric characters supported for email user names are period, dash, and underscore.
It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component:
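A sketch that makes the overlap visible, using the default configuration and an arbitrary hyphenated input:

```sql
-- The result lists the whole compound as one token and each part as its own token.
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
```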
This behavior is desirable since it allows searches to work for both the whole compound word and for components. Here is another instructive example:
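A URL is a natural candidate for such an example; the address below is a placeholder:

```sql
-- Protocol, whole URL, host, and path are reported as separate, overlapping tokens.
SELECT alias, description, token
FROM ts_debug('http://example.com/stuff/index.html');
```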
Information about text search configuration objects can be obtained in psql using a set of commands:
An optional + produces more details.
The optional parameter PATTERN can be the name of a text search object, optionally schema-qualified. If PATTERN is omitted then information about all visible objects will be displayed. PATTERN can be a regular expression and can provide _separate_ patterns for the schema and object names. The following examples illustrate this:
The available commands are:
\dF[+] [PATTERN]
List text search configurations (add + for more detail).
\dFd[+] [PATTERN]
List text search dictionaries (add + for more detail).
\dFp[+] [PATTERN]
List text search parsers (add + for more detail).
\dFt[+] [PATTERN]
List text search templates (add + for more detail).
The behavior of a custom text search configuration can easily become confusing. The functions described in this section are useful for testing text search objects. You can test a complete configuration, or test parsers and dictionaries separately.
The function ts_debug allows easy testing of a text search configuration.
ts_debug displays information about every token of _document_ as produced by the parser and processed by the configured dictionaries. It uses the configuration specified by _config_, or default_text_search_config if that argument is omitted.
ts_debug returns one row for each token identified in the text by the parser. The columns returned are:
alias text: short name of the token type
description text: description of the token type
token text: text of the token
dictionaries regdictionary[]: the dictionaries selected by the configuration for this token type
dictionary regdictionary: the dictionary that recognized the token, or NULL if none did
lexemes text[]: the lexeme(s) produced by the dictionary that recognized the token, or NULL if none did; an empty array ({}) means it was recognized as a stop word
Here is a simple example:
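A sketch with the built-in english configuration and an arbitrary sentence:

```sql
-- Stop words such as "a" and "on" appear with an empty lexemes array ({}).
SELECT * FROM ts_debug('english', 'a fat  cat sat on a mat - it ate a fat rats');
```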
For a more extensive demonstration, we first create a public.english configuration and Ispell dictionary for the English language:
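One way to set this up, assuming English Ispell files (english.dict, english.affix, english.stop) have been installed under $SHAREDIR/tsearch_data:

```sql
CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);

-- Try Ispell first, then fall back to the Snowball stemmer.
ALTER TEXT SEARCH CONFIGURATION public.english
    ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;

SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes');
```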
In this example, the word Brightest was recognized by the parser as an ASCII word (alias asciiword). For this token type the dictionary list is english_ispell and english_stem. The word was recognized by english_ispell, which reduced it to the noun bright. The word supernovaes is unknown to the english_ispell dictionary so it was passed to the next dictionary, and, fortunately, was recognized (in fact, english_stem is a Snowball dictionary which recognizes everything; that is why it was placed at the end of the dictionary list).
You can reduce the width of the output by explicitly specifying which columns you want to see:
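For example, assuming the public.english configuration sketched for the demonstration above exists:

```sql
-- Select only the columns of interest instead of using SELECT *.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('public.english', 'The Brightest supernovaes');
```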
The following functions allow direct testing of a text search parser.
ts_parse parses the given _document_ and returns a series of records, one for each token produced by parsing. Each record includes a tokid showing the assigned token type and a token which is the text of the token. For example:
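A minimal call against the built-in parser, with an arbitrary input string:

```sql
-- 'default' names the built-in parser; each output row is (tokid, token).
SELECT * FROM ts_parse('default', '123 - a number');
```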
ts_token_type returns a table which describes each type of token the specified parser can recognize. For each token type, the table gives the integer tokid that the parser uses to label a token of that type, the alias that names the token type in configuration commands, and a short description. For example:
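A minimal call for the built-in parser:

```sql
-- Returns one row per token type: tokid, alias, description.
SELECT * FROM ts_token_type('default');
```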
The ts_lexize function facilitates dictionary testing.
ts_lexize returns an array of lexemes if the input _token_ is known to the dictionary, or an empty array if the token is known to the dictionary but it is a stop word, or NULL if it is an unknown word.
Examples:
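Two sketch calls against the built-in english_stem dictionary, one for an ordinary word and one for a stop word:

```sql
SELECT ts_lexize('english_stem', 'stars');
-- expected result: {star}

SELECT ts_lexize('english_stem', 'a');
-- expected result: {} (recognized as a stop word)
```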
The ts_lexize function expects a single _token_, not text. Here is a case where this can be confusing:
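A sketch of the confusing case, assuming a thesaurus dictionary named thesaurus_astro that contains the phrase "supernovae stars" has already been created:

```sql
-- The two-word string is treated as one token, so no thesaurus entry matches
-- and the comparison is expected to be true.
SELECT ts_lexize('thesaurus_astro', 'supernovae stars') IS NULL;
```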
The thesaurus dictionary thesaurus_astro does know the phrase supernovae stars, but ts_lexize fails since it does not parse the input text but treats it as a single token. Use plainto_tsquery or to_tsvector to test thesaurus dictionaries, for example:
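A sketch, assuming the session's default text search configuration maps word tokens to thesaurus_astro:

```sql
-- plainto_tsquery parses the text into tokens first, so the phrase can be matched.
SELECT plainto_tsquery('supernovae stars');
```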
The word The was recognized by the english_ispell dictionary as a stop word and will not be indexed. The spaces are discarded too, since the configuration provides no dictionaries at all for them.
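| Alias | Description | Example |
|---|---|---|
| asciiword | Word, all ASCII letters | elephant |
| word | Word, all letters | mañana |
| numword | Word, letters and digits | beta1 |
| asciihword | Hyphenated word, all ASCII | up-to-date |
| hword | Hyphenated word, all letters | lógico-matemática |
| numhword | Hyphenated word, letters and digits | postgresql-beta1 |
| hword_asciipart | Hyphenated word part, all ASCII | postgresql in the context postgresql-beta1 |
| hword_part | Hyphenated word part, all letters | lógico or matemática in the context lógico-matemática |
| hword_numpart | Hyphenated word part, letters and digits | beta1 in the context postgresql-beta1 |
| email | Email address | foo@example.com |
| protocol | Protocol head | http:// |
| url | URL | example.com/stuff/index.html |
| host | Host | example.com |
| url_path | URL path | /stuff/index.html, in the context of a URL |
| file | File or path name | /usr/local/foo.txt, if not within a URL |
| sfloat | Scientific notation | -1.234e56 |
| float | Decimal notation | -1.234 |
| int | Signed integer | -1234 |
| uint | Unsigned integer | 1234 |
| version | Version number | 8.3.0 |
| tag | XML tag | <a href="dictionaries.html"> |
| entity | XML entity | &amp; |
| blank | Space symbols | (any whitespace or punctuation not otherwise recognized) |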
The current limitations of PostgreSQL's text search features are:
The length of each lexeme must be less than 2K bytes
The length of a tsvector (lexemes + positions) must be less than 1 megabyte
The number of lexemes must be less than 2^64
Position values in tsvector must be greater than 0 and no more than 16,383
The match distance in a <N> (FOLLOWED BY) tsquery operator cannot be more than 16,384
No more than 256 positions per lexeme
The number of nodes (lexemes + operators) in a tsquery must be less than 32,768
For comparison, the PostgreSQL 8.1 documentation contained 10,441 unique words, a total of 335,420 words, and the most frequent word “postgresql” was mentioned 6,127 times in 655 documents.
Another example: the PostgreSQL mailing list archives contained 910,989 unique words with 57,491,343 lexemes in 461,020 messages.
There are two kinds of indexes that can be used to speed up full text searches. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable.
CREATE INDEX name ON table USING GIN (column);
Creates a GIN (Generalized Inverted Index)-based index. The _column_ must be of tsvector type.
CREATE INDEX name ON table USING GIST (column);
Creates a GiST (Generalized Search Tree)-based index. The _column_ can be of tsvector or tsquery type.
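As a concrete sketch of both forms, the following uses a hypothetical table docs with a tsvector column body_tsv; all object names here are illustrative only.

```sql
-- Hypothetical table: docs, body, and body_tsv are made-up names.
CREATE TABLE docs (
    id       serial PRIMARY KEY,
    body     text,
    body_tsv tsvector
);

-- Populate the tsvector column from the text column.
UPDATE docs SET body_tsv = to_tsvector('english', coalesce(body, ''));

-- GIN index (the preferred type for text search).
CREATE INDEX docs_body_tsv_gin ON docs USING GIN (body_tsv);

-- GiST alternative on the same column.
CREATE INDEX docs_body_tsv_gist ON docs USING GIST (body_tsv);

-- A query that either index can serve.
SELECT id FROM docs WHERE body_tsv @@ to_tsquery('english', 'index & search');
```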
GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows that are lacking additional words. GIN indexes store only the words (lexemes) of tsvector values, and not their weight labels. Thus a table row recheck is needed when using a query that involves weights.
A GiST index is _lossy_, meaning that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches. (PostgreSQL does this automatically when needed.) GiST indexes are lossy because each document is represented in the index by a fixed-length signature. The signature is generated by hashing each word into a single bit in an n-bit string, with all these bits OR-ed together to produce an n-bit document signature. When two words hash to the same bit position there will be a false match. If all words in the query have matches (real or false) then the table row must be retrieved to see if the match is correct.
Lossiness causes performance degradation due to unnecessary fetches of table records that turn out to be false matches. Since random access to table records is slow, this limits the usefulness of GiST indexes. The likelihood of false matches depends on several factors, in particular the number of unique words, so using dictionaries to reduce this number is recommended.
Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to _normalize_ words so that different derived forms of the same word will match. A successfully normalized word is called a _lexeme_. Aside from improving search quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance. Normalization does not always have linguistic meaning and usually depends on application semantics.
Some examples of normalization:
Linguistic - Ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries remove word endings
URL locations can be canonicalized to make equivalent URLs match:
Color names can be replaced by their hexadecimal values, e.g.,red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF
If indexing numbers, we can remove some fractional digits to reduce the range of possible numbers, so for example 3.14159265359, 3.1415926, and 3.14 will be the same after normalization if only two digits are kept after the decimal point.
A dictionary is a program that accepts a token as input and returns:
an array of lexemes if the input token is known to the dictionary (notice that one token can produce more than one lexeme)
a single lexeme with the TSL_FILTER flag set, to replace the original token with a new token to be passed to subsequent dictionaries (a dictionary that does this is called a _filtering dictionary_)
an empty array if the dictionary knows the token, but it is a stop word
NULL if the dictionary does not recognize the input token
PostgreSQL provides predefined dictionaries for many languages. There are also several predefined templates that can be used to create new dictionaries with custom parameters. Each predefined dictionary template is described below. If no existing template is suitable, it is possible to create new ones; see the contrib/ area of the PostgreSQL distribution for examples.
A text search configuration binds a parser together with a set of dictionaries to process the parser's output tokens. For each token type that the parser can return, a separate list of dictionaries is specified by the configuration. When a token of that type is found by the parser, each dictionary in the list is consulted in turn, until some dictionary recognizes it as a known word. If it is identified as a stop word, or if no dictionary recognizes the token, it will be discarded and not indexed or searched for. Normally, the first dictionary that returns a non-NULL output determines the result, and any remaining dictionaries are not consulted; but a filtering dictionary can replace the given word with a modified word, which is then passed to subsequent dictionaries.
The general rule for configuring a list of dictionaries is to place first the most narrow, most specific dictionary, then the more general dictionaries, finishing with a very general dictionary, like a Snowball stemmer or simple, which recognizes everything. For example, for an astronomy-specific search (astro_en configuration) one could bind token type asciiword (ASCII word) to a synonym dictionary of astronomical terms, a general English dictionary and a Snowball English stemmer:
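A sketch of such a binding; astrosyn and english_ispell are assumed names for a synonym dictionary and an Ispell dictionary that would have to be created beforehand:

```sql
-- Most specific dictionary first, catch-all stemmer last.
ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
```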
Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching. For example, every English text contains words like a and the, so it is useless to store them in an index. However, stop words do affect the positions in tsvector, which in turn affect ranking:
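A short illustration with the built-in english configuration:

```sql
-- "in", "the", and "of" are stop words, yet they still consume positions 1, 2, and 4.
SELECT to_tsvector('english', 'in the list of stop words');
-- expected result: 'list':3 'stop':5 'word':6
```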
The missing positions 1,2,4 are because of stop words. Ranks calculated for documents with and without stop words are quite different:
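The comparison can be sketched as follows, assuming the default text search configuration is english; the two calls differ only in whether the document contains stop words:

```sql
-- Document with stop words: the matching lexemes sit further apart.
SELECT ts_rank_cd(to_tsvector('english', 'in the list of stop words'),
                  to_tsquery('list & stop'));

-- Same content without stop words: the lexemes are adjacent, so the rank is higher.
SELECT ts_rank_cd(to_tsvector('english', 'list stop words'),
                  to_tsquery('list & stop'));
```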
It is up to the specific dictionary how it treats stop words. For example, ispell dictionaries first normalize words and then look at the list of stop words, while Snowball stemmers first check the list of stop words. The reason for the different behavior is an attempt to decrease noise.
The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme. Alternatively, the dictionary can be configured to report non-stop-words as unrecognized, allowing them to be passed on to the next dictionary in the list.
Here is an example of a dictionary definition using the simple template:
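A sketch of such a definition; the dictionary name public.simple_dict is chosen here for illustration:

```sql
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english   -- base name of the stop-word file
);
```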
Here, english is the base name of a file of stop words. The file's full name will be $SHAREDIR/tsearch_data/english.stop, where $SHAREDIR means the PostgreSQL installation's shared-data directory, often /usr/local/share/postgresql (use pg_config --sharedir to determine it if you're not sure). The file format is simply a list of words, one per line. Blank lines and trailing spaces are ignored, and upper case is folded to lower case, but no other processing is done on the file contents.
Now we can test our dictionary:
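Assuming a dictionary named public.simple_dict built from the simple template with the english stop-word file, a test might look like this:

```sql
SELECT ts_lexize('public.simple_dict', 'YeS');
-- expected result: {yes} (lower-cased, not a stop word)

SELECT ts_lexize('public.simple_dict', 'The');
-- expected result: {} (stop word)
```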
We can also choose to return NULL, instead of the lower-cased word, if it is not found in the stop words file. This behavior is selected by setting the dictionary's Accept parameter to false. Continuing the example:
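A sketch of the changed behavior, again assuming the illustrative dictionary public.simple_dict exists:

```sql
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );

SELECT ts_lexize('public.simple_dict', 'YeS');
-- expected result: NULL (unrecognized, passed on to the next dictionary)

SELECT ts_lexize('public.simple_dict', 'The');
-- expected result: {} (still a stop word)
```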
With the default setting of Accept = true, it is only useful to place a simple dictionary at the end of a list of dictionaries, since it will never pass on any token to a following dictionary. Conversely, Accept = false is only useful when there is at least one following dictionary.
Most types of dictionaries rely on configuration files, such as files of stop words. These files _must_ be stored in UTF-8 encoding. They will be translated to the actual database encoding, if that is different, when they are read into the server.
Normally, a database session will read a dictionary configuration file only once, when it is first used within the session. If you modify a configuration file and want to force existing sessions to pick up the new contents, issue an ALTER TEXT SEARCH DICTIONARY command on the dictionary. This can be a “dummy” update that doesn't actually change any parameter values.
The only parameter required by the synonym template is SYNONYMS, which is the base name of its configuration file (my_synonyms in the above example). The file's full name will be $SHAREDIR/tsearch_data/my_synonyms.syn (where $SHAREDIR means the PostgreSQL installation's shared-data directory). The file format is just one line per word to be substituted, with the word followed by its synonym, separated by white space. Blank lines and trailing spaces are ignored.
The synonym template also has an optional parameter CaseSensitive, which defaults to false. When CaseSensitive is false, words in the synonym file are folded to lower case, as are input tokens. When it is true, words and tokens are not folded to lower case, but are compared as-is.
Then we will get these results:
A thesaurus dictionary (sometimes abbreviated as TZ) is a collection of words that includes information about the relationships of words and phrases, i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.
Basically a thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally, preserves the original terms for indexing as well. PostgreSQL's current implementation of the thesaurus dictionary is an extension of the synonym dictionary with added _phrase_ support. A thesaurus dictionary requires a configuration file of the following format:
where the colon (:) symbol acts as a delimiter between a phrase and its replacement.
A thesaurus dictionary uses a _subdictionary_ (which is specified in the dictionary's configuration) to normalize the input text before checking for phrase matches. It is only possible to select one subdictionary. An error is reported if the subdictionary fails to recognize a word. In that case, you should remove the use of the word or teach the subdictionary about it. You can place an asterisk (*) at the beginning of an indexed word to skip applying the subdictionary to it, but all sample words _must_ be known to the subdictionary.
The thesaurus dictionary chooses the longest match if there are multiple phrases matching the input, and ties are broken by using the last definition.
Specific stop words recognized by the subdictionary cannot be specified; instead use ? to mark the location where any stop word can appear. For example, assuming that a and the are stop words according to the subdictionary:
matches a one the two and the one a two; both would be replaced by swsw.
Since a thesaurus dictionary has the capability to recognize phrases it must remember its state and interact with the parser. A thesaurus dictionary uses these assignments to check if it should handle the next word or stop accumulation. The thesaurus dictionary must be configured carefully. For example, if the thesaurus dictionary is assigned to handle only the asciiword token, then a thesaurus dictionary definition like one 7 will not work since token type uint is not assigned to the thesaurus dictionary.
Thesauruses are used during indexing so any change in the thesaurus dictionary's parameters _requires_ reindexing. For most other dictionary types, small changes such as adding or removing stopwords do not force reindexing.
To define a new thesaurus dictionary, use the thesaurus template. For example:
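A sketch of such a definition, using the names explained in the text that follows (thesaurus_simple, mythesaurus, and pg_catalog.english_stem):

```sql
CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
    DictFile = mythesaurus,               -- mythesaurus.ths
    Dictionary = pg_catalog.english_stem  -- subdictionary for normalization
);
```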
Here:
thesaurus_simple is the new dictionary's name
mythesaurus is the base name of the thesaurus configuration file. (Its full name will be $SHAREDIR/tsearch_data/mythesaurus.ths, where $SHAREDIR means the installation shared-data directory.)
pg_catalog.english_stem is the subdictionary (here, a Snowball English stemmer) to use for thesaurus normalization. Notice that the subdictionary will have its own configuration (for example, stop words), which is not shown here.
Now it is possible to bind the thesaurus dictionary thesaurus_simple to the desired token types in a configuration, for example:
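A sketch of the binding; public.my_config is a hypothetical configuration name standing in for whichever configuration is being customized:

```sql
-- Route ASCII words and hyphenated-word tokens through the thesaurus.
ALTER TEXT SEARCH CONFIGURATION public.my_config
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_simple;
```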
Consider a simple astronomical thesaurus thesaurus_astro, which contains some astronomical word combinations:
Below we create a dictionary and bind some token types to an astronomical thesaurus and English stemmer:
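One possible setup, assuming a thesaurus file thesaurus_astro.ths under $SHAREDIR/tsearch_data (the sample entries in the comment are illustrative) and a hypothetical configuration public.my_config:

```sql
-- Illustrative contents of thesaurus_astro.ths:
--   supernovae stars : sn
--   crab nebulae : crab
CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
    TEMPLATE = thesaurus,
    DictFile = thesaurus_astro,
    Dictionary = english_stem
);

-- The thesaurus is consulted first; the stemmer handles everything else.
ALTER TEXT SEARCH CONFIGURATION public.my_config
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_astro, english_stem;
```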
Now we can see how it works. ts_lexize is not very useful for testing a thesaurus, because it treats its input as a single token. Instead we can use plainto_tsquery and to_tsvector which will break their input strings into multiple tokens:
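A sketch, assuming a hypothetical configuration public.my_config in which thesaurus_astro is mapped ahead of english_stem:

```sql
-- Both functions tokenize their input, so the two-word phrase can be matched.
SELECT plainto_tsquery('public.my_config', 'supernova star');

SELECT to_tsvector('public.my_config', 'supernova star');
```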
In principle, you can use to_tsquery if you quote the argument:
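For instance, under the same assumption of a hypothetical configuration public.my_config that uses thesaurus_astro:

```sql
-- The doubled single quotes produce a quoted phrase inside the tsquery text,
-- so the two words reach the thesaurus together.
SELECT to_tsquery('public.my_config', '''supernova star''');
```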
Notice that supernova star matches supernovae stars in thesaurus_astro because we specified the english_stem stemmer in the thesaurus definition. The stemmer removed the e and s.
To index the original phrase as well as the substitute, just include it in the right-hand part of the definition:
The Ispell dictionary template supports _morphological dictionaries_, which can normalize many different linguistic forms of a word into the same lexeme. For example, an English Ispell dictionary can match all declensions and conjugations of the search term bank, e.g., banking, banked, banks, banks', and bank's.
To create an Ispell dictionary perform these steps:
Download dictionary configuration files. OpenOffice extension files have the .oxt extension. It is necessary to extract the .aff and .dic files and change their extensions to .affix and .dict. For some dictionary files it is also necessary to convert characters to the UTF-8 encoding with commands (for example, for a Norwegian language dictionary):
Copy the files to the $SHAREDIR/tsearch_data directory.
Load the files into PostgreSQL with the following command:
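A sketch of the load step, assuming files en_us.dict and en_us.affix were copied to $SHAREDIR/tsearch_data; the dictionary name english_hunspell is chosen for illustration:

```sql
CREATE TEXT SEARCH DICTIONARY english_hunspell (
    TEMPLATE = ispell,
    DictFile = en_us,
    AffFile = en_us,
    StopWords = english
);
```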
Here, DictFile, AffFile, and StopWords specify the base names of the dictionary, affixes, and stop-words files. The stop-words file has the same format explained above for the simple dictionary type. The format of the other files is not specified here but is available from the above-mentioned web sites.
Ispell dictionaries usually recognize a limited set of words, so they should be followed by another broader dictionary; for example, a Snowball dictionary, which recognizes everything.
The .affix file of Ispell has the following structure:
And the .dict file has the following structure:
The format of the .dict file is:
In the .affix file every affix flag is described in the following format:
Here, condition has a format similar to the format of regular expressions. It can use groupings [...] and [^...]. For example, [AEIOU]Y means that the last letter of the word is "y" and the penultimate letter is "a", "e", "i", "o" or "u". [^EY] means that the last letter is neither "e" nor "y".
Ispell dictionaries support splitting compound words, which is a useful feature. Notice that the affix file should specify a special flag using the compoundwords controlled statement that marks dictionary words that can participate in compound formation:
Here are some examples for the Norwegian language:
MySpell format is a subset of Hunspell. The .affix file of Hunspell has the following structure:
The first line of an affix class is the header. The fields of an affix rule are listed after the header:
parameter name (PFX or SFX)
flag (name of the affix class)
characters to strip from the beginning (for a prefix) or end (for a suffix) of the word
the affix to add
a condition that has a format similar to the format of regular expressions
The .dict file looks like the .dict file of Ispell:
MySpell does not support compound words. Hunspell has sophisticated support for compound words. At present, PostgreSQL implements only the basic compound word operations of Hunspell.
The stopword file format is the same as already explained.
A Snowball dictionary recognizes everything, whether or not it is able to simplify the word, so it should be placed at the end of the dictionary list. It is useless to have it before any other dictionary because a token will never pass through it to the next dictionary.
Note that GIN index build time can often be improved by increasing maintenance_work_mem, while GiST index build time is not sensitive to that parameter.
Partitioning of big collections and the proper use of GIN and GiST indexes allows the implementation of very fast searches with online update. Partitioning can be done at the database level using table inheritance, or by distributing documents over servers and collecting search results using the dblink module. The latter is possible because ranking functions use only local information.
A filtering dictionary can be placed anywhere in the list, except at the end where it would be useless. Filtering dictionaries are useful to partially normalize words to simplify the task of later dictionaries. For example, a filtering dictionary could be used to remove accents from accented letters, as is done by the unaccent module.
This dictionary template is used to create dictionaries that replace a word with a synonym. Phrases are not supported (use the thesaurus template for that). A synonym dictionary can be used to overcome linguistic problems, for example, to prevent an English stemmer dictionary from reducing the word “Paris” to “pari”. It is enough to have a Paris paris line in the synonym dictionary and put it before the english_stem dictionary. For example:
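A sketch, assuming a file my_synonyms.syn under $SHAREDIR/tsearch_data that contains the line "Paris paris"; the dictionary name my_synonym is illustrative:

```sql
CREATE TEXT SEARCH DICTIONARY my_synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

-- Consult the synonym dictionary before the stemmer.
ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword
    WITH my_synonym, english_stem;

-- The lexemes column should now show the synonym rather than the stemmed form.
SELECT * FROM ts_debug('english', 'Paris');
```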
An asterisk (*) can be placed at the end of a synonym in the configuration file. This indicates that the synonym is a prefix. The asterisk is ignored when the entry is used in to_tsvector(), but when it is used in to_tsquery(), the result will be a query item with the prefix match marker. For example, suppose we have these entries in $SHAREDIR/tsearch_data/synonym_sample.syn:
The standard PostgreSQL distribution does not include any Ispell configuration files. Dictionaries for a large number of languages are available from Ispell. Also, some more modern dictionary file formats are supported: MySpell (OO < 2.0.1) and Hunspell (OO >= 2.0.2). A large list of dictionaries is available on the OpenOffice Wiki.
The Snowball dictionary template is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language. Snowball now provides stemming algorithms for many languages (see the Snowball site for more information). Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language. A Snowball dictionary requires a language parameter to identify which stemmer to use, and optionally can specify a stopword file name that gives a list of words to eliminate. (PostgreSQL's standard stopword lists are also provided by the Snowball project.) For example, there is a built-in definition equivalent to
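A sketch of that equivalent definition for the English stemmer:

```sql
CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,   -- which Snowball stemmer to use
    StopWords = english   -- base name of the stop-word file
);
```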