
12. Full Text Search

12.1. Introduction

Full Text Searching (or just _text search_) provides the capability to identify natural-language _documents_ that satisfy a _query_, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given _query terms_ and return them in order of their _similarity_ to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search considers query as a set of words and similarity as the frequency of query words in the document.

Textual search operators have existed in databases for years. PostgreSQL has ~, ~*, LIKE, and ILIKE operators for textual data types, but they lack many essential properties required by modern information systems:

There is no linguistic support, even for English. Regular expressions are not sufficient because they cannot easily handle derived words, e.g., satisfies and satisfy. You might miss documents that contain satisfies, although you probably would like to find them when searching for satisfy. It is possible to use OR to search for multiple derived forms, but this is tedious and error-prone (some words can have several thousand derivatives).
They provide no ordering (ranking) of search results, which makes them ineffective when thousands of matching documents are found.
They tend to be slow because there is no index support, so they must process all documents for every search.

Full text indexing allows documents to be _preprocessed_ and an index saved for later rapid searching. Preprocessing includes:

Parsing documents into tokens. It is useful to identify various classes of tokens, e.g., numbers, words, complex words, email addresses, so that they can be processed differently. In principle token classes depend on the specific application, but for most purposes it is adequate to use a predefined set of classes. PostgreSQL uses a _parser_ to perform this step. A standard parser is provided, and custom parsers can be created for specific needs.
Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been _normalized_ so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as s or es in English). This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates _stop words_, which are words that are so common that they are useless for searching. (In short, then, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.) PostgreSQL uses _dictionaries_ to perform this step. Various standard dictionaries are provided, and custom ones can be created for specific needs.
Storing preprocessed documents optimized for searching. For example, each document can be represented as a sorted array of normalized lexemes. Along with the lexemes it is often desirable to store positional information to use for _proximity ranking_, so that a document that contains a more "dense" region of query words is assigned a higher rank than one with scattered query words.

Dictionaries allow fine-grained control over how tokens are normalized. With appropriate dictionaries, you can:

Define stop words that should not be indexed.
Map synonyms to a single word using Ispell.
Map phrases to a single word using a thesaurus.
Map different variations of a word to a canonical form using an Ispell dictionary.
Map different variations of a word to a canonical form using Snowball stemmer rules.

A data type tsvector is provided for storing preprocessed documents, along with a type tsquery for representing processed queries. There are many functions and operators available for these data types, the most important of which is the match operator @@, which we introduce in Section 12.1.2. Full text searches can be accelerated using indexes.

12.1.1. What Is a Document?

A _document_ is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.

For searches within PostgreSQL, a document is normally a textual field within a row of a database table, or possibly a combination (concatenation) of such fields, perhaps stored in several tables or obtained dynamically. In other words, a document can be constructed from different parts for indexing and it might not be stored anywhere as a whole. For example:
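A minimal sketch of such constructed documents. The tables messages (columns mid, title, author, abstract, body) and docs (columns did, body) are hypothetical names chosen for illustration; they are not defined anywhere in this chapter.

```sql
-- Assumed, hypothetical schema:
--   messages(mid integer, title text, author text, abstract text, body text)
--   docs(did integer, body text)

-- A document built from several columns of a single row.
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;

-- A document assembled from columns stored in two tables.
SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m
JOIN docs d ON m.mid = d.did
WHERE m.mid = 12;
```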

Note

Actually, in these example queries, coalesce should be used to prevent a single NULL attribute from causing a NULL result for the whole document.

Another possibility is to store the documents as simple text files in the file system. In this case, the database can be used to store the full text index and to execute searches, and some unique identifier can be used to retrieve the document from the file system. However, retrieving files from outside the database requires superuser permissions or special function support, so this is usually less convenient than keeping all the data inside PostgreSQL. Also, keeping everything inside the database allows easy access to document metadata to assist in indexing and display.

For text search purposes, each document must be reduced to the preprocessed tsvector format. Searching and ranking are performed entirely on the tsvector representation of a document; the original text need only be retrieved when the document has been selected for display to a user. We therefore often speak of the tsvector as being the document, but of course it is only a compact representation of the full document.

12.1.2. Basic Text Matching

Full text searching in PostgreSQL is based on the match operator @@, which returns true if a tsvector (document) matches a tsquery (query). It doesn't matter which data type is written first:
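A brief sketch of the operator in both orders; the sample strings are illustrative and the english configuration is named explicitly so that the result does not depend on session settings.

```sql
-- tsvector on the left, tsquery on the right: both lexemes are present.
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;   -- true

-- tsquery on the left, tsvector on the right: 'cow' is absent.
SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;   -- false

-- Raw text normalized on both sides before matching.
SELECT to_tsvector('english', 'fat cats ate fat rats') @@ to_tsquery('english', 'fat & rat');  -- true
```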

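Observe that such a match would not succeed if a document containing the word rats were cast directly to tsvector (for example, 'fat cats ate fat rats'::tsvector) instead of being passed through to_tsvector, since then no normalization of the word rats will occur. The elements of a tsvector are lexemes, which are assumed already normalized, so rats does not match rat.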
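The difference can be sketched as follows, using an illustrative sample string and the english configuration:

```sql
-- Direct cast: the vector stores the literal lexeme 'rats',
-- which the normalized query lexeme 'rat' does not match.
SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('english', 'fat & rat');   -- false

-- to_tsvector normalizes 'rats' to 'rat', so the match succeeds.
SELECT to_tsvector('english', 'fat cats ate fat rats') @@ to_tsquery('english', 'fat & rat');  -- true
```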

The @@ operator also supports text input, allowing explicit conversion of a text string to tsvector or tsquery to be skipped in simple cases. The variants available are:

tsvector @@ tsquery
tsquery @@ tsvector
text @@ tsquery
text @@ text

The first two of these we saw already. The form text @@ tsquery is equivalent to to_tsvector(x) @@ y. The form text @@ text is equivalent to to_tsvector(x) @@ plainto_tsquery(y).

Within a tsquery, the & (AND) operator specifies that both its arguments must appear in the document to have a match. Similarly, the | (OR) operator specifies that at least one of its arguments must appear, while the ! (NOT) operator specifies that its argument must _not_ appear in order to have a match. For example, the query fat & ! rat matches documents that contain fat but not rat.

Searching for phrases is possible with the help of the <-> (FOLLOWED BY) tsquery operator, which matches only if its arguments have matches that are adjacent and in the given order. For example:
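As a hedged sketch (the sample strings and the choice of the english configuration are illustrative), both adjacency and order matter:

```sql
-- 'fatal' is immediately followed by 'error'.
SELECT to_tsvector('english', 'fatal error') @@ to_tsquery('english', 'fatal <-> error');          -- true

-- Both words occur, but not adjacent and not in this order.
SELECT to_tsvector('english', 'error is not fatal') @@ to_tsquery('english', 'fatal <-> error');   -- false
```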

There is a more general version of the FOLLOWED BY operator having the form <N>, where _N_ is an integer standing for the difference between the positions of the matching lexemes. <1> is the same as <->, while <2> allows exactly one other lexeme to appear between the matches, and so on. The phraseto_tsquery function makes use of this operator to construct a tsquery that can match a multi-word phrase when some of the words are stop words. For example:
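A short sketch with the english configuration, in which "the" is a stop word; the phrases are illustrative:

```sql
-- No stop words between the surviving lexemes: plain <-> is used.
SELECT phraseto_tsquery('english', 'cats ate rats');
-- 'cat' <-> 'ate' <-> 'rat'

-- The stop word between 'ate' and 'rats' is dropped, but its
-- position is preserved by widening the distance to <2>.
SELECT phraseto_tsquery('english', 'the cats ate the rats');
-- 'cat' <-> 'ate' <2> 'rat'
```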

A special case that's sometimes useful is that <0> can be used to require that two patterns match the same word.

Parentheses can be used to control nesting of the tsquery operators. Without parentheses, | binds least tightly, then &, then <->, and ! most tightly.

It's worth noticing that the AND/OR/NOT operators mean something subtly different when they are within the arguments of a FOLLOWED BY operator than when they are not, because within FOLLOWED BY the exact position of the match is significant. For example, normally !x matches only documents that do not contain x anywhere. But !x <-> y matches y if it is not immediately after an x; an occurrence of x elsewhere in the document does not prevent a match. Another example is that x & y normally only requires that x and y both appear somewhere in the document, but (x & y) <-> z requires x and y to match at the same place, immediately before a z. Thus this query behaves differently from x <-> z & y <-> z, which will match a document containing two separate sequences x z and y z. (This specific query is useless as written, since x and y could not match at the same place; but with more complex situations such as prefix-match patterns, a query of this form could be useful.)

12.1.3. Configurations

The above are all simple text search examples. As mentioned before, full text search functionality includes the ability to do many more things: skip indexing certain words (stop words), process synonyms, and use sophisticated parsing, e.g., parse based on more than just white space. This functionality is controlled by _text search configurations_. PostgreSQL comes with predefined configurations for many languages, and you can easily create your own configurations. (psql's \dF command shows all available configurations.)

Each text search function that depends on a configuration has an optional regconfig argument, so that the configuration to use can be specified explicitly. default_text_search_config is used only when this argument is omitted.
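To illustrate how the configuration argument changes the result, the sketch below runs the same illustrative string through two built-in configurations, english and simple, and then through the session default:

```sql
-- english: drops the stop word 'the' and stems 'rats' to 'rat'.
SELECT to_tsvector('english', 'The Fat Rats');   -- 'fat':2 'rat':3

-- simple: only folds case; no stop words, no stemming.
SELECT to_tsvector('simple', 'The Fat Rats');    -- 'fat':2 'rats':3 'the':1

-- Argument omitted: the result depends on default_text_search_config.
SHOW default_text_search_config;
SELECT to_tsvector('The Fat Rats');
```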

To make it easier to build custom text search configurations, a configuration is built up from simpler database objects. PostgreSQL's text search facility provides four types of configuration-related database objects:

_Text search parsers_ break documents into tokens and classify each token (for example, as words or numbers).
_Text search dictionaries_ convert tokens to normalized form and reject stop words.
_Text search templates_ provide the functions underlying dictionaries. (A dictionary simply specifies a template and a set of parameters for the template.)
_Text search configurations_ select a parser and a set of dictionaries to use to normalize the tokens produced by the parser.

Text search parsers and templates are built from low-level C functions; therefore it requires C programming ability to develop new ones, and superuser privileges to install one into a database. (There are examples of add-on parsers and templates in the contrib/ area of the PostgreSQL distribution.) Since dictionaries and configurations just parameterize and connect together some underlying parsers and templates, no special privilege is needed to create a new dictionary or configuration. Examples of creating custom dictionaries and configurations appear later in this chapter.

12.2. Tables and Indexes

The examples in the previous section illustrated full text matching using simple constant strings. This section shows how to search table data, optionally using indexes.

12.2.1. Searching a Table

It is possible to do a full text search without an index. A simple query to print the title of each row that contains the word friend in its body field is:
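A sketch of such a query, assuming the pgweb table used later in this section with text columns title and body:

```sql
-- Assumed schema: pgweb(title text, body text, ...)
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
```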

This will also find related words such as friends and friendly, since all these are reduced to the same normalized lexeme.

The query above specifies that the english configuration is to be used to parse and normalize the strings. Alternatively we could omit the configuration parameters:
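The same search without explicit configuration arguments might look like this (again assuming the pgweb table with columns title and body):

```sql
-- Both functions fall back to the session's default configuration.
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
```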

This query will use the configuration set by default_text_search_config.

A more complex example is to select the ten most recent documents that contain create and table in the title or body:
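A sketch of this query. The column last_mod_date is an assumption introduced here to give "most recent" a meaning; it is not named elsewhere in this text.

```sql
-- Assumed schema: pgweb(title text, body text, last_mod_date date)
SELECT title
FROM pgweb
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
```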

For clarity we omitted the coalesce function calls which would be needed to find rows that contain NULL in one of the two fields.

Although these queries will work without an index, most applications will find this approach too slow, except perhaps for occasional ad-hoc searches. Practical use of text searching usually requires creating an index.

12.2.2. Creating Indexes
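An expression index can speed up the searches shown in the previous subsection. The following is a sketch, assuming the pgweb table with a text column body; the index name pgweb_idx is a hypothetical choice. It uses a GIN index, the index type this section also uses for the separate tsvector column.

```sql
-- The configuration name is given explicitly (two-argument to_tsvector),
-- so the indexed expression does not depend on session settings.
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body));
```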

When the two-argument version of to_tsvector is used in an expression index, only a query reference that uses the 2-argument version of to_tsvector with the same configuration name will use that index. That is, for an index built on to_tsvector('english', body), WHERE to_tsvector('english', body) @@ 'a & b' can use the index, but WHERE to_tsvector(body) @@ 'a & b' cannot. This ensures that an index will be used only with the same configuration used to create the index entries.

It is possible to set up more complex expression indexes wherein the configuration name is specified by another column, e.g., an index on the expression to_tsvector(config_name, body), where config_name is a column in the pgweb table. This allows mixed configurations in the same index while recording which configuration was used for each index entry. This would be useful, for example, if the document collection contained documents in different languages. Again, queries that are meant to use the index must be phrased to match, e.g., WHERE to_tsvector(config_name, body) @@ 'a & b'.
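A sketch of such a per-row configuration index. It assumes that config_name is declared with type regconfig, so that the indexed expression is immutable; the index name pgweb_cfg_idx is hypothetical.

```sql
-- Assumed schema: pgweb(body text, config_name regconfig, ...)
CREATE INDEX pgweb_cfg_idx ON pgweb USING GIN (to_tsvector(config_name, body));

-- A query phrased to match the indexed expression.
SELECT title
FROM pgweb
WHERE to_tsvector(config_name, body) @@ 'a & b';
```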

Indexes can even concatenate columns:
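For instance, a single index over title and body could be sketched as follows (pgweb columns as assumed above; the index name pgweb_title_body_idx is hypothetical):

```sql
-- Without coalesce, a row where either column is NULL yields a NULL
-- expression and is not findable through this index.
CREATE INDEX pgweb_title_body_idx ON pgweb
    USING GIN (to_tsvector('english', title || ' ' || body));
```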

Another approach is to create a separate tsvector column to hold the output of to_tsvector. This example is a concatenation of title and body, using coalesce to ensure that one field will still be indexed when the other is NULL:

Then we create a GIN index to speed up the search:

Now we are ready to perform a fast full text search:
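The three steps just described (adding the column, indexing it, and querying it) might look like the sketch below. The names textsearchable_index_col, textsearch_idx, and last_mod_date are assumptions made for illustration; keeping the column current after later inserts and updates is not shown.

```sql
-- Step 1: add and populate a tsvector column built from title and body.
ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
UPDATE pgweb SET textsearchable_index_col =
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''));

-- Step 2: index the stored vectors.
CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);

-- Step 3: search the column directly; no to_tsvector call is needed at query time.
SELECT title
FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery('english', 'create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
```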

12.3. Controlling Text Search

To implement full text searching there must be a function to create a tsvector from a document and a tsquery from a user query. Also, we need to return results in a useful order, so we need a function that compares documents with respect to their relevance to the query. It's also important to be able to display the results nicely. PostgreSQL provides support for all of these functions.

12.3.1. Parsing Documents

PostgreSQL provides the function to_tsvector for converting a document to the tsvector data type.

to_tsvector([ config regconfig, ] document text) returns tsvector

to_tsvector parses a textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration. Here is a simple example:

SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4

In the example above we see that the resulting tsvector does not contain the words a, on, or it, the word rats became rat, and the punctuation sign - was ignored.

The to_tsvector function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries (Section 12.6) is consulted, where the list can vary depending on the token type. The first dictionary that _recognizes_ the token emits one or more normalized _lexemes_ to represent the token. For example, rats became rat because one of the dictionaries recognized that the word rats is a plural form of rat. Some words are recognized as _stop words_ (Section 12.6.1), which causes them to be ignored since they occur too frequently to be useful in searching. In our example these are a, on, and it. If no dictionary in the list recognizes the token then it is also ignored. In this example that happened to the punctuation sign - because there are in fact no dictionaries assigned for its token type (Space symbols), meaning space tokens will never be indexed. The choices of parser, dictionaries and which types of tokens to index are determined by the selected text search configuration (Section 12.7). It is possible to have many different configurations in the same database, and predefined configurations are available for various languages. In our example we used the default configuration english for the English language.

The function setweight can be used to label the entries of a tsvector with a given _weight_, where a weight is one of the letters A, B, C, or D. This is typically used to mark entries coming from different parts of a document, such as title versus body. Later, this information can be used for ranking of search results.

Because to_tsvector(NULL) will return NULL, it is recommended to use coalesce whenever a field might be null. Here is the recommended method for creating a tsvector from a structured document:

UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');

Here we have used setweight to label the source of each lexeme in the finished tsvector, and then merged the labeled tsvector values using the tsvector concatenation operator ||. (Section 12.4.1 gives details about these operations.)

12.3.2. Parsing Queries

PostgreSQL provides the functions to_tsquery, plainto_tsquery, and phraseto_tsquery for converting a query to the tsquery data type. to_tsquery offers access to more features than either plainto_tsquery or phraseto_tsquery, but it is less forgiving about its input.

to_tsquery([ config regconfig, ] querytext text) returns tsquery

to_tsquery creates a tsquery value from _querytext_, which must consist of single tokens separated by the tsquery operators & (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY), possibly grouped using parentheses. In other words, the input to to_tsquery must already follow the general rules for tsquery input, as described in Section 8.11.2. The difference is that while basic tsquery input takes the tokens at face value, to_tsquery normalizes each token into a lexeme using the specified or default configuration, and discards any tokens that are stop words according to the configuration. For example:

SELECT to_tsquery('english', 'The & Fat & Rats');
  to_tsquery   
---------------
 'fat' & 'rat'

As in basic tsquery input, weight(s) can be attached to each lexeme to restrict it to match only tsvector lexemes of those weight(s). For example:

SELECT to_tsquery('english', 'Fat | Rats:AB');
    to_tsquery    
------------------
 'fat' | 'rat':AB

Also, * can be attached to a lexeme to specify prefix matching:

SELECT to_tsquery('supern:*A & star:A*B');
        to_tsquery        
--------------------------
 'supern':*A & 'star':*AB

Such a lexeme will match any word in a tsvector that begins with the given string.

to_tsquery can also accept single-quoted phrases. This is primarily useful when the configuration includes a thesaurus dictionary that may trigger on such phrases. In the example below, a thesaurus contains the rule supernovae stars : sn:

SELECT to_tsquery('''supernovae stars'' & !crab');
  to_tsquery
---------------
 'sn' & !'crab'

Without quotes, to_tsquery will generate a syntax error for tokens that are not separated by an AND, OR, or FOLLOWED BY operator.

plainto_tsquery([ config regconfig, ] querytext text) returns tsquery

plainto_tsquery transforms the unformatted text _querytext_ to a tsquery value. The text is parsed and normalized much as for to_tsvector, then the & (AND) tsquery operator is inserted between surviving words.

Example:

SELECT plainto_tsquery('english', 'The Fat Rats');
 plainto_tsquery 
-----------------
 'fat' & 'rat'

Note that plainto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input:

SELECT plainto_tsquery('english', 'The Fat & Rats:C');
   plainto_tsquery   
---------------------
 'fat' & 'rat' & 'c'

Here, all the input punctuation was discarded as being space symbols.

phraseto_tsquery([ config regconfig, ] querytext text) returns tsquery

phraseto_tsquery behaves much like plainto_tsquery, except that it inserts the <-> (FOLLOWED BY) operator between surviving words instead of the & (AND) operator. Also, stop words are not simply discarded, but are accounted for by inserting <N> operators rather than <-> operators. This function is useful when searching for exact lexeme sequences, since the FOLLOWED BY operators check lexeme order, not just the presence of all the lexemes.

Example:

SELECT phraseto_tsquery('english', 'The Fat Rats');
 phraseto_tsquery
------------------
 'fat' <-> 'rat'

Like plainto_tsquery, the phraseto_tsquery function will not recognize tsquery operators, weight labels, or prefix-match labels in its input:

SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
      phraseto_tsquery
-----------------------------
 'fat' <-> 'rat' <-> 'c'

12.3.3. Ranking Search Results

Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur. However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.

The two ranking functions currently available are:

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

Ranks vectors based on the frequency of their matching lexemes.

ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

This function computes the _cover density_ ranking for the given document vector and query, as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries" in the journal "Information Processing and Management", 1999. Cover density is similar to ts_rank ranking except that the proximity of matching lexemes to each other is taken into consideration.

This function requires lexeme positional information to perform its calculation. Therefore, it ignores any "stripped" lexemes in the tsvector. If there are no unstripped lexemes in the input, the result will be zero. (See Section 12.4.1 for more information about the strip function and positional information in tsvectors.)

For both these functions, the optional _weights_ argument offers the ability to weigh word instances more or less heavily depending on how they are labeled. The weight arrays specify how heavily to weigh each category of word, in the order:

{D-weight, C-weight, B-weight, A-weight}

If no _weights_ are provided, then these defaults are used:

{0.1, 0.2, 0.4, 1.0}

Typically weights are used to mark words from special areas of the document, like the title or an initial abstract, so they can be treated with more or less importance than words in the document body.
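As a hedged illustration of the weights argument, the sketch below ranks one small hand-built document twice: once with the default weights spelled out, and once with the A-weight lowered so that a title match counts no more than a body match. The document text, the column aliases, and the second weight array are all illustrative choices.

```sql
WITH doc AS (
    SELECT setweight(to_tsvector('english', 'Neutrinos in the Sun'), 'A') ||
           setweight(to_tsvector('english', 'Solar neutrinos are produced in the core'), 'D')
           AS vec
)
SELECT
    -- Array order is {D-weight, C-weight, B-weight, A-weight}.
    ts_rank(ARRAY[0.1, 0.2, 0.4, 1.0]::float4[], vec,
            to_tsquery('english', 'neutrino')) AS default_weights,
    ts_rank(ARRAY[0.1, 0.2, 0.4, 0.1]::float4[], vec,
            to_tsquery('english', 'neutrino')) AS title_not_boosted
FROM doc;
```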

Since a longer document has a greater chance of containing a query term, it is reasonable to take into account document size, e.g., a hundred-word document with five instances of a search word is probably more relevant than a thousand-word document with five instances. Both ranking functions take an integer _normalization_ option that specifies whether and how a document's length should impact its rank. The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using | (for example, 2|4).

0 (the default) ignores the document length
1 divides the rank by 1 + the logarithm of the document length
2 divides the rank by the document length
4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
8 divides the rank by the number of unique words in document
16 divides the rank by 1 + the logarithm of the number of unique words in document
32 divides the rank by itself + 1

If more than one flag bit is specified, the transformations are applied in the order listed.

It is important to note that the ranking functions do not use any global information, so it is impossible to produce a fair normalization to 1% or 100% as sometimes desired. Normalization option 32 (rank/(rank+1)) can be applied to scale all ranks into the range zero to one, but of course this is just a cosmetic change; it will not affect the ordering of the search results.

Here is an example that selects only the ten highest-ranked matches:

SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218

This is the same example using normalized ranking:

SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE  query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |        rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481

Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.

12.3.4. Highlighting Results

To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL provides a function ts_headline that implements this functionality.

ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text

ts_headline accepts a document along with a query, and returns an excerpt from the document in which terms from the query are highlighted. The configuration to be used to parse the document can be specified by _config_; if _config_ is omitted, the default_text_search_config configuration is used.

If an _options_ string is specified it must consist of a comma-separated list of one or more _option=value_ pairs. The available options are:

StartSel, StopSel: the strings with which to delimit query words appearing in the document, to distinguish them from other excerpted words. You must double-quote these strings if they contain spaces or commas.
MaxWords, MinWords: these numbers determine the longest and shortest headlines to output.
ShortWord: words of this length or less will be dropped at the start and end of a headline. The default value of three eliminates common English articles.
HighlightAll: Boolean flag; if true the whole document will be used as the headline, ignoring the preceding three parameters.
MaxFragments: maximum number of text excerpts or fragments to display. The default value of zero selects a non-fragment-oriented headline generation method. A value greater than zero selects fragment-based headline generation. This method finds text fragments with as many query words as possible and stretches those fragments around the query words. As a result query words are close to the middle of each fragment and have words on each side. Each fragment will be of at most MaxWords and words of length ShortWord or less are dropped at the start and end of each fragment. If not all query words are found in the document, then a single fragment of the first MinWords in the document will be displayed.
FragmentDelimiter: When more than one fragment is displayed, the fragments will be separated by this string.

Any unspecified options receive these defaults:

StartSel=<b>, StopSel=</b>,
MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE,
MaxFragments=0, FragmentDelimiter=" ... "

For example:

SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query & similarity'));
                        ts_headline                         
------------------------------------------------------------
 containing given <b>query</b> terms
 and return them in order of their <b>similarity</b> to the
 <b>query</b>.

SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query & similarity'),
  'StartSel = <, StopSel = >');
                      ts_headline                      
-------------------------------------------------------
 containing given <query> terms
 and return them in order of their <similarity> to the
 <query>.

ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care.

12.4. Additional Features

This section describes additional functions and operators that are useful in connection with text search.

12.4.1. Manipulating Documents

Section 12.3.1 showed how raw textual documents can be converted into tsvector values. PostgreSQL also provides functions and operators that can be used to manipulate documents that are already in tsvector form.

tsvector || tsvector

The tsvector concatenation operator returns a vector which combines the lexemes and positional information of the two vectors given as arguments. Positions and weight labels are retained during the concatenation. Positions appearing in the right-hand vector are offset by the largest position mentioned in the left-hand vector, so that the result is nearly equivalent to the result of performing to_tsvector on the concatenation of the two original document strings. (The equivalence is not exact, because any stop-words removed from the end of the left-hand argument will not affect the result, whereas they would have affected the positions of the lexemes in the right-hand argument if textual concatenation were used.)

One advantage of using concatenation in the vector form, rather than concatenating text before applying to_tsvector, is that you can use different configurations to parse different sections of the document. Also, because the setweight function marks all lexemes of the given vector the same way, it is necessary to parse the text and do setweight before concatenating if you want to label different parts of the document with different weights.

setweight(vector tsvector, weight "char") returns tsvector

setweight returns a copy of the input vector in which every position has been labeled with the given weight, either A, B, C, or D. (D is the default for new vectors and as such is not displayed on output.) These labels are retained when vectors are concatenated, allowing words from different parts of a document to be weighted differently by ranking functions.

Note that weight labels apply to _positions_, not _lexemes_. If the input vector has been stripped of positions then setweight does nothing.

length(vector tsvector) returns integer

Returns the number of lexemes stored in the vector.

strip(vector tsvector) returns tsvector

Returns a vector that lists the same lexemes as the given vector, but lacks any position or weight information. The result is usually much smaller than an unstripped vector, but it is also less useful. Relevance ranking does not work as well on stripped vectors as unstripped ones. Also, the <-> (FOLLOWED BY) tsquery operator will never match stripped input, since it cannot determine the distance between lexeme occurrences.
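A short sketch of length and strip on a literal vector, followed by a phrase search against stripped and unstripped input. The sample strings are arbitrary; the last two queries only illustrate the FOLLOWED BY behavior described above.

```sql
-- Three distinct lexemes are stored, regardless of how many positions each has.
SELECT length('fat:2,4 cat:3 rat:5A'::tsvector);   -- 3

-- Positions and weights are dropped; only the lexemes remain.
SELECT strip('fat:2,4 cat:3 rat:5A'::tsvector);    -- 'cat' 'fat' 'rat'

-- A phrase query needs positions, so per the text above it should match
-- the unstripped vector but not the stripped one.
SELECT to_tsvector('english', 'fat cat') @@ to_tsquery('english', 'fat <-> cat');
SELECT strip(to_tsvector('english', 'fat cat')) @@ to_tsquery('english', 'fat <-> cat');
```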

A full list of tsvector-related functions is available in Table 9.41.

12.4.2. Manipulating Queries

Section 12.3.2 showed how raw textual queries can be converted into tsquery values. PostgreSQL also provides functions and operators that can be used to manipulate queries that are already in tsquery form.

tsquery && tsquery

Returns the AND-combination of the two given queries.

tsquery || tsquery

Returns the OR-combination of the two given queries.

!! tsquery

Returns the negation (NOT) of the given query.
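The three Boolean operators can be tried on literal queries, as in the sketch below. The lexemes are arbitrary, and the printed form of the results may differ between PostgreSQL versions, so no output is shown.

```sql
SELECT 'fat | rat'::tsquery && 'cat'::tsquery;   -- AND-combination of the two queries
SELECT 'fat | rat'::tsquery || 'cat'::tsquery;   -- OR-combination of the two queries
SELECT !! 'cat'::tsquery;                        -- negation of the query
```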

tsquery <-> tsquery

Returns a query that searches for a match to the first given query immediately followed by a match to the second given query, using the <-> (FOLLOWED BY) tsquery operator. For example:

SELECT to_tsquery('fat') <-> to_tsquery('cat | rat');
             ?column?
-----------------------------------
 'fat' <-> 'cat' | 'fat' <-> 'rat'

tsquery_phrase(query1 tsquery, query2 tsquery [, distance integer ]) returns tsquery

Returns a query that searches for a match to the first given query followed by a match to the second given query at a distance of exactly _distance_ lexemes, using the <N> tsquery operator. For example:

SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10);
  tsquery_phrase
------------------
 'fat' <10> 'cat'

numnode(query tsquery) returns integer

Returns the number of nodes (lexemes plus operators) in a tsquery. This function is useful to determine if the _query_ is meaningful (returns > 0), or contains only stop words (returns 0). Examples:

SELECT numnode(plainto_tsquery('the any'));
NOTICE:  query contains only stopword(s) or doesn't contain lexeme(s), ignored
 numnode
---------
       0

SELECT numnode('foo & bar'::tsquery);
 numnode
---------
       3

querytree(query tsquery) returns text

Returns the portion of a tsquery that can be used for searching an index. This function is useful for detecting unindexable queries, for example those containing only stop words or only negated terms. For example:

SELECT querytree(to_tsquery('!defined'));
 querytree
-----------

12.4.2.1. Query Rewriting

The ts_rewrite family of functions search a given tsquery for occurrences of a target subquery, and replace each occurrence with a substitute subquery. In essence this operation is a tsquery-specific version of substring replacement. A target and substitute combination can be thought of as a _query rewrite rule_. A collection of such rewrite rules can be a powerful search aid. For example, you can expand the search using synonyms (e.g., new york, big apple, nyc, gotham) or narrow the search to direct the user to some hot topic. There is some overlap in functionality between this feature and thesaurus dictionaries (Section 12.6.4). However, you can modify a set of rewrite rules on-the-fly without reindexing, whereas updating a thesaurus requires reindexing to be effective.

ts_rewrite (query tsquery, target tsquery, substitute tsquery) returns tsquery

This form of ts_rewrite simply applies a single rewrite rule: _target_ is replaced by _substitute_ wherever it appears in _query_. For example:

SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
 ts_rewrite
------------
 'b' & 'c'

ts_rewrite (query tsquery, select text) returns tsquery

This form of ts_rewrite accepts a starting _query_ and a SQL select command, which is given as a text string. The select must yield two columns of tsquery type. For each row of the select result, occurrences of the first column value (the target) are replaced by the second column value (the substitute) within the current _query_ value. For example:

CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
INSERT INTO aliases VALUES('a', 'c');

SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases');
 ts_rewrite
------------
 'b' & 'c'

Note that when multiple rewrite rules are applied in this way, the order of application can be important; so in practice you will want the source query to ORDER BY some ordering key.
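One way to make the order explicit is to store an ordering key next to each rule. The sketch below assumes a hypothetical table ordered_aliases with a hypothetical rule_order column; neither is part of PostgreSQL or of the examples above.

```sql
-- rule_order is a hypothetical column that fixes the order in which rules are applied.
CREATE TABLE ordered_aliases (
    rule_order integer NOT NULL,
    t          tsquery PRIMARY KEY,
    s          tsquery
);

-- The second rule rewrites the output of the first, so order matters here.
INSERT INTO ordered_aliases VALUES (1, 'a', 'c'), (2, 'c', 'd');

SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t, s FROM ordered_aliases ORDER BY rule_order');
```

With the rules applied in this order, the 'c' produced by the first rule is still available for the second rule to replace; applying them in the opposite order would be expected to give a different result.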

Let's consider a real-life astronomical example. We'll expand the query supernovae using table-driven rewriting rules:

CREATE TABLE aliases (t tsquery primary key, s tsquery);
INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
           ts_rewrite            
---------------------------------
 'crab' & ( 'supernova' | 'sn' )

We can change the rewriting rules just by updating the table:

UPDATE aliases
SET s = to_tsquery('supernovae|sn & !nebulae')
WHERE t = to_tsquery('supernovae');

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
                 ts_rewrite                  
---------------------------------------------
 'crab' & ( 'supernova' | 'sn' & !'nebula' )

Rewriting can be slow when there are many rewriting rules, since it checks every rule for a possible match. To filter out obvious non-candidate rules we can use the containment operators for the tsquery type. In the example below, we select only those rules which might match the original query:

SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t,s FROM aliases WHERE ''a & b''::tsquery @> t');
 ts_rewrite
------------
 'b' & 'c'

12.4.3. Triggers for Automatic Updates

When using a separate column to store the tsvector representation of your documents, it is necessary to create a trigger to update the tsvector column when the document content columns change. Two built-in trigger functions are available for this, or you can write your own.

tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ... ])
tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_column_name [, ... ])

These trigger functions automatically compute a tsvector column from one or more textual columns, under the control of parameters specified in the CREATE TRIGGER command. An example of their use is:

CREATE TABLE messages (
    title       text,
    body        text,
    tsv         tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);

INSERT INTO messages VALUES('title here', 'the body text is here');

SELECT * FROM messages;
   title    |         body          |            tsv             
------------+-----------------------+----------------------------
 title here | the body text is here | 'bodi':4 'text':5 'titl':1

SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body');
   title    |         body          
------------+-----------------------
 title here | the body text is here

Having created this trigger, any change in title or body will automatically be reflected into tsv, without the application having to worry about it.

The first trigger argument must be the name of the tsvector column to be updated. The second argument specifies the text search configuration to be used to perform the conversion. For tsvector_update_trigger, the configuration name is simply given as the second trigger argument. It must be schema-qualified as shown above, so that the trigger behavior will not change with changes in search_path. For tsvector_update_trigger_column, the second trigger argument is the name of another table column, which must be of type regconfig. This allows a per-row selection of configuration to be made. The remaining argument(s) are the names of textual columns (of type text, varchar, or char). These will be included in the document in the order given. NULL values will be skipped (but the other columns will still be indexed).
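The per-row variant could be set up roughly as follows. The table name messages_ml, the column name config, and the trigger name are hypothetical, chosen only for this sketch.

```sql
-- Hypothetical table: each row carries its own text search configuration.
CREATE TABLE messages_ml (
    title   text,
    body    text,
    config  regconfig NOT NULL DEFAULT 'pg_catalog.english',
    tsv     tsvector
);

-- The second trigger argument names the regconfig column rather than a configuration.
CREATE TRIGGER tsvectorupdate_ml BEFORE INSERT OR UPDATE
ON messages_ml FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger_column(tsv, config, title, body);

INSERT INTO messages_ml (title, body, config)
VALUES ('title here', 'the body text is here', 'pg_catalog.simple');
```

A row inserted with the simple configuration should keep words unstemmed, while rows that use the default would be stemmed with the english configuration.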

A limitation of these built-in triggers is that they treat all the input columns alike. To process columns differently (for example, to weight title differently from body) it is necessary to write a custom trigger. Here is an example using PL/pgSQL as the trigger language:

CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
begin
  new.tsv :=
     setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
     setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
  return new;
end
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
    ON messages FOR EACH ROW EXECUTE PROCEDURE messages_trigger();

Keep in mind that it is important to specify the configuration name explicitly when creating tsvector values inside triggers, so that the column's contents will not be affected by changes to default_text_search_config. Failure to do this is likely to lead to problems such as search results changing after a dump and reload.

12.4.4. Gathering Document Statistics

The function ts_stat is useful for checking your configuration and for finding stop-word candidates.

ts_stat(sqlquery text, [ weights text, ] OUT word text, OUT ndoc integer, OUT nentry integer) returns setof record

_sqlquery_ is a text value containing an SQL query which must return a single tsvector column. ts_stat executes the query and returns statistics about each distinct lexeme (word) contained in the tsvector data. The columns returned are

word text: the value of a lexeme
ndoc integer: number of documents (tsvectors) the word occurred in
nentry integer: total number of occurrences of the word

If _weights_ is supplied, only occurrences having one of those weights are counted.

For example, to find the ten most frequent words in a document collection:

SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;

The same, but counting only word occurrences with weight A or B:

SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
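For the stop-word use case mentioned at the start of this section, one possible approach is to list lexemes that occur in nearly every document. The sketch below reuses the apod table from the examples above; the 90% threshold is an arbitrary assumption, not a recommended value.

```sql
-- Lexemes that occur in more than 90% of documents are stop-word candidates.
-- The 0.9 threshold is an arbitrary choice for illustration.
SELECT word, ndoc
FROM ts_stat('SELECT vector FROM apod')
WHERE ndoc > 0.9 * (SELECT count(*) FROM apod)
ORDER BY ndoc DESC, word;
```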

12.5. Parsers

Text search parsers are responsible for splitting raw document text into _tokens_ and identifying each token's type, where the set of possible types is defined by the parser itself. Note that a parser does not modify the text at all; it simply identifies plausible word boundaries. Because of this limited scope, there is less need for application-specific custom parsers than there is for custom dictionaries. At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide range of applications.

The built-in parser is named pg_catalog.default. It recognizes 23 token types, shown in Table 12.1.

Table 12.1. Default Parser's Token Types

Alias

Description

Example

Note

The parser's notion of a “letter” is determined by the database's locale setting, specifically lc_ctype. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them. In most European languages, token types word and asciiword should be treated alike.

email does not support all valid email characters as defined by RFC 5322. Specifically, the only non-alphanumeric characters supported for email user names are period, dash, and underscore.
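The token types a parser recognizes can also be listed from within the database. A minimal sketch using the built-in ts_token_type function with the default parser:

```sql
-- Lists the numeric id, alias, and description of every token type
-- known to the built-in parser.
SELECT * FROM ts_token_type('default');
```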

It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component:
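A query along the following lines can be used to observe this. The input string is an arbitrary hyphenated word chosen for illustration, and the output is omitted here.

```sql
-- ts_debug reports each token the parser produces, including both the whole
-- hyphenated word and its individual parts.
SELECT alias, description, token
FROM ts_debug('english', 'foo-bar-beta1');
```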

This behavior is desirable since it allows searches to work for both the whole compound word and for components. Here is another instructive example:

12.6. Dictionaries

Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to _normalize_ words so that different derived forms of the same word will match. A successfully normalized word is called a _lexeme_. Aside from improving search quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance. Normalization does not always have linguistic meaning and usually depends on application semantics.

Some examples of normalization:

Linguistic - Ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries remove word endings
URL locations can be canonicalized to make equivalent URLs match.
Color names can be replaced by their hexadecimal values, e.g., red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF
If indexing numbers, we can remove some fractional digits to reduce the range of possible numbers, so for example 3.14159265359, 3.1415926, and 3.14 will be the same after normalization if only two digits are kept after the decimal point.

A dictionary is a program that accepts a token as input and returns:

an array of lexemes if the input token is known to the dictionary (notice that one token can produce more than one lexeme)
a single lexeme with the TSL_FILTER flag set, to replace the original token with a new token to be passed to subsequent dictionaries (a dictionary that does this is called a _filtering dictionary_)
an empty array if the dictionary knows the token, but it is a stop word
NULL if the dictionary does not recognize the input token
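Two of these outcomes can be observed with ts_lexize and the built-in english_stem dictionary, as sketched below. A Snowball stemmer recognizes every token, so it does not demonstrate the NULL case.

```sql
-- A known word is reduced to its lexeme.
SELECT ts_lexize('english_stem', 'stars');   -- {star}

-- A stop word yields an empty array.
SELECT ts_lexize('english_stem', 'a');       -- {}
```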

PostgreSQL provides predefined dictionaries for many languages. There are also several predefined templates that can be used to create new dictionaries with custom parameters. Each predefined dictionary template is described below. If no existing template is suitable, it is possible to create new ones; see the contrib/ area of the PostgreSQL distribution for examples.

A text search configuration binds a parser together with a set of dictionaries to process the parser's output tokens. For each token type that the parser can return, a separate list of dictionaries is specified by the configuration. When a token of that type is found by the parser, each dictionary in the list is consulted in turn, until some dictionary recognizes it as a known word. If it is identified as a stop word, or if no dictionary recognizes the token, it will be discarded and not indexed or searched for. Normally, the first dictionary that returns a non-NULL output determines the result, and any remaining dictionaries are not consulted; but a filtering dictionary can replace the given word with a modified word, which is then passed to subsequent dictionaries.

The general rule for configuring a list of dictionaries is to place first the most narrow, most specific dictionary, then the more general dictionaries, finishing with a very general dictionary, like a Snowball stemmer or simple, which recognizes everything. For example, for an astronomy-specific search (astro_en configuration) one could bind token type asciiword (ASCII word) to a synonym dictionary of astronomical terms, a general English dictionary and a Snowball English stemmer:

ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;

A filtering dictionary can be placed anywhere in the list, except at the end where it would be useless. Filtering dictionaries are useful to partially normalize words to simplify the task of later dictionaries. For example, a filtering dictionary could be used to remove accents from accented letters, as is done by the unaccent module.

12.6.1. Stop Words

Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching. For example, every English text contains words like a and the, so it is useless to store them in an index. However, stop words do affect the positions in tsvector, which in turn affect ranking:

SELECT to_tsvector('english','in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6

The missing positions 1,2,4 are because of stop words. Ranks calculated for documents with and without stop words are quite different:

SELECT ts_rank_cd (to_tsvector('english','in the list of stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
       0.05

SELECT ts_rank_cd (to_tsvector('english','list stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
        0.1

It is up to the specific dictionary how it treats stop words. For example, ispell dictionaries first normalize words and then look at the list of stop words, while Snowball stemmers first check the list of stop words. The reason for the different behavior is an attempt to decrease noise.

12.6.2. Simple Dictionary

The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme. Alternatively, the dictionary can be configured to report non-stop-words as unrecognized, allowing them to be passed on to the next dictionary in the list.

Here is an example of a dictionary definition using the simple template:

CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);

Here, english is the base name of a file of stop words. The file's full name will be $SHAREDIR/tsearch_data/english.stop, where $SHAREDIR means the PostgreSQL installation's shared-data directory, often /usr/local/share/postgresql (use pg_config --sharedir to determine it if you're not sure). The file format is simply a list of words, one per line. Blank lines and trailing spaces are ignored, and upper case is folded to lower case, but no other processing is done on the file contents.

Now we can test our dictionary:

SELECT ts_lexize('public.simple_dict','YeS');
 ts_lexize
-----------
 {yes}

SELECT ts_lexize('public.simple_dict','The');
 ts_lexize
-----------
 {}

We can also choose to return NULL, instead of the lower-cased word, if it is not found in the stop words file. This behavior is selected by setting the dictionary's Accept parameter to false. Continuing the example:

ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );

SELECT ts_lexize('public.simple_dict','YeS');
 ts_lexize
-----------


SELECT ts_lexize('public.simple_dict','The');
 ts_lexize
-----------
 {}

With the default setting of Accept = true, it is only useful to place a simple dictionary at the end of a list of dictionaries, since it will never pass on any token to a following dictionary. Conversely, Accept = false is only useful when there is at least one following dictionary.
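A sketch of the second arrangement follows, assuming public.simple_dict has Accept = false as set above. The configuration name my_config is hypothetical; it is copied from the built-in english configuration so that the original is left untouched.

```sql
-- my_config is a hypothetical configuration used only for this illustration.
CREATE TEXT SEARCH CONFIGURATION public.my_config ( COPY = pg_catalog.english );

-- simple_dict discards stop words and reports everything else as unrecognized,
-- so those tokens fall through to english_stem.
ALTER TEXT SEARCH CONFIGURATION public.my_config
    ALTER MAPPING FOR asciiword
    WITH public.simple_dict, english_stem;

SELECT * FROM ts_debug('public.my_config', 'The stars');
```

The dictionary column of the ts_debug output should show which dictionary handled each token.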

Caution

Most types of dictionaries rely on configuration files, such as files of stop words. These files _must_ be stored in UTF-8 encoding. They will be translated to the actual database encoding, if that is different, when they are read into the server.

Caution

Normally, a database session will read a dictionary configuration file only once, when it is first used within the session. If you modify a configuration file and want to force existing sessions to pick up the new contents, issue an ALTER TEXT SEARCH DICTIONARY command on the dictionary. This can be a “dummy” update that doesn't actually change any parameter values.
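One simple form of such an update is to set a parameter to the value it already has. The sketch below does this for the simple_dict dictionary defined earlier in this section.

```sql
-- Re-assigning the existing StopWords value changes nothing in the definition,
-- but it is still an ALTER command on the dictionary, so sessions should
-- re-read the stop-word file afterwards.
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( StopWords = english );
```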

12.6.3. Synonym Dictionary

This dictionary template is used to create dictionaries that replace a word with a synonym. Phrases are not supported (use the thesaurus template (Section 12.6.4) for that). A synonym dictionary can be used to overcome linguistic problems, for example, to prevent an English stemmer dictionary from reducing the word “Paris” to “pari”. It is enough to have a Paris paris line in the synonym dictionary and put it before the english_stem dictionary. For example:

SELECT * FROM ts_debug('english', 'Paris');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes 
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}

CREATE TEXT SEARCH DICTIONARY my_synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword
    WITH my_synonym, english_stem;

SELECT * FROM ts_debug('english', 'Paris');
   alias   |   description   | token |       dictionaries        | dictionary | lexemes 
-----------+-----------------+-------+---------------------------+------------+---------
 asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}

The only parameter required by the synonym template is SYNONYMS, which is the base name of its configuration file (my_synonyms in the above example). The file's full name will be $SHAREDIR/tsearch_data/my_synonyms.syn (where $SHAREDIR means the PostgreSQL installation's shared-data directory). The file format is just one line per word to be substituted, with the word followed by its synonym, separated by white space. Blank lines and trailing spaces are ignored.

The synonym template also has an optional parameter CaseSensitive, which defaults to false. When CaseSensitive is false, words in the synonym file are folded to lower case, as are input tokens. When it is true, words and tokens are not folded to lower case, but are compared as-is.
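A case-sensitive variant of the earlier dictionary could be declared as sketched below. The name my_synonym_cs is hypothetical; the my_synonyms file is the one assumed in the example above.

```sql
-- Hypothetical dictionary name; reuses the my_synonyms file from the example above.
CREATE TEXT SEARCH DICTIONARY my_synonym_cs (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms,
    CaseSensitive = true
);
```

With this setting, a file entry Paris paris would be expected to match the token Paris only when it is spelled with exactly that capitalization.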

An asterisk (*) can be placed at the end of a synonym in the configuration file. This indicates that the synonym is a prefix. The asterisk is ignored when the entry is used in to_tsvector(), but when it is used in to_tsquery(), the result will be a query item with the prefix match marker (see Section 12.3.2). For example, suppose we have these entries in $SHAREDIR/tsearch_data/synonym_sample.syn:

postgres        pgsql
postgresql      pgsql
postgre pgsql
gogle   googl
indices index*

Then we will get these results:

mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
mydb=# SELECT ts_lexize('syn','indices');
 ts_lexize
-----------
 {index}
(1 row)

mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
mydb=# SELECT to_tsvector('tst','indices');
 to_tsvector
-------------
 'index':1
(1 row)

mydb=# SELECT to_tsquery('tst','indices');
 to_tsquery
------------
 'index':*
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector;
            tsvector             
---------------------------------
 'are' 'indexes' 'useful' 'very'
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
 ?column?
----------
 t
(1 row)

12.6.4. Thesaurus Dictionary

A thesaurus dictionary (sometimes abbreviated as TZ) is a collection of words that includes information about the relationships of words and phrases, i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.

Basically a thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally, preserves the original terms for indexing as well. PostgreSQL's current implementation of the thesaurus dictionary is an extension of the synonym dictionary with added _phrase_ support. A thesaurus dictionary requires a configuration file of the following format:

# this is a comment
sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...

where the colon (:) symbol acts as a delimiter between a phrase and its replacement.

A thesaurus dictionary uses a _subdictionary_ (which is specified in the dictionary's configuration) to normalize the input text before checking for phrase matches. It is only possible to select one subdictionary. An error is reported if the subdictionary fails to recognize a word. In that case, you should remove the use of the word or teach the subdictionary about it. You can place an asterisk (*) at the beginning of an indexed word to skip applying the subdictionary to it, but all sample words _must_ be known to the subdictionary.

The thesaurus dictionary chooses the longest match if there are multiple phrases matching the input, and ties are broken by using the last definition.

Specific stop words recognized by the subdictionary cannot be specified; instead use ? to mark the location where any stop word can appear. For example, assuming that a and the are stop words according to the subdictionary:

? one ? two : swsw

matches a one the two and the one a two; both would be replaced by swsw.

Since a thesaurus dictionary has the capability to recognize phrases it must remember its state and interact with the parser. A thesaurus dictionary uses the token-type assignments in the configuration to check if it should handle the next word or stop accumulation. The thesaurus dictionary must be configured carefully. For example, if the thesaurus dictionary is assigned to handle only the asciiword token, then a thesaurus dictionary definition like one 7 will not work since token type uint is not assigned to the thesaurus dictionary.

Caution

Thesauruses are used during indexing so any change in the thesaurus dictionary's parameters _requires_ reindexing. For most other dictionary types, small changes such as adding or removing stopwords do not force reindexing.

12.6.4.1. Thesaurus Configuration

To define a new thesaurus dictionary, use the thesaurus template. For example:

CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
    DictFile = mythesaurus,
    Dictionary = pg_catalog.english_stem
);

Here:

thesaurus_simple is the new dictionary's name
mythesaurus is the base name of the thesaurus configuration file. (Its full name will be $SHAREDIR/tsearch_data/mythesaurus.ths, where $SHAREDIR means the installation shared-data directory.)
pg_catalog.english_stem is the subdictionary (here, a Snowball English stemmer) to use for thesaurus normalization. Notice that the subdictionary will have its own configuration (for example, stop words), which is not shown here.

Now it is possible to bind the thesaurus dictionary thesaurus_simple to the desired token types in a configuration, for example:

ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_simple;

12.6.4.2. Thesaurus Example

Consider a simple astronomical thesaurus thesaurus_astro, which contains some astronomical word combinations:

supernovae stars : sn
crab nebulae : crab

Below we create a dictionary and bind some token types to an astronomical thesaurus and English stemmer:

CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
    TEMPLATE = thesaurus,
    DictFile = thesaurus_astro,
    Dictionary = english_stem
);

ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_astro, english_stem;

Now we can see how it works. ts_lexize is not very useful for testing a thesaurus, because it treats its input as a single token. Instead we can use plainto_tsquery and to_tsvector which will break their input strings into multiple tokens:

SELECT plainto_tsquery('supernova star');
 plainto_tsquery
-----------------
 'sn'

SELECT to_tsvector('supernova star');
 to_tsvector
-------------
 'sn':1

In principle, you can use to_tsquery if you quote the argument:

SELECT to_tsquery('''supernova star''');
 to_tsquery
------------
 'sn'

Notice that supernova star matches supernovae stars in thesaurus_astro because we specified the english_stem stemmer in the thesaurus definition. The stemmer removed the e and s.

To index the original phrase as well as the substitute, just include it in the right-hand part of the definition:

supernovae stars : sn supernovae stars

SELECT plainto_tsquery('supernova star');
       plainto_tsquery
-----------------------------
 'sn' & 'supernova' & 'star'

12.6.5. Ispell Dictionary

The Ispell dictionary template supports _morphological dictionaries_, which can normalize many different linguistic forms of a word into the same lexeme. For example, an English Ispell dictionary can match all declensions and conjugations of the search term bank, e.g., banking, banked, banks, banks', and bank's.

The standard PostgreSQL distribution does not include any Ispell configuration files. Dictionaries for a large number of languages are available from Ispell. Also, some more modern dictionary file formats are supported: MySpell (OO < 2.0.1) and Hunspell (OO >= 2.0.2). A large list of dictionaries is available on the OpenOffice Wiki.

To create an Ispell dictionary, perform these steps:

download the dictionary configuration files. OpenOffice extension files have the .oxt extension. Extract the .aff and .dic files and change their extensions to .affix and .dict. Some dictionary files must also be converted to the UTF-8 encoding, using commands such as these (for a Norwegian language dictionary):
```
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
```
copy files to the $SHAREDIR/tsearch_data directory

load files into PostgreSQL with the following command:

CREATE TEXT SEARCH DICTIONARY english_hunspell (
    TEMPLATE = ispell,
    DictFile = en_us,
    AffFile = en_us,
    Stopwords = english);

Here, DictFile, AffFile, and StopWords specify the base names of the dictionary, affixes, and stop-words files. The stop-words file has the same format explained above for the simple dictionary type. The format of the other files is not specified here but is available from the above-mentioned web sites.

Ispell dictionaries usually recognize a limited set of words, so they should be followed by another broader dictionary; for example, a Snowball dictionary, which recognizes everything.
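Such a chain might be set up as sketched below, assuming the english_hunspell dictionary created above. The configuration name english_ispell_cfg is hypothetical and is copied from the built-in english configuration.

```sql
-- Hypothetical configuration name; copied so the built-in one is left unchanged.
CREATE TEXT SEARCH CONFIGURATION public.english_ispell_cfg ( COPY = pg_catalog.english );

-- Words the Ispell dictionary does not recognize fall through to the stemmer.
ALTER TEXT SEARCH CONFIGURATION public.english_ispell_cfg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_hunspell, english_stem;
```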

The .affix file of Ispell has the following structure:

prefixes
flag *A:
    .           >   RE      # As in enter > reenter
suffixes
flag T:
    E           >   ST      # As in late > latest
    [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
    [AEIOU]Y    >   EST     # As in gray > grayest
    [^EY]       >   EST     # As in small > smallest

And the .dict file has the following structure:

lapse/ADGRS
lard/DGRS
large/PRTY
lark/MRS

The format of the .dict file is:

basic_form/affix_class_name

In the .affix file every affix flag is described in the following format:

condition > [-stripping_letters,] adding_affix

Here, condition has a format similar to the format of regular expressions. It can use groupings [...] and [^...]. For example, [AEIOU]Y means that the last letter of the word is "y" and the penultimate letter is "a", "e", "i", "o" or "u". [^EY] means that the last letter is neither "e" nor "y".

Ispell dictionaries support splitting compound words, a useful feature. Notice that the affix file should specify a special flag using the compoundwords controlled statement that marks dictionary words that can participate in compound formation:

compoundwords  controlled z

Here are some examples for the Norwegian language:

SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
   {over,buljong,terning,pakk,mester,assistent}
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
   {sjokoladefabrikk,sjokolade,fabrikk}

MySpell format is a subset of Hunspell. The .affix file of Hunspell has the following structure:

PFX A Y 1
PFX A   0     re         .
SFX T N 4
SFX T   0     st         e
SFX T   y     iest       [^aeiou]y
SFX T   0     est        [aeiou]y
SFX T   0     est        [^ey]

The first line of an affix class is the header. The fields of an affix rule are listed after the header:

parameter name (PFX or SFX)
flag (name of the affix class)
stripping characters from beginning (at prefix) or end (at suffix) of the word
adding affix
condition that has a format similar to the format of regular expressions.

The .dict file looks like the .dict file of Ispell:

larder/M
lardy/RT
large/RSPMYT
largehearted

Note

MySpell does not support compound words. Hunspell has sophisticated support for compound words. At present, PostgreSQL implements only the basic compound word operations of Hunspell.

12.6.6. Snowball Dictionary

The Snowball dictionary template is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language. Snowball now provides stemming algorithms for many languages (see the Snowball site for more information). Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language. A Snowball dictionary requires a language parameter to identify which stemmer to use, and optionally can specify a stopword file name that gives a list of words to eliminate. (PostgreSQL's standard stopword lists are also provided by the Snowball project.) For example, there is a built-in definition equivalent to

CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);

The stopword file format is the same as already explained.

A Snowball dictionary recognizes everything, whether or not it is able to simplify the word, so it should be placed at the end of the dictionary list. It is useless to have it before any other dictionary because a token will never pass through it to the next dictionary.

12.7. Configuration Example

A text search configuration specifies all options necessary to transform a document into a tsvector: the parser to use to break text into tokens, and the dictionaries to use to transform each token into a lexeme. Every call of to_tsvector or to_tsquery needs a text search configuration to perform its processing. The configuration parameter default_text_search_config specifies the name of the default configuration, which is the one used by text search functions if an explicit configuration parameter is omitted. It can be set in postgresql.conf, or set for an individual session using the SET command.

Several predefined text search configurations are available, and you can create custom configurations easily. To facilitate management of text search objects, a set of SQL commands is available, and there are several psql commands that display information about text search objects (Section 12.10).

As an example we will create a configuration pg, starting by duplicating the built-in english configuration:
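A minimal sketch of this step, assuming the new configuration is created in the public schema (as stated at the end of this section) as a copy of pg_catalog.english:

```sql
-- Assumption: pg is created in schema public and starts out identical to
-- the built-in english configuration.
CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );
```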

We will use a PostgreSQL-specific synonym list and store it in $SHAREDIR/tsearch_data/pg_dict.syn. The file contents look like:

We define the synonym dictionary like this:
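The following sketch shows one possible shape for both the file and the dictionary. The three synonym entries and the dictionary name pg_dict are assumptions made for illustration; only the file location comes from the text above.

```sql
-- Assumed contents of $SHAREDIR/tsearch_data/pg_dict.syn
-- (one "word synonym" pair per line, separated by white space):
--
--   postgres    pg
--   pgsql       pg
--   postgresql  pg

-- pg_dict is a hypothetical dictionary name; SYNONYMS takes the file's base
-- name, without the .syn extension.
CREATE TEXT SEARCH DICTIONARY pg_dict (
    TEMPLATE = synonym,
    SYNONYMS = pg_dict
);
```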

Next we register the Ispell dictionary english_ispell, which has its own configuration files:
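A sketch of the registration, mirroring the english_ispell definition used in Section 12.8.1 and assuming that english.dict, english.affix, and english.stop are installed under $SHAREDIR/tsearch_data:

```sql
-- The three parameters name files by their base name; the extensions
-- (.dict, .affix, .stop) are implied by the parameter.
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE  = ispell,
    DictFile  = english,
    AffFile   = english,
    StopWords = english
);
```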

Now we can set up the mappings for words in configuration pg:
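One plausible mapping is sketched below. It assumes the synonym dictionary is named pg_dict (a hypothetical name) and that only the word-like token types should be routed through the custom dictionaries. english_stem goes last because a Snowball dictionary recognizes everything.

```sql
-- Dictionaries are consulted left to right until one recognizes the token.
ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH pg_dict, english_ispell, english_stem;
```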

We choose not to index or search some token types that the built-in configuration does handle:
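Which token types to drop is an application decision. As an illustration only, the sketch below assumes that email addresses, URLs, and floating-point numbers are not worth indexing:

```sql
-- Token types without a mapping are neither indexed nor searched.
ALTER TEXT SEARCH CONFIGURATION pg
    DROP MAPPING FOR email, url, url_path, sfloat, float;
```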

Now we can test our configuration:
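A quick check can be sketched with ts_debug (Section 12.8.1). The sample sentence is arbitrary, and the output depends on the dictionary files installed, so none is shown here:

```sql
-- Show, for each token, which dictionary recognized it and what it produced.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('public.pg',
              'PostgreSQL is now undergoing beta testing of the next version of our software.');
```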

The next step is to set the session to use the new configuration, which was created in the public schema:
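A sketch of that step, using the default_text_search_config parameter and the schema-qualified name public.pg:

```sql
-- Applies to the current session only; put the setting in postgresql.conf
-- to make it the server-wide default.
SET default_text_search_config = 'public.pg';

-- Confirm the setting.
SHOW default_text_search_config;
```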

12.8. Testing and Debugging

The behavior of a custom text search configuration can easily become confusing. The functions described in this section are useful for testing text search objects. You can test a complete configuration, or test parsers and dictionaries separately.

12.8.1. Configuration Testing

The function ts_debug allows easy testing of a text search configuration.

ts_debug([ config regconfig, ] document text,
         OUT alias text,
         OUT description text,
         OUT token text,
         OUT dictionaries regdictionary[],
         OUT dictionary regdictionary,
         OUT lexemes text[])
         returns setof record

ts_debug displays information about every token of document as produced by the parser and processed by the configured dictionaries. It uses the configuration specified by config, or default_text_search_config if that argument is omitted.

ts_debug returns one row for each token identified in the text by the parser. The columns returned are

alias text: short name of the token type
description text: description of the token type
token text: text of the token
dictionaries regdictionary[]: the dictionaries selected by the configuration for this token type
dictionary regdictionary: the dictionary that recognized the token, or NULL if none did
lexemes text[]: the lexeme(s) produced by the dictionary that recognized the token, or NULL if none did; an empty array ({}) means it was recognized as a stop word

Here is a simple example:

SELECT * FROM ts_debug('english','a fat  cat sat on a mat - it ate a fat rats');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes 
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              | 
 asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
 blank     | Space symbols   |       | {}             |              | 
 asciiword | Word, all ASCII | cat   | {english_stem} | english_stem | {cat}
 blank     | Space symbols   |       | {}             |              | 
 asciiword | Word, all ASCII | sat   | {english_stem} | english_stem | {sat}
 blank     | Space symbols   |       | {}             |              | 
 asciiword | Word, all ASCII | on    | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              | 
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              | 
 asciiword | Word, all ASCII | mat   | {english_stem} | english_stem | {mat}
 blank     | Space symbols   |       | {}             |              | 
 blank     | Space symbols   | -     | {}             |              | 
 asciiword | Word, all ASCII | it    | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              | 
 asciiword | Word, all ASCII | ate   | {english_stem} | english_stem | {ate}
 blank     | Space symbols   |       | {}             |              | 
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              | 
 asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
 blank     | Space symbols   |       | {}             |              | 
 asciiword | Word, all ASCII | rats  | {english_stem} | english_stem | {rat}

For a more extensive demonstration, we first create a public.english configuration and Ispell dictionary for the English language:

CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);

ALTER TEXT SEARCH CONFIGURATION public.english
   ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;

SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
   alias   |   description   |    token    |         dictionaries          |   dictionary   |   lexemes   
-----------+-----------------+-------------+-------------------------------+----------------+-------------
 asciiword | Word, all ASCII | The         | {english_ispell,english_stem} | english_ispell | {}
 blank     | Space symbols   |             | {}                            |                | 
 asciiword | Word, all ASCII | Brightest   | {english_ispell,english_stem} | english_ispell | {bright}
 blank     | Space symbols   |             | {}                            |                | 
 asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem   | {supernova}

In this example, the word Brightest was recognized by the parser as an ASCII word (alias asciiword). For this token type the dictionary list is english_ispell and english_stem. The word was recognized by english_ispell, which reduced it to the noun bright. The word supernovaes is unknown to the english_ispell dictionary so it was passed to the next dictionary, and, fortunately, was recognized (in fact, english_stem is a Snowball dictionary which recognizes everything; that is why it was placed at the end of the dictionary list).

The word The was recognized by the english_ispell dictionary as a stop word (Section 12.6.1) and will not be indexed. The spaces are discarded too, since the configuration provides no dictionaries at all for them.

You can reduce the width of the output by explicitly specifying which columns you want to see:

SELECT alias, token, dictionary, lexemes
FROM ts_debug('public.english','The Brightest supernovaes');
   alias   |    token    |   dictionary   |   lexemes   
-----------+-------------+----------------+-------------
 asciiword | The         | english_ispell | {}
 blank     |             |                | 
 asciiword | Brightest   | english_ispell | {bright}
 blank     |             |                | 
 asciiword | supernovaes | english_stem   | {supernova}

12.8.2. Parser Testing

The following functions allow direct testing of a text search parser.

ts_parse(parser_name text, document text,
         OUT tokid integer, OUT token text) returns setof record

ts_parse(parser_oid oid, document text,
         OUT tokid integer, OUT token text) returns setof record

ts_parse parses the given document and returns a series of records, one for each token produced by parsing. Each record includes a tokid showing the assigned token type and a token which is the text of the token. For example:

SELECT * FROM ts_parse('default', '123 - a number');
 tokid | token
-------+--------
    22 | 123
    12 |
    12 | -
     1 | a
    12 |
     1 | number

ts_token_type(parser_name text, OUT tokid integer,
              OUT alias text, OUT description text) returns setof record

ts_token_type(parser_oid oid, OUT tokid integer,
              OUT alias text, OUT description text) returns setof record

ts_token_type returns a table which describes each type of token the specified parser can recognize. For each token type, the table gives the integer tokid that the parser uses to label a token of that type, the alias that names the token type in configuration commands, and a short description. For example:

SELECT * FROM ts_token_type('default');
 tokid |      alias      |               description                
-------+-----------------+------------------------------------------
     1 | asciiword       | Word, all ASCII
     2 | word            | Word, all letters
     3 | numword         | Word, letters and digits
     4 | email           | Email address
     5 | url             | URL
     6 | host            | Host
     7 | sfloat          | Scientific notation
     8 | version         | Version number
     9 | hword_numpart   | Hyphenated word part, letters and digits
    10 | hword_part      | Hyphenated word part, all letters
    11 | hword_asciipart | Hyphenated word part, all ASCII
    12 | blank           | Space symbols
    13 | tag             | XML tag
    14 | protocol        | Protocol head
    15 | numhword        | Hyphenated word, letters and digits
    16 | asciihword      | Hyphenated word, all ASCII
    17 | hword           | Hyphenated word, all letters
    18 | url_path        | URL path
    19 | file            | File or path name
    20 | float           | Decimal notation
    21 | int             | Signed integer
    22 | uint            | Unsigned integer
    23 | entity          | XML entity
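
Because ts_parse reports only numeric token types, it can be convenient to join its output against ts_token_type so that each token appears next to its alias. A minimal sketch, assuming the built-in default parser:

```sql
-- Label each token produced by the default parser with its token-type alias.
-- Row order is not guaranteed without an ORDER BY.
SELECT p.tokid, t.alias, p.token
FROM ts_parse('default', '123 - a number') AS p
JOIN ts_token_type('default') AS t USING (tokid);
```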

12.8.3. Dictionary Testing

The ts_lexize function facilitates dictionary testing.

ts_lexize(dict regdictionary, token text) returns text[]

ts_lexize returns an array of lexemes if the input token is known to the dictionary, or an empty array if the token is known to the dictionary but it is a stop word, or NULL if it is an unknown word.

Examples:

SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}

Note

The ts_lexize function expects a single token, not text. Here is a case where this can be confusing:

SELECT ts_lexize('thesaurus_astro','supernovae stars') is null;
 ?column?
----------
 t

The thesaurus dictionary thesaurus_astro does know the phrase supernovae stars, but ts_lexize fails since it does not parse the input text but treats it as a single token. Use plainto_tsquery or to_tsvector to test thesaurus dictionaries, for example:

SELECT plainto_tsquery('supernovae stars');
 plainto_tsquery
-----------------
 'sn'

12.9. GIN and GiST Index Types

There are two kinds of indexes that can be used to speed up full text searches. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable.

CREATE INDEX name ON table USING GIN (column);

Creates a GIN (Generalized Inverted Index)-based index. The column must be of tsvector type.

CREATE INDEX name ON table USING GIST (column);

Creates a GiST (Generalized Search Tree)-based index. The column can be of tsvector or tsquery type.

GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows that are lacking additional words. GIN indexes store only the words (lexemes) of tsvector values, and not their weight labels. Thus a table row recheck is needed when using a query that involves weights.

A GiST index is lossy, meaning that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches. (PostgreSQL does this automatically when needed.) GiST indexes are lossy because each document is represented in the index by a fixed-length signature. The signature is generated by hashing each word into a single bit in an n-bit string, with all these bits OR-ed together to produce an n-bit document signature. When two words hash to the same bit position there will be a false match. If all words in the query have matches (real or false) then the table row must be retrieved to see if the match is correct.

Lossiness causes performance degradation due to unnecessary fetches of table records that turn out to be false matches. Since random access to table records is slow, this limits the usefulness of GiST indexes. The likelihood of false matches depends on several factors, in particular the number of unique words, so using dictionaries to reduce this number is recommended.
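
As a sketch of how such an index is typically created and used, the following assumes a hypothetical table documents with a tsvector column tsv. The table, column, and index names are illustrative only.

```sql
-- Hypothetical table holding the raw text and its precomputed tsvector.
CREATE TABLE documents (
    id   serial PRIMARY KEY,
    body text,
    tsv  tsvector
);

-- Preferred index type for text search.
CREATE INDEX documents_tsv_gin ON documents USING GIN (tsv);

-- Alternative: a lossy GiST index on the same column.
-- CREATE INDEX documents_tsv_gist ON documents USING GIST (tsv);

-- A query of this shape can use either index.
SELECT id
FROM documents
WHERE tsv @@ to_tsquery('english', 'fat & rat');
```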

12.10. psql Support

Information about text search configuration objects can be obtained in psql using a set of commands:

\dF{d,p,t}[+] [PATTERN]

An optional + produces more details.

The optional parameter PATTERN can be the name of a text search object, optionally schema-qualified. If PATTERN is omitted then information about all visible objects will be displayed. PATTERN can be a regular expression and can provide separate patterns for the schema and object names. The following examples illustrate this:

=> \dF *fulltext*
       List of text search configurations
 Schema |  Name        | Description
--------+--------------+-------------
 public | fulltext_cfg |

=> \dF *.fulltext*
       List of text search configurations
 Schema   |  Name        | Description
----------+--------------+-------------
 fulltext | fulltext_cfg |
 public   | fulltext_cfg |

The available commands are:

\dF[+] [PATTERN]

List text search configurations (add + for more detail).

=> \dF russian
            List of text search configurations
   Schema   |  Name   |            Description             
------------+---------+------------------------------------
 pg_catalog | russian | configuration for russian language

=> \dF+ russian
Text search configuration "pg_catalog.russian"
Parser: "pg_catalog.default"
      Token      | Dictionaries 
-----------------+--------------
 asciihword      | english_stem
 asciiword       | english_stem
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | russian_stem
 hword_asciipart | english_stem
 hword_numpart   | simple
 hword_part      | russian_stem
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | russian_stem

\dFd[+] [PATTERN]

List text search dictionaries (add + for more detail).

=> \dFd
                            List of text search dictionaries
   Schema   |      Name       |                        Description                        
------------+-----------------+-----------------------------------------------------------
 pg_catalog | danish_stem     | snowball stemmer for danish language
 pg_catalog | dutch_stem      | snowball stemmer for dutch language
 pg_catalog | english_stem    | snowball stemmer for english language
 pg_catalog | finnish_stem    | snowball stemmer for finnish language
 pg_catalog | french_stem     | snowball stemmer for french language
 pg_catalog | german_stem     | snowball stemmer for german language
 pg_catalog | hungarian_stem  | snowball stemmer for hungarian language
 pg_catalog | italian_stem    | snowball stemmer for italian language
 pg_catalog | norwegian_stem  | snowball stemmer for norwegian language
 pg_catalog | portuguese_stem | snowball stemmer for portuguese language
 pg_catalog | romanian_stem   | snowball stemmer for romanian language
 pg_catalog | russian_stem    | snowball stemmer for russian language
 pg_catalog | simple          | simple dictionary: just lower case and check for stopword
 pg_catalog | spanish_stem    | snowball stemmer for spanish language
 pg_catalog | swedish_stem    | snowball stemmer for swedish language
 pg_catalog | turkish_stem    | snowball stemmer for turkish language

\dFp[+] [PATTERN]

List text search parsers (add + for more detail).

=> \dFp
        List of text search parsers
   Schema   |  Name   |     Description     
------------+---------+---------------------
 pg_catalog | default | default word parser

=> \dFp+
    Text search parser "pg_catalog.default"
     Method      |    Function    | Description 
-----------------+----------------+-------------
 Start parse     | prsd_start     | 
 Get next token  | prsd_nexttoken | 
 End parse       | prsd_end       | 
 Get headline    | prsd_headline  | 
 Get token types | prsd_lextype   | 

        Token types for parser "pg_catalog.default"
   Token name    |               Description                
-----------------+------------------------------------------
 asciihword      | Hyphenated word, all ASCII
 asciiword       | Word, all ASCII
 blank           | Space symbols
 email           | Email address
 entity          | XML entity
 file            | File or path name
 float           | Decimal notation
 host            | Host
 hword           | Hyphenated word, all letters
 hword_asciipart | Hyphenated word part, all ASCII
 hword_numpart   | Hyphenated word part, letters and digits
 hword_part      | Hyphenated word part, all letters
 int             | Signed integer
 numhword        | Hyphenated word, letters and digits
 numword         | Word, letters and digits
 protocol        | Protocol head
 sfloat          | Scientific notation
 tag             | XML tag
 uint            | Unsigned integer
 url             | URL
 url_path        | URL path
 version         | Version number
 word            | Word, all letters
(23 rows)

\dFt[+] [PATTERN]

List text search templates (add + for more detail).

=> \dFt
                           List of text search templates
   Schema   |   Name    |                        Description                        
------------+-----------+-----------------------------------------------------------
 pg_catalog | ispell    | ispell dictionary
 pg_catalog | simple    | simple dictionary: just lower case and check for stopword
 pg_catalog | snowball  | snowball stemmer
 pg_catalog | synonym   | synonym dictionary: replace word by its synonym
 pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution

12.11. Limitations

The current limitations of PostgreSQL's text search features are:

The length of each lexeme must be less than 2K bytes
The length of a tsvector (lexemes + positions) must be less than 1 megabyte
The number of lexemes must be less than 2^64
Position values in tsvector must be greater than 0 and no more than 16,383
The match distance in a <N> (FOLLOWED BY) tsquery operator cannot be more than 16,384
No more than 256 positions per lexeme
The number of nodes (lexemes + operators) in a tsquery must be less than 32,768

For comparison, the PostgreSQL 8.1 documentation contained 10,441 unique words, a total of 335,420 words, and the most frequent word "postgresql" was mentioned 6,127 times in 655 documents.

Another example: the PostgreSQL mailing list archives contained 910,989 unique words with 57,491,343 lexemes in 461,020 messages.

12.4. Additional Features

This section describes additional functions and operators that are useful in connection with text search.

12.4.1. Manipulating Documents

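tsvector || tsvector

The tsvector concatenation operator returns a vector that combines the lexemes and positional information of the two vectors given as arguments.

setweight(vector tsvector, weight "char") returns tsvector

setweight returns a copy of the input vector in which every position has been labeled with the given weight, either A, B, C, or D. Note that weight labels apply to positions, not lexemes. If the input vector has been stripped of positions then setweight does nothing.

length(vector tsvector) returns integer

Returns the number of lexemes stored in the vector.

strip(vector tsvector) returns tsvector

Returns a vector that lists the same lexemes as the given vector, but lacks any position or weight information.

A full list of tsvector-related functions is available in Table 9.41.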
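
The functions in this subsection can be tried together on a literal document. The sketch below assumes the built-in english configuration; the sample sentence and column aliases are arbitrary.

```sql
WITH doc AS (
    SELECT to_tsvector('english', 'a fat cat sat on a mat') AS v
)
SELECT length(v)          AS lexeme_count,   -- number of distinct lexemes kept
       setweight(v, 'A')  AS labeled,        -- every position labeled A
       strip(v)           AS stripped,       -- lexemes only, no positions
       -- setweight has nothing to label once positions are gone
       setweight(strip(v), 'A') = strip(v) AS setweight_is_noop
FROM doc;
```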

12.4.2. Manipulating Queries

tsquery && tsquery

Returns the AND-combination of the two given queries.

tsquery || tsquery

Returns the OR-combination of the two given queries.

!! tsquery

Returns the negation (NOT) of the given query.

tsquery <-> tsquery

Returns a query that searches for a match to the first given query immediately followed by a match to the second given query, using the <-> (FOLLOWED BY) tsquery operator. For example:

SELECT to_tsquery('fat') <-> to_tsquery('cat | rat');
             ?column?
-----------------------------------
 'fat' <-> 'cat' | 'fat' <-> 'rat'

tsquery_phrase(query1 tsquery, query2 tsquery [, distance integer ]) returns tsquery

SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10);
  tsquery_phrase
------------------
 'fat' <10> 'cat'

numnode(query tsquery) returns integer

Returns the number of nodes (lexemes plus operators) in a tsquery. This function is useful to determine if the query is meaningful (returns > 0), or contains only stop words (returns 0). Examples:

SELECT numnode(plainto_tsquery('the any'));
NOTICE:  query contains only stopword(s) or doesn't contain lexeme(s), ignored
 numnode
---------
       0

SELECT numnode('foo & bar'::tsquery);
 numnode
---------
       3

querytree(query tsquery) returns text

SELECT querytree(to_tsquery('!defined'));
 querytree
-----------

12.4.2.1. Query Rewriting

ts_rewrite (query tsquery, target tsquery, substitute tsquery) returns tsquery

This form of ts_rewrite simply applies a single rewrite rule: target is replaced by substitute wherever it appears in query. For example:

SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
 ts_rewrite
------------
 'b' & 'c'

ts_rewrite (query tsquery, select text) returns tsquery

CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
INSERT INTO aliases VALUES('a', 'c');

SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases');
 ts_rewrite
------------
 'b' & 'c'

Note that when multiple rewrite rules are applied in this way, the order of application can be important; so in practice you will want the source query to ORDER BY some ordering key.

Let's consider a real-life astronomical example. We'll expand query supernovae using table-driven rewriting rules:

CREATE TABLE aliases (t tsquery primary key, s tsquery);
INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
           ts_rewrite            
---------------------------------
 'crab' & ( 'supernova' | 'sn' )

We can change the rewriting rules just by updating the table:

UPDATE aliases
SET s = to_tsquery('supernovae|sn & !nebulae')
WHERE t = to_tsquery('supernovae');

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
                 ts_rewrite                  
---------------------------------------------
 'crab' & ( 'supernova' | 'sn' & !'nebula' )

SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t,s FROM aliases WHERE ''a & b''::tsquery @> t');
 ts_rewrite
------------
 'b' & 'c'

12.4.3. Triggers for Automatic Updates

tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ... ])
tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_column_name [, ... ])

These trigger functions automatically compute a tsvector column from one or more textual columns, under the control of parameters specified in the CREATE TRIGGER command. An example of their use is:

CREATE TABLE messages (
    title       text,
    body        text,
    tsv         tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);

INSERT INTO messages VALUES('title here', 'the body text is here');

SELECT * FROM messages;
   title    |         body          |            tsv             
------------+-----------------------+----------------------------
 title here | the body text is here | 'bodi':4 'text':5 'titl':1

SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body');
   title    |         body          
------------+-----------------------
 title here | the body text is here

Having created this trigger, any change in title or body will automatically be reflected in tsv, without the application having to worry about it.

CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
begin
  new.tsv :=
     setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
     setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
  return new;
end
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
    ON messages FOR EACH ROW EXECUTE PROCEDURE messages_trigger();

12.4.4. Gathering Document Statistics

The function ts_stat is useful for checking your configuration and for finding stop-word candidates.

ts_stat(sqlquery text, [ weights text, ]
        OUT word text, OUT ndoc integer,
        OUT nentry integer) returns setof record

word text: the value of a lexeme
ndoc integer: number of documents (tsvectors) the word occurred in
nentry integer: total number of occurrences of the word

If weights is supplied, only occurrences having one of those weights are counted.

For example, to find the ten most frequent words in a document collection:

SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;

The same, but counting only word occurrences with weight A or B:

SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;

12.3. Controlling Text Search

12.3.1. Parsing Documents

PostgreSQL provides the function to_tsvector for converting a document to the tsvector data type.

to_tsvector([ config regconfig, ] document text) returns tsvector

SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4

In the example above we see that the resulting tsvector does not contain the words a, on, or it, the word rats became rat, and the punctuation sign - was ignored.

Because to_tsvector(NULL) will return NULL, it is recommended to use coalesce whenever a field might be null. Here is the recommended method for creating a tsvector from a structured document:

UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');

12.3.2. Parsing Queries

to_tsquery([ config regconfig, ] querytext text) returns tsquery

SELECT to_tsquery('english', 'The & Fat & Rats');
  to_tsquery   
---------------
 'fat' & 'rat'

As in basic tsquery input, weight(s) can be attached to each lexeme to restrict it to match only tsvector lexemes of those weight(s). For example:

SELECT to_tsquery('english', 'Fat | Rats:AB');
    to_tsquery    
------------------
 'fat' | 'rat':AB

Also, * can be attached to a lexeme to specify prefix matching:

SELECT to_tsquery('supern:*A & star:A*B');
        to_tsquery        
--------------------------
 'supern':*A & 'star':*AB

Such a lexeme will match any word in a tsvector that begins with the given string.
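
To see the effect of the prefix label, the same document can be matched with and without it. This is a sketch that assumes the built-in english configuration; the sample text is arbitrary.

```sql
-- Contrast a prefix query with the same term used as an ordinary lexeme.
SELECT to_tsvector('english', 'supernovae stars')
           @@ to_tsquery('english', 'supern:*') AS with_prefix_label,
       to_tsvector('english', 'supernovae stars')
           @@ to_tsquery('english', 'supern')   AS without_prefix_label;
```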

SELECT to_tsquery('''supernovae stars'' & !crab');
  to_tsquery
---------------
 'sn' & !'crab'

Without quotes, to_tsquery will generate a syntax error for tokens that are not separated by an AND, OR, or FOLLOWED BY operator.

plainto_tsquery([ config regconfig, ] querytext text) returns tsquery

Example:

SELECT plainto_tsquery('english', 'The Fat Rats');
 plainto_tsquery 
-----------------
 'fat' & 'rat'

Note that plainto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input:

SELECT plainto_tsquery('english', 'The Fat & Rats:C');
   plainto_tsquery   
---------------------
 'fat' & 'rat' & 'c'

Here, all the input punctuation was discarded as being space symbols.

phraseto_tsquery([ config regconfig, ] querytext text) returns tsquery

Example:

SELECT phraseto_tsquery('english', 'The Fat Rats');
 phraseto_tsquery
------------------
 'fat' <-> 'rat'

Like plainto_tsquery, the phraseto_tsquery function will not recognize tsquery operators, weight labels, or prefix-match labels in its input:

SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
      phraseto_tsquery
-----------------------------
 'fat' <-> 'rat' <-> 'c'

12.3.3. Ranking Search Results

The two ranking functions currently available are:

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

Ranks vectors based on the frequency of their matching lexemes.

ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

Computes the cover density ranking for the given document vector and query.

For both functions, the optional weights argument specifies how heavily to weigh each category of word, in the order:

{D-weight, C-weight, B-weight, A-weight}

If no weights are provided, then these defaults are used:

{0.1, 0.2, 0.4, 1.0}

Both functions also take an integer normalization option that specifies whether and how a document's length should affect its rank. The option is a bit mask with the following values:

0 (the default) ignores the document length
1 divides the rank by 1 + the logarithm of the document length
2 divides the rank by the document length
4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
8 divides the rank by the number of unique words in document
16 divides the rank by 1 + the logarithm of the number of unique words in document
32 divides the rank by itself + 1

If more than one flag bit is specified, the transformations are applied in the order listed.
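
The weights array and the normalization flags can be tried on a literal document. The sketch below assumes the built-in english configuration; the sample text, the choice of flags, and the column aliases are illustrative, and flag values are combined with the integer bitwise OR operator.

```sql
WITH sample AS (
    SELECT setweight(to_tsvector('english', 'Neutrinos in the Sun'), 'A') ||
           setweight(to_tsvector('english', 'The Sun emits neutrinos all the time'), 'D') AS v,
           to_tsquery('english', 'neutrino & sun') AS q
)
SELECT ts_rank(v, q)                                    AS default_rank,
       -- same call with the default weights written out: {D, C, B, A}
       ts_rank('{0.1, 0.2, 0.4, 1.0}'::float4[], v, q)  AS explicit_weights,
       -- flags 2 and 8 combined: document length, then unique word count
       ts_rank(v, q, 2|8)                               AS length_normalized,
       -- flag 32 maps the rank to rank/(rank+1)
       ts_rank_cd(v, q, 32)                             AS scaled_rank
FROM sample;
```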

Here is an example that selects only the ten highest-ranked matches:

SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218

This is the same example using normalized ranking:

SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE  query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |        rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
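
The relationship between the two result sets can be checked by hand: flag 32 maps a raw rank r to r / (r + 1). The sketch below applies that formula to three raw ranks from the first listing, using exact numeric arithmetic rather than the float4 arithmetic of ts_rank_cd.

```sql
-- r / (r + 1) always lies between 0 and 1 for a positive rank r.
SELECT r AS raw_rank,
       round(r / (r + 1), 4) AS normalized
FROM (VALUES (3.1), (2.4), (0.818218)) AS t(r);
```

The results (0.7561, 0.7059, 0.4500) agree with the corresponding normalized ranks above to four decimal places.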

12.3.4. Highlighting Results

ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text

If an options string is specified it must consist of a comma-separated list of one or more option=value pairs. The available options are:

StartSel, StopSel: the strings with which to delimit query words appearing in the document, to distinguish them from other excerpted words. You must double-quote these strings if they contain spaces or commas.
MaxWords, MinWords: these numbers determine the longest and shortest headlines to output.
ShortWord: words of this length or less will be dropped at the start and end of a headline. The default value of three eliminates common English articles.
HighlightAll: Boolean flag; if true the whole document will be used as the headline, ignoring the preceding three parameters.
MaxFragments: maximum number of text excerpts or fragments to display. The default value of zero selects a non-fragment-oriented headline generation method. A value greater than zero selects fragment-based headline generation. This method finds text fragments with as many query words as possible and stretches those fragments around the query words. As a result query words are close to the middle of each fragment and have words on each side. Each fragment will be of at most MaxWords, and words of length ShortWord or less are dropped at the start and end of each fragment. If not all query words are found in the document, then a single fragment of the first MinWords in the document will be displayed.
FragmentDelimiter: when more than one fragment is displayed, the fragments will be separated by this string.

Any unspecified options receive these defaults:

StartSel=<b>, StopSel=</b>,
MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE,
MaxFragments=0, FragmentDelimiter=" ... "

For example:

SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query & similarity'));
                        ts_headline
------------------------------------------------------------
 containing given <b>query</b> terms
 and return them in order of their <b>similarity</b> to the
 <b>query</b>.

SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query & similarity'),
  'StartSel = <, StopSel = >');
                      ts_headline
-------------------------------------------------------
 containing given <query> terms
 and return them in order of their <similarity> to the
 <query>.
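
Fragment-based generation can be tried on the same document by setting MaxFragments above zero. This is only a sketch: the option values are arbitrary choices, and the exact fragments returned depend on the document and the query.

```sql
-- MaxFragments > 0 switches to fragment-based headline generation;
-- FragmentDelimiter is double-quoted because it contains spaces.
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('english', 'query & similarity'),
  'MaxFragments=2, MaxWords=8, MinWords=4, FragmentDelimiter=" ... "');
```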

ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care.
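
One common way to limit that cost is to rank and trim the result set in a subquery and call ts_headline only on the rows that survive. The sketch below reuses the apod table from the ranking examples; the body column holding the original document text is a hypothetical name, since the table definition is not shown here.

```sql
-- Rank and limit first, then build headlines only for the ten rows kept.
-- "body" is an assumed column containing the original document text.
SELECT title,
       ts_headline('english', body, query) AS headline,
       rank
FROM (
    SELECT title, body, query, ts_rank_cd(textsearch, query) AS rank
    FROM apod, to_tsquery('english', 'neutrino | (dark & matter)') AS query
    WHERE query @@ textsearch
    ORDER BY rank DESC
    LIMIT 10
) AS top_matches
ORDER BY rank DESC;
```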

12.6. Dictionaries

Some examples of normalization:

Linguistic - Ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries remove word endings
URL locations can be canonicalized to make equivalent URLs match.
Color names can be replaced by their hexadecimal values, e.g., red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF
If indexing numbers, we can remove some fractional digits to reduce the range of possible numbers, so for example 3.14159265359, 3.1415926, 3.14 will be the same after normalization if only two digits are kept after the decimal point.

A dictionary is a program that accepts a token as input and returns:

an array of lexemes if the input token is known to the dictionary (notice that one token can produce more than one lexeme)
a single lexeme with the TSL_FILTER flag set, to replace the original token with a new token to be passed to subsequent dictionaries (a dictionary that does this is called a filtering dictionary)
an empty array if the dictionary knows the token, but it is a stop word
NULL if the dictionary does not recognize the input token

ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
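
To see which dictionary in such a list actually handled a token, ts_debug reports the dictionaries consulted and the one that recognized the token. The sketch below uses the built-in english configuration as a stand-in, because astro_en and its dictionaries are not defined in this section; the sample text is arbitrary.

```sql
-- "dictionaries" lists the dictionaries mapped to the token type;
-- "dictionary" names the one that recognized the token;
-- "lexemes" is {} when the token was recognized as a stop word.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'The brightest supernovae');
```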

12.6.1. Stop Words

SELECT to_tsvector('english','in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6

The missing positions 1,2,4 are because of stop words. Ranks calculated for documents with and without stop words are quite different:

SELECT ts_rank_cd (to_tsvector('english','in the list of stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
       0.05

SELECT ts_rank_cd (to_tsvector('english','list stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
        0.1
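
Whether a particular dictionary treats a word as a stop word can be checked directly with ts_lexize. A minimal sketch, assuming the built-in english_stem dictionary with its default english stop word list:

```sql
-- A stop word is known to the dictionary but yields an empty array.
SELECT ts_lexize('english_stem', 'the');

-- An ordinary word yields an array containing its stem.
SELECT ts_lexize('english_stem', 'stop');
```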

12.6.2. Simple Dictionary

Here is an example of a dictionary definition using the simple template:

CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);

Now we can test our dictionary:

SELECT ts_lexize('public.simple_dict','YeS');
 ts_lexize
-----------
 {yes}

SELECT ts_lexize('public.simple_dict','The');
 ts_lexize
-----------
 {}

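With Accept set to false, the dictionary returns NULL, rather than the lower-cased word, for words not found in the stop word file:

ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );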

SELECT ts_lexize('public.simple_dict','YeS');
 ts_lexize
-----------


SELECT ts_lexize('public.simple_dict','The');
 ts_lexize
-----------
 {}
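
A simple dictionary that returns NULL for unknown words can sit in front of other dictionaries, since unrecognized tokens are handed on to the next dictionary in the list. The sketch below illustrates this under stated assumptions: public.stop_filter and public.demo_cfg are hypothetical names, and the built-in simple dictionary is used as the fallback.

```sql
-- A stop-word-only filter: Accept = false makes it return NULL for
-- every word that is not in the stop word file.
CREATE TEXT SEARCH DICTIONARY public.stop_filter (
    TEMPLATE  = pg_catalog.simple,
    STOPWORDS = english,
    ACCEPT    = false
);

CREATE TEXT SEARCH CONFIGURATION public.demo_cfg (COPY = pg_catalog.simple);

-- Stop words end at stop_filter; all other tokens fall through to
-- the built-in simple dictionary, which lower-cases them.
ALTER TEXT SEARCH CONFIGURATION public.demo_cfg
    ALTER MAPPING FOR asciiword
    WITH public.stop_filter, pg_catalog.simple;

SELECT to_tsvector('public.demo_cfg', 'The YeS answers');
```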

12.6.3. Synonym Dictionary

SELECT * FROM ts_debug('english', 'Paris');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes 
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}

CREATE TEXT SEARCH DICTIONARY my_synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword
    WITH my_synonym, english_stem;

SELECT * FROM ts_debug('english', 'Paris');
   alias   |   description   | token |       dictionaries        | dictionary | lexemes 
-----------+-----------------+-------+---------------------------+------------+---------
 asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}

Suppose the synonym file synonym_sample.syn (in $SHAREDIR/tsearch_data) contains these entries:

postgres        pgsql
postgresql      pgsql
postgre pgsql
gogle   googl
indices index*

Then we will get these results:

mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
mydb=# SELECT ts_lexize('syn','indices');
 ts_lexize
-----------
 {index}
(1 row)

mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
mydb=# SELECT to_tsvector('tst','indices');
 to_tsvector
-------------
 'index':1
(1 row)

mydb=# SELECT to_tsquery('tst','indices');
 to_tsquery
------------
 'index':*
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector;
            tsvector             
---------------------------------
 'are' 'indexes' 'useful' 'very'
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
 ?column?
----------
 t
(1 row)
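
The other entries of the sample file can be probed the same way. This sketch assumes the syn dictionary, the tst configuration, and the sample synonym entries from the session above; the word database is an arbitrary example of a word that is not listed.

```sql
-- Entries without a trailing '*' are plain replacements.
SELECT ts_lexize('syn', 'postgres');
SELECT ts_lexize('syn', 'gogle');

-- A word that is not in the synonym file is not recognized (NULL),
-- so it would be passed to the next dictionary, if there is one.
SELECT ts_lexize('syn', 'database');

-- In a query, only the starred entry becomes a prefix match.
SELECT to_tsquery('tst', 'postgres | indices');
```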

12.6.4. Thesaurus Dictionary

# this is a comment
sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...

where the colon (:) symbol acts as a delimiter between a phrase and its replacement.

The thesaurus dictionary chooses the longest match if there are multiple phrases matching the input, and ties are broken by using the last definition.

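A ? in the sample phrase marks a position where any stop word can appear. For example, assuming a and the are stop words according to the subdictionary, the entry

? one ? two : swsw

matches a one the two and the one a two; both would be replaced by swsw.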

12.6.4.1. Thesaurus Configuration

To define a new thesaurus dictionary, use the thesaurus template. For example:

CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
    DictFile = mythesaurus,
    Dictionary = pg_catalog.english_stem
);

Here:

thesaurus_simple is the new dictionary's name
mythesaurus is the base name of the thesaurus configuration file. (Its full name will be $SHAREDIR/tsearch_data/mythesaurus.ths, where $SHAREDIR means the installation shared-data directory.)
pg_catalog.english_stem is the subdictionary (here, a Snowball English stemmer) to use for thesaurus normalization. Notice that the subdictionary will have its own configuration (for example, stop words), which is not shown here.

Now it is possible to bind the thesaurus dictionary thesaurus_simple to the desired token types in a configuration, for example:

ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_simple;

12.6.4.2. Thesaurus Example

Consider a simple astronomical thesaurus thesaurus_astro, which contains some astronomical word combinations:

supernovae stars : sn
crab nebulae : crab

Below we create a dictionary and bind some token types to an astronomical thesaurus and English stemmer:

CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
    TEMPLATE = thesaurus,
    DictFile = thesaurus_astro,
    Dictionary = english_stem
);

ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_astro, english_stem;

SELECT plainto_tsquery('supernova star');
 plainto_tsquery
-----------------
 'sn'

SELECT to_tsvector('supernova star');
 to_tsvector
-------------
 'sn':1

In principle, one can use to_tsquery if the argument is quoted:

SELECT to_tsquery('''supernova star''');
 to_tsquery
------------
 'sn'

Notice that supernova star matches supernovae stars in thesaurus_astro because we specified the english_stem stemmer in the thesaurus definition. The stemmer removed the e and s.

To index the original phrase as well as the substitute, just include it in the right-hand part of the definition:

supernovae stars : sn supernovae stars

SELECT plainto_tsquery('supernova star');
       plainto_tsquery
-----------------------------
 'sn' & 'supernova' & 'star'
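
When testing a thesaurus, note that ts_lexize passes its argument to the dictionary as a single token, so it cannot match a multi-word phrase. The sketch below contrasts it with the parsing functions; it assumes the thesaurus_astro dictionary and the russian configuration mapping defined above.

```sql
-- The two-word string is treated as one token, so no thesaurus phrase
-- matches and the result is expected to be NULL.
SELECT ts_lexize('thesaurus_astro', 'supernovae stars') IS NULL;

-- These functions split the text into tokens first, so the phrase can match.
SELECT plainto_tsquery('russian', 'supernovae stars');
SELECT to_tsvector('russian', 'supernovae stars');
```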

12.6.5. Ispell Dictionary

To create an Ispell dictionary perform these steps:

download dictionary configuration files. OpenOffice extension files have the .oxt extension. It is necessary to extract the .aff and .dic files and change their extensions to .affix and .dict. Some dictionary files must also be converted to the UTF-8 encoding, for example, for a Norwegian language dictionary:
```
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
```
copy files to the $SHAREDIR/tsearch_data directory

load files into PostgreSQL with the following command:

CREATE TEXT SEARCH DICTIONARY english_hunspell (
    TEMPLATE = ispell,
    DictFile = en_us,
    AffFile = en_us,
    Stopwords = english);

Ispell dictionaries usually recognize a limited set of words, so they should be followed by another broader dictionary; for example, a Snowball dictionary, which recognizes everything.
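
A minimal sketch of such a chain follows, assuming the english_hunspell dictionary created above (which needs the en_us files to be installed). The configuration name public.english_hs is hypothetical.

```sql
-- Start from a copy of the built-in configuration.
CREATE TEXT SEARCH CONFIGURATION public.english_hs (COPY = pg_catalog.english);

-- Words unknown to the Ispell dictionary fall through to the Snowball
-- stemmer, which accepts every word.
ALTER TEXT SEARCH CONFIGURATION public.english_hs
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_hunspell, english_stem;
```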

The .affix file of Ispell has the following structure:

prefixes
flag *A:
    .           >   RE      # As in enter > reenter
suffixes
flag T:
    E           >   ST      # As in late > latest
    [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
    [AEIOU]Y    >   EST     # As in gray > grayest
    [^EY]       >   EST     # As in small > smallest

And the .dict file has the following structure:

lapse/ADGRS
lard/DGRS
large/PRTY
lark/MRS

The format of the .dict file is:

basic_form/affix_class_name

In the .affix file every affix flag is described in the following format:

condition > [-stripping_letters,] adding_affix
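
To make the rule format concrete, the sketch below emulates the four suffix rules of flag T from the sample .affix file with ordinary regular expressions. It is an illustration only and does not reflect how the Ispell template is implemented; the sample words are the ones named in the file's comments.

```sql
-- Each WHEN branch is one rule: the condition is tested at the end of
-- the word, optional letters are stripped, and the affix is appended.
SELECT word,
       CASE
           WHEN word ~ 'E$'         THEN word || 'ST'
           WHEN word ~ '[^AEIOU]Y$' THEN regexp_replace(word, 'Y$', 'IEST')
           WHEN word ~ '[AEIOU]Y$'  THEN word || 'EST'
           WHEN word ~ '[^EY]$'     THEN word || 'EST'
       END AS with_flag_t
FROM (VALUES ('LATE'), ('DIRTY'), ('GRAY'), ('SMALL')) AS t(word);
```

This yields LATEST, DIRTIEST, GRAYEST, and SMALLEST, matching the comments in the sample file.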

compoundwords  controlled z

Here are some examples for the Norwegian language:

SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
   {over,buljong,terning,pakk,mester,assistent}
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
   {sjokoladefabrikk,sjokolade,fabrikk}

MySpell format is a subset of Hunspell. The .affix file of Hunspell has the following structure:

PFX A Y 1
PFX A   0     re         .
SFX T N 4
SFX T   0     st         e
SFX T   y     iest       [^aeiou]y
SFX T   0     est        [aeiou]y
SFX T   0     est        [^ey]

The first line of an affix class is the header. The fields of each affix rule are listed after the header:

parameter name (PFX or SFX)
flag (name of the affix class)
stripping characters from beginning (at prefix) or end (at suffix) of the word
adding affix
condition that has a format similar to the format of regular expressions.

The .dict file looks like the .dict file of Ispell:

larder/M
lardy/RT
large/RSPMYT
largehearted

Note

MySpell does not support compound words. Hunspell has sophisticated support for compound words. At present, PostgreSQL implements only the basic compound word operations of Hunspell.

12.6.6. Snowball Dictionary

CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);

The stopword file format is the same as already explained.
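
The effect of the StopWords parameter can be seen by defining a second Snowball dictionary without it. This is a sketch; public.english_nostop is a hypothetical name.

```sql
-- Omitting StopWords means no word is filtered out as a stop word.
CREATE TEXT SEARCH DICTIONARY public.english_nostop (
    TEMPLATE = snowball,
    Language = english
);

-- The built-in dictionary discards the stop word (empty array) ...
SELECT ts_lexize('english_stem', 'the');

-- ... while the variant without a stop word list stems it and keeps it.
SELECT ts_lexize('public.english_nostop', 'the');
SELECT ts_lexize('public.english_nostop', 'stars');
```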