Querying Full-Text Data in Microsoft SQL Server 2012
- 12/15/2012
- Before You Begin
- Lesson 1: Creating Full-Text Catalogs and Indexes
- Lesson 2: Using the CONTAINS and FREETEXT Predicates
- Lesson 3: Using the Full-Text and Semantic Search Table-Valued Functions
- Case Scenarios
- Suggested Practices
- Answers
Lesson 2: Using the CONTAINS and FREETEXT Predicates
SQL Server supports two very powerful predicates for limiting the result set of a query by using full-text indexes. These two predicates are the CONTAINS and FREETEXT predicates. Both of them support various term searching. Besides these two predicates, SQL Server supports two table-valued functions for full-text searches, and three table-valued functions for semantic searches. You learn about the two predicates in this lesson and about the five functions in the next lesson.
The CONTAINS Predicate
With the CONTAINS predicate, you can search for the following:
Words and phrases in text
Exact or fuzzy matches
Inflectional forms of a word
Text in which a search word is close to another search word
Synonyms of a searched word
A prefix of a word or a phrase only
You can also add your custom weight to words you are searching for. You use the CONTAINS predicate in the WHERE clause of your T-SQL statements.
For all details about this predicate, see the Books Online for SQL Server 2012 article “CONTAINS (Transact-SQL)” at http://msdn.microsoft.com/en-us/library/ms187787.aspx. Here are the most important forms of queries with the CONTAINS predicate in pseudo-code, where FTcolumn stands for a full-text indexed column and ‘SearchWord?’ stands for the word or phrase searched:
SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘SearchWord1’) This is the simplest form. You are searching for rows where the FTcolumn contains an exact match for ‘SearchWord1’. This is a simple term.
SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘SearchWord1 OR SearchWord2’) You are searching for rows where the FTcolumn contains an exact match for ‘SearchWord1’ or for the word ‘SearchWord2’. You can also use AND and AND NOT logical operators and change the order of evaluation of the operators in an expression with parentheses.
SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘“SearchWord1 SearchWord2”’) You are searching for rows where the FTcolumn contains an exact match for the phrase “SearchWord1 SearchWord2.”
SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘”SearchWord1*”’) You are searching for rows where the FTcolumn contains at least one word that starts with the letters ‘SearchWord1’. This is a prefix term.
SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘NEAR(SearchWord1, SearchWord2)’) You are searching for rows where the FTcolumn contains SearchWord1 and SearchWord2. This is the simplest custom proximity term. In this simplest version, it searches only for occurrences of both words, no matter what the distance and order of terms. The result is similar to a simple term where two words or phrases are connected with the logical AND operator.
SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘NEAR((SearchWord1, SearchWord2), distance)’) You are searching for rows where the FTcolumn contains SearchWord1 and SearchWord2. The order of the search words is not important; however, the distance is an integer that tells how many nonsearch terms can be maximally between the searched terms in order to qualify a row for the result set.
SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘NEAR((SearchWord1, SearchWord2), distance, flag)’) You are searching for rows where the FTcolumn contains SearchWord1 and SearchWord2. The two searched terms must be closer together than the distance. The flag can take values TRUE or FALSE; the default is FALSE. If the flag is set to TRUE, then the order of the searched terms is important; SearchWord1 must be in text before SearchWord2.
SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘FORMSOF(INFLECTIONAL, SearchWord1)’) This is the generation term format of the predicate. You are searching for the rows where the FTcolumn includes any of the inflectional form of the word SearchWord1.
SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘FORMSOF(THESAURUS, SearchWord1)’) This is again the generation term format of the predicate. You are searching for the rows where the FTcolumn includes either the word SearchWord1 or any of the synonyms for this word defined in the thesaurus file.
SELECT…FROM…WHERE CONTAINS(FTcolumn, ‘ISABOUT(SearchWord1 weight(w1), SearchWord2 weight(w2))’) This is a weighted term. Weights have influence on the rank of the documents returned. However, because the CONTAINS predicate does not rank the results, this form has no influence on it. The weighted form is useful for the CONTAINSTABLE function.
SELECT…FROM…WHERE CONTAINS(PROPERTY(FTcolumn, ‘PropertyName’), ‘SearchWord1’) This is a property search. You need to have documents with some known properties. In such a query, you are searching for rows with documents that have the property PropertyName that contain the value SearchWord1.
The FREETEXT Predicate
The FREETEXT predicate is less specific and thus returns more rows than the CONTAINS predicate. It searches for the values that match the meaning of a phrase and not just exact words. When you use the FREETEXT predicate, the engine performs word breaking of the search phrase, generates inflectional forms (does the stemming), and identifies a list of expansions or replacements for the words in the searched term with words from the thesaurus. The form is much simpler than the form of the CONTAINS predicate: SELECT…FROM…WHERE FREETEXT(FTcolumn, ‘SearchWord1 SearchWord2’). With this, you are searching for rows where the FTcolumn includes any of the inflectional forms and any of the defined synonyms of the words SearchWord1 and SearchWord2.
Practice: Using the CONTAINS and FREETEXT Predicates
After you create all components needed for a full-text search solution, it is time to start using the full-text search.
If you encounter a problem completing an exercise, you can install the completed projects from the companion content for this chapter and lesson.
EXERCISE 1 Use the CONTAINS Predicate
In this exercise, you use the CONTAINS predicate. In addition, you edit and use a thesaurus file.
If you closed SSMS, start it and connect to your SQL Server instance. Open a new query window by clicking the New Query button.
Connect to your TSQL2012 database.
Find all rows where the docexcerpt column of the dbo.Documents table includes the word “data”. Use the following query.
SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'data');
Find all rows where the docexcerpt column of the dbo.Documents table includes the word “data” or the word “index”. Use the following query.
SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'data OR index');
Find all rows where the docexcerpt column of the dbo.Documents table includes the word “data” and not the word “mining”. Use the following query.
SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'data AND NOT mining');
Find all rows where the docexcerpt column of the dbo.Documents table includes the word “data” or the words “fact” and “warehouse”. Use the following query.
SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'data OR (fact AND warehouse)');
Find all rows where the docexcerpt column of the dbo.Documents table includes the phrase “data warehouse”. Use the following query.
SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'"data warehouse"');
Find all rows where the docexcerpt column of the dbo.Documents table includes words that start with the prefix “add”. Use the following query.
SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'"add*"');
Find all rows where the docexcerpt column of the dbo.Documents table includes the word “problem” anywhere near the word “data”. Use the following query.
SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'NEAR(problem, data)');
Find all rows where the docexcerpt column of the dbo.Documents table includes the word “problem” anywhere near the word “data”. Try it with a query that searches for excerpts where the words are fewer than five and then with a query where the words are fewer than one nonsearch terms away. From the following two queries, the first one should return one row and the second one no rows.
SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'NEAR((problem, data),5)'); SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'NEAR((problem, data),1)');
Find all rows where the docexcerpt column of the dbo.Documents table includes the word “problem” anywhere near the word “data”. Try it with a query that searches for excerpts where the words are fewer than five nonsearch terms away. However, specify that the word “problem” must be before the word “data”. Use the following query.
SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'NEAR((problem, data),5, TRUE)');
Find all rows where the docexcerpt column of the dbo.Documents table includes the word “presentation”. Try with a query that searches for exact match and with a query that searches for any inflectional form of the word. From the following two queries, the first query should return no rows and the second query one row.
SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'presentation'); SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'FORMSOF(INFLECTIONAL, presentation)');
EXERCISE 2 Use Synonyms and FREETEXT
In this exercise, you edit and use a thesaurus file to add a synonym.
Use Notepad to edit the thesaurus file for the US English language. Add a synonym “necessity” for the word “need”. The file to edit is the tsenu.xml file, located in a default installation in the C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\FTData folder. If you didn’t use the default path for the installation or use a nondefault instance, then the path in a general form is SQL_Server_install_path\Microsoft SQL Server\MSSQL11.Instance_id\MSSQL\FTData. Clear the XML comments from the file.
After editing, the content of the file should be as follows.
<XML ID="Microsoft Search Thesaurus"> <thesaurus xmlns="x-schema:tsSchema.xml"> <diacritics_sensitive>0</diacritics_sensitive> <expansion> <sub>Internet Explorer</sub> <sub>IE</sub> <sub>IE5</sub> </expansion> <replacement> <pat>NT5</pat> <pat>W2K</pat> <sub>Windows 2000</sub> </replacement> <expansion> <sub>run</sub> <sub>jog</sub> </expansion> <expansion> <sub>need</sub> <sub>necessity</sub> </expansion> </thesaurus> </XML>
Load the thesaurus file for US English.
EXEC sys.sp_fulltext_load_thesaurus_file 1033;
Find all rows where the docexcerpt column of the dbo.Documents table includes the word “need” or its synonym. Try with a query that searches for exact match and with a query that searches for synonyms of the word. From the following two queries, the first query should return no rows and the second query one row.
SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'need'); SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(docexcerpt, N'FORMSOF(THESAURUS, need)');
Search for all rows from the dbo.Documents table where the document in the doccontent column contains a property called “Authors” with a value that includes the word “Dejan”. Use the following query.
SELECT id, title, docexcerpt FROM dbo.Documents WHERE CONTAINS(PROPERTY(doccontent,'Authors'), 'Dejan');
Finally, find all rows where the docexcerpt column contains any of the words “data”, “presentation”, or “need”. The words can be in any inflectional form. Search for synonyms as well. Use the following query.
SELECT id, title, doctype, docexcerpt FROM dbo.Documents WHERE FREETEXT(docexcerpt, N'data presentation need');
Lesson Summary
You can use the CONTAINS predicate for selective searches.
The FREETEXT predicate can be used for more general searches.
Lesson Review
Answer the following questions to test your knowledge of the information in this lesson. You can find the answers to these questions and explanations of why each answer choice is correct or incorrect in the Answers section at the end of this chapter.
Which of the following is not a part of the CONTAINS predicate?
FORMSOF
THESAURUS
NEAR
PROPERTY
TEMPORARY
Which form of the proximity term defines the distance and the order?
NEAR((SearchWord1, SearchWord2), 5, TRUE)
NEAR((SearchWord1, SearchWord2), CLOSE, ORDER)
NEAR((SearchWord1, SearchWord2), 5)
NEAR(SearchWord1, SearchWord2)
What can you search for with the CONTAINS predicate? (Choose all that apply.)
Inflectional forms of a word
Synonyms of a searched word
Translations of a word
Text in which a search word is close to another search word
A prefix of a word or a phrase only