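While working on my graduate project (text mining with SQL Server 2012 Semantic Search), I ran into a problem that I hope someone here can help me with.
The question is about stoplists and stopwords in SQL Server 2012. I have set up a proof of concept in which I use the new Semantic Search feature to index documents and list the statistically relevant key phrases. Because I don't want certain words to be indexed, and therefore to show up as statistically relevant key phrases, I am creating a stoplist to exclude them.
Stoplist and stopwords for English (LCID 1033):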
/* Create stoplist and add words */
CREATE FULLTEXT STOPLIST [naam van de stoplist];
ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'beeten' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'centimeter' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'info' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'ruud' LANGUAGE 'English';
GO
Creating the full-text catalog and the full-text index, using the custom stoplist and statistical semantics:
/* Full-Text catalog */
CREATE FULLTEXT CATALOG [ft] WITH ACCENT_SENSITIVITY = ON AS DEFAULT;
GO
/* Full-Text Index */
CREATE FULLTEXT INDEX ON [dbo].[Documents]
( file_stream Language 1033 STATISTICAL_SEMANTICS )
KEY INDEX DocumentsFt
WITH STOPLIST = [naam van de stoplist];
GO
I have tried everything I can think of to check whether I missed something:
/*Select all words in the stoplist, with some debug information*/
SELECT sys.fulltext_stoplists.stoplist_id AS [Stoplist id]
, sys.fulltext_stoplists.name AS [Stoplist]
, sys.database_principals.name AS [Owner]
, sys.fulltext_languages.lcid AS [LCID]
, sys.fulltext_languages.name AS [Taal]
, sys.fulltext_stopwords.stopword AS [Stopwoord]
FROM sys.fulltext_languages
INNER JOIN sys.fulltext_stopwords
ON sys.fulltext_stopwords.language_id = sys.fulltext_languages.lcid
INNER JOIN sys.fulltext_stoplists
ON sys.fulltext_stopwords.stoplist_id = sys.fulltext_stoplists.stoplist_id
INNER JOIN sys.database_principals ON sys.database_principals.principal_id = sys.fulltext_stoplists.principal_id
WHERE sys.fulltext_stoplists.name = 'naam van de stoplist';
/* List with all Full-Text Indexes (with statistical_semantics) */
SELECT sys.fulltext_catalogs.name [Full-Text catalog]
, sys.indexes.name AS [Index]
, sys.indexes.type_desc AS [Index type]
, sys.fulltext_indexes.is_enabled AS [Index in use]
, sys.fulltext_stoplists.name AS [Stoplist]
, sys.tables.name AS [Table]
, sys.columns.name AS [Column]
, sys.fulltext_index_columns.language_id AS [LCID]
, sys.fulltext_languages.name AS [Language]
, sys.fulltext_index_columns.statistical_semantics [Semantic]
FROM sys.fulltext_catalogs
INNER JOIN sys.fulltext_indexes
ON sys.fulltext_catalogs.fulltext_catalog_id = sys.fulltext_indexes.fulltext_catalog_id
INNER JOIN sys.fulltext_index_columns
ON sys.fulltext_indexes.object_id = sys.fulltext_index_columns.object_id
INNER JOIN sys.indexes
ON sys.fulltext_indexes.object_id = sys.indexes.object_id
AND sys.fulltext_indexes.unique_index_id = sys.indexes.index_id
INNER JOIN sys.index_columns
ON sys.indexes.object_id = sys.index_columns.object_id
AND sys.indexes.index_id = sys.index_columns.index_id
INNER JOIN sys.columns
ON sys.index_columns.object_id = sys.columns.object_id
AND sys.index_columns.column_id = sys.columns.column_id
INNER JOIN sys.tables
ON sys.fulltext_indexes.object_id = sys.tables.object_id
INNER JOIN sys.fulltext_languages
ON sys.fulltext_index_columns.language_id = sys.fulltext_languages.lcid
LEFT JOIN sys.fulltext_stoplists
ON sys.fulltext_indexes.stoplist_id = sys.fulltext_stoplists.stoplist_id
WHERE sys.fulltext_index_columns.statistical_semantics = 1
ORDER BY sys.fulltext_catalogs.name
,sys.indexes.name
,sys.index_columns.key_ordinal;
/* Rebuild catalog */
ALTER FULLTEXT CATALOG [ft] REBUILD;
GO
/* Check status of the catalog rebuild */
/* 0 = Idle.
1 = Full population is in progress.
2 = Incremental population is in progress.
3 = Propagation of tracked changes is in progress.
4 = Background update index is in progress, such as automatic change tracking.
5 = Full-text indexing is throttled or paused.
*/
SELECT FULLTEXTCATALOGPROPERTY('ft', 'PopulateStatus') AS Status;
GO
/* Repopulate Full-Text Index */
ALTER FULLTEXT INDEX ON dbo.Documents START UPDATE POPULATION;
GO
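All of the commands above indicate that the setup is correct.
Yet when I look at the indexed terms, I still see words from the stoplist, such as 'beeten'.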
SELECT *
FROM sys.dm_fts_index_keywords(DB_ID('SQLServerArticles'), OBJECT_ID('Documents'))
WHERE display_term = 'beeten';
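Since the real goal is to keep these words out of the key phrases, a related check is to query the semantic key phrase index directly. The following is only a sketch: it assumes it is run in the database that contains dbo.Documents, that file_stream is the semantically indexed column (as in the index definition above), and that population has finished.

```sql
/* Sketch: do any of the stopwords show up as semantic key phrases? */
SELECT kp.document_key,
       kp.keyphrase,
       kp.score
FROM SEMANTICKEYPHRASETABLE(dbo.Documents, file_stream) AS kp
WHERE kp.keyphrase IN (N'beeten', N'centimeter', N'info', N'ruud')
ORDER BY kp.score DESC;
```

If this returns rows, the stopwords are reaching the semantic index as well as the full-text keyword index.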
I even checked whether the full-text parser itself handles the stoplist correctly, using the following statement.
SELECT special_term, display_term
FROM sys.dm_fts_parser
(' "testing for fruit and nuts centimeter, any type of Beeten" ', 1033, 8, 0)
This statement returns the following result:
Exact Match testing
Exact Match for
Exact Match fruit
Exact Match and
Exact Match nuts
Noise Word centimeter
Exact Match any
Exact Match type
Exact Match of
Noise Word beeten
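This result shows that the parser treats the word 'beeten' as a noise word. So it should be skipped during indexing, shouldn't it? What am I missing?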
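The parser check above passes the stoplist id as the literal 8. To rule out a mismatch between that number and the stoplist actually attached to the index, the id could be looked up by name first. This is a minimal sketch under the assumption that the stoplist name is exactly the one used in the CREATE FULLTEXT STOPLIST statement; the variable name @stoplist_id is made up for illustration.

```sql
/* Sketch: resolve the stoplist id by name instead of hard-coding it */
DECLARE @stoplist_id int =
(
    SELECT stoplist_id
    FROM sys.fulltext_stoplists
    WHERE name = N'naam van de stoplist'
);

SELECT special_term, display_term
FROM sys.dm_fts_parser
     (N'"testing for fruit and nuts centimeter, any type of Beeten"',
      1033, @stoplist_id, 0);
```

If @stoplist_id comes back NULL, the name does not match any stoplist in the current database, and the parser falls back to not using a stoplist at all.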
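To restate the goal: because I don't want certain words to be indexed, and therefore to appear as statistically relevant key phrases, I am creating a stoplist to exclude them.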
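If your system locale is anything other than English, there is a known bug (Microsoft Connect item 753596): for documents stored in a FileTable, the stopwords of the system locale are used instead of the stopwords of the full-text index.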
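To see whether this situation might apply, it can help to compare the languages involved. The queries below are only a diagnostic sketch. They show the server-wide default full-text language and the languages for which the custom stoplist holds stopwords. Note the assumption: the default full-text language is a SQL Server setting and is not necessarily the same as the operating system locale the bug report refers to, so that value still has to be checked on the machine itself.

```sql
/* Sketch: server-wide default full-text language (an LCID) */
SELECT name, value_in_use
FROM sys.configurations
WHERE name = N'default full-text language';

/* Sketch: languages for which the custom stoplist contains stopwords */
SELECT sw.language_id,
       sw.language,
       COUNT(*) AS stopword_count
FROM sys.fulltext_stopwords AS sw
INNER JOIN sys.fulltext_stoplists AS sl
    ON sl.stoplist_id = sw.stoplist_id
WHERE sl.name = N'naam van de stoplist'
GROUP BY sw.language_id, sw.language;
```

If the stoplist only holds entries for LCID 1033 while the machine runs under a different locale, that matches the conditions described in the Connect item.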