
Abstract: With the increasing use of entities in serving people's daily information needs, recognizing synonyms (different ways people refer to the same entity) has become a crucial task for many entity-leveraging applications. Previous works often take a "literal" view of the entity, i.e., its string name. In this work, we propose adopting a "structured" view of each entity by considering not only its string name, but also other important structured attributes. Unlike existing query log-based methods, we delve deeper to explore sub-queries, and exploit tailed synonyms and tailed web pages for harvesting more synonyms. A general, heterogeneous graph-based data model which encodes our problem insights is designed by capturing three key concepts (synonym candidate, web page and keyword) and different types of interactions between them. We cast the synonym discovery problem into a graph-based ranking problem and demonstrate the existence of a closed-form optimal solution for outputting entity synonym scores. Experiments on several real-life domains demonstrate the effectiveness of our proposed method.

WordNet is a huge lexical database that collects and orders English words into groups of synonyms. It can offer major improvements in relevancy, but it is not at all necessary for many use cases. Make sure you understand the trade-offs (discussed below) well before setting it up.

Let's say that you have an online store with a lot of products. You want users to be able to search for those products, but you want that search to be smart. For example, say that your user searches for "bowtie pasta." You may have a product called "Funky Farfalle" which is related to their search term but which would not be returned in the results because the title has "farfalle" instead of "bowtie pasta".

Solr has a mechanism for defining custom synonyms, through the SynonymFilterFactory. This lets search administrators define groups of related terms and even corrections to commonly misspelled terms. A typical synonyms.txt file lists each group of related terms on its own line. This is great for solving the proximate issue, but it can get extremely tedious to define all groups of related words in your index.

WordNet is essentially a text database which places English words into synsets (groups of synonyms), and can be considered something of a cross between a dictionary and a thesaurus. An entry in WordNet is a single line of text. The line for "kitty", for example, expresses that the word is a noun, and the first word in synset 102122298 (which includes other terms like "kitty-cat," "pussycat," and so on). The line also indicates that this is the fourth most common sense of "kitty" according to semantic concordance texts. You can read more about the structure and precise definitions of WordNet entries in the documentation.

WordNet has become extremely useful in text processing applications, including data storage and retrieval. Some use cases require features like synonym processing, for which a lexical grouping of tokens is invaluable. Relevancy tuning can be a deeply complex subject, and WordNet, especially when the complete file is used, has trade-offs, just like any other strategy.

Synonym expansion can be really tricky and can result in unexpected sorting, lower performance and more disk use. WordNet can introduce all of these issues with varying severity. When synonyms are expanded at index time, Solr uses WordNet to generate all tokens related to a given token, and writes everything out to disk. This has several consequences: slower indexing speed, higher load during indexing, and significantly more disk use. Larger index sizes often correspond to memory issues as well. If you ever want to change your synonym list, you'll need to reindex everything from scratch. And WordNet includes multi-term synonyms in its database, which can break phrase queries.

Expanding synonyms at query time resolves some of those issues, but introduces others. Namely, performing expansion and matching at query time adds overhead to your queries in terms of server load and latency. And it still doesn't really address the problem of multi-word synonyms. There are some really great examples of what this means in the Elasticsearch documentation (the same principles are true for Solr).

The takeaway is that WordNet is not a panacea for relevancy tuning, and it may introduce unexpected results unless you're doing a lot of preprocessing or additional configuration. TL;DR: Do not simply assume that chucking a massive synset collection at your index will make it faster with more relevant results.

How Do I Use the WordNet list with Websolr?

Websolr allows users to maintain their own synonyms list for each index via the dashboard.
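To make the synonyms.txt idea from this article a bit more concrete, here is a minimal Python sketch of how a Solr-style synonyms file can be read and applied. The sample rules (including the misspelling "farfale") and the helper names `parse_synonyms`, `build_lookup` and `expand` are made up for illustration. The sketch only matches a whole query string against a rule, which is much simpler than what Solr's analysis chain actually does.

```python
# Hypothetical sample in the Solr synonyms.txt style:
#   "a, b"   -> a and b are treated as equivalent
#   "a => b" -> a is rewritten to b
SAMPLE = """\
# equivalent terms
bowtie pasta, farfalle
# correct a common misspelling
farfale => farfalle
"""


def parse_synonyms(text):
    """Turn synonyms.txt-style text into a list of (inputs, outputs) pairs."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            left, right = line.split("=>", 1)
            inputs = [t.strip() for t in left.split(",") if t.strip()]
            outputs = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # A plain comma-separated group: every term maps to the whole group.
            inputs = [t.strip() for t in line.split(",") if t.strip()]
            outputs = list(inputs)
        rules.append((inputs, outputs))
    return rules


def build_lookup(rules):
    """Map each lowercased input term to the terms it should expand to."""
    lookup = {}
    for inputs, outputs in rules:
        for term in inputs:
            targets = lookup.setdefault(term.lower(), [])
            for out in outputs:
                if out not in targets:
                    targets.append(out)
    return lookup


def expand(query, lookup):
    """Return the expansion for a whole query, or the query itself if no rule matches."""
    cleaned = query.strip()
    return lookup.get(cleaned.lower(), [cleaned])


if __name__ == "__main__":
    lookup = build_lookup(parse_synonyms(SAMPLE))
    print(expand("Bowtie Pasta", lookup))
    print(expand("farfale", lookup))
```

With a rule like the first one in place, a search for "bowtie pasta" can also match a product titled "Funky Farfalle". Note that "bowtie pasta" is a two-word term, which is exactly the kind of entry that gets tricky once real tokenization and phrase queries are involved.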

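The WordNet entry for "kitty" described in this article can be read with a few lines of Python. This is a rough sketch that assumes the entries come from WordNet's Prolog-format synonym file, where each line has the shape `s(synset_id, word_number, 'word', pos, sense_number, tag_count).` The "kitty" line below is reconstructed from the description in the text, with the final tag count as a placeholder. The other two lines use made-up word and sense numbers, and the names `SynsetEntry`, `parse_s_line` and `synset_groups` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class SynsetEntry(NamedTuple):
    synset_id: int
    word_number: int   # position of the word inside its synset
    word: str
    pos: str           # part of speech, e.g. "n" for noun
    sense_number: int  # which sense of the word this is
    tag_count: int


# Matches lines such as: s(102122298,1,'kitty',n,4,0).
_S_LINE = re.compile(r"^s\((\d+),(\d+),'(.*)',([a-z]),(\d+),(\d+)\)\.$")

# Illustrative lines only; the numeric fields other than the synset ID,
# and "kitty" being word 1, noun, sense 4, are placeholders.
SAMPLE_LINES = [
    "s(102122298,1,'kitty',n,4,0).",
    "s(102122298,2,'kitty-cat',n,1,0).",
    "s(102122298,3,'pussycat',n,1,0).",
]


def parse_s_line(line: str) -> Optional[SynsetEntry]:
    """Parse one Prolog-style entry, returning None if it does not match."""
    match = _S_LINE.match(line.strip())
    if match is None:
        return None
    synset_id, word_number, word, pos, sense_number, tag_count = match.groups()
    return SynsetEntry(
        int(synset_id),
        int(word_number),
        word.replace("''", "'"),  # Prolog doubles embedded apostrophes
        pos,
        int(sense_number),
        int(tag_count),
    )


def synset_groups(lines):
    """Group words by synset ID, ordered by their position in the synset."""
    grouped = {}
    for line in lines:
        entry = parse_s_line(line)
        if entry is not None:
            grouped.setdefault(entry.synset_id, []).append(entry)
    return {
        synset_id: [e.word for e in sorted(entries, key=lambda e: e.word_number)]
        for synset_id, entries in grouped.items()
    }


if __name__ == "__main__":
    for synset_id, words in synset_groups(SAMPLE_LINES).items():
        # Each group could become one comma-separated synonyms line.
        print(synset_id, ", ".join(words))
```

Grouping by synset ID like this shows why loading the complete file can balloon an index: every word in a synset expands to every other word in it, and a word that appears in several synsets pulls in all of them.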