When you search for a term, for example BRCA2, we break it up into smaller tokens using an Edge Ngram Tokeniser. We configure this tokeniser to treat letters and digits as token characters and to produce grams of a chosen length. For example, the trigrams (grams of length three) of carcinoma are the tokens car, arc, rci, cin, ino, nom and oma. For the current implementation, we have chosen a four-gram tokeniser.
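The tokenisation described above can be sketched in a few lines of Python. This is an illustration only, not the code that runs in our index: the function names `ngrams` and `edge_ngrams` are hypothetical, lowercasing is an assumption, and the minimum and maximum gram lengths passed to `edge_ngrams` are made-up values. The first function produces every gram of a fixed length, as in the carcinoma example; the second produces only grams anchored at the start of the term, which is what an edge n-gram tokeniser typically emits.

```python
def ngrams(term, n):
    """Return every contiguous substring of length n, in order."""
    term = term.lower()  # assumption: tokens are lowercased before indexing
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def edge_ngrams(term, min_gram, max_gram):
    """Return the prefixes of term with length between min_gram and max_gram."""
    term = term.lower()
    longest = min(max_gram, len(term))
    return [term[:k] for k in range(min_gram, longest + 1)]


print(ngrams("carcinoma", 3))         # trigrams of the example term
print(ngrams("BRCA2", 4))             # four-grams of a gene symbol
print(edge_ngrams("alzheimer", 4, 6)) # prefixes only (illustrative lengths)
```

Because prefixes such as "alzhe" are stored as tokens in their own right, a partially typed term can match an indexed name directly.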
The search is performed across multiple fields in our ElasticSearch index, such as the full name of a gene, the approved name of a gene, the approved gene symbol, the description of the disease, and the EFO synonyms for a disease. Each of these fields is weighted differently: we boost the terms found under "full name" and "approved name" and down-weight the terms found in the "description" or "EFO synonyms" fields. We then apply a function based on the number of associations and finally return the results of the search in our user interface. You can then choose to refine the results by "targets" or "diseases".
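As a rough sketch of how such a weighted search could be expressed, the snippet below builds an ElasticSearch request body as a plain Python dictionary. It uses the standard `multi_match` query with per-field boosts (the `field^boost` syntax) wrapped in a `function_score` query with a `field_value_factor`. Everything specific here is an assumption made for illustration: the field names (`full_name`, `approved_name`, `approved_symbol`, `description`, `efo_synonyms`, `association_count`), the boost values, the `log1p` modifier and the helper name `build_search_query` are hypothetical and do not reflect our actual index mapping or scoring.

```python
import json


def build_search_query(term):
    """Build an illustrative request body; all field names and boosts are assumed."""
    return {
        "query": {
            "function_score": {
                "query": {
                    "multi_match": {
                        "query": term,
                        "fields": [
                            "full_name^3",       # boosted (hypothetical weight)
                            "approved_name^3",   # boosted (hypothetical weight)
                            "approved_symbol",   # default weight
                            "description^0.5",   # down-weighted
                            "efo_synonyms^0.5",  # down-weighted
                        ],
                    }
                },
                # Adjust the text score by a function of the number of associations.
                "field_value_factor": {
                    "field": "association_count",  # hypothetical field name
                    "modifier": "log1p",
                    "missing": 1,
                },
            }
        }
    }


print(json.dumps(build_search_query("BRCA2"), indent=2))
```

A boost above 1 raises the contribution of matches in that field and a boost below 1 lowers it, so a hit in a name field outranks the same hit in a description.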
This optimisation allows us to return results that were missed before the four-gram tokeniser was implemented. In the past, a search for Charcot-Marie-Tooth disease, for example, would have returned no results. Now, when you search for Charcot-Marie-Tooth and refine the results by disease, you get the expected results.
Another advantage of the tokeniser is that you can now search for incomplete terms or names, e.g. “alzhe”, and get Alzheimer's disease and other terms related to this disease (sub-types of Alzheimer's or targets), whereas in the past no results would have been returned at all.
This is a much welcome improvement to our search and autocomplete functionality. However, a side effect of this implementation is that we now return a higher number of terms than we used to.
If you want to discuss this further, please email us.