One of the main concerns for search users is making sure they use relevant terms. You know what it’s like: do you search for a spade, a shovel, or gardening equipment? There’s always a lingering feeling that you might have missed some information. Then there’s the age-old problem of American vs. British spellings – disc versus disk, labour versus labor – not to mention another language altogether. Also, thanks to the digital age, language is changing more rapidly now than ever before. New words are being added to online dictionaries all the time. There are also ‘folksonomies’ – internal terms and languages specific to a type of data, organisation or profession.
So if you’re creating a search application, you need to be able to help your search users along the path. MarkLogic can help in a variety of ways. Out of the box you get word stemming in a range of international languages, so a search for ‘run’ would also return ‘running’ and ‘runs’. This is automatic, and users expect it without even thinking of it as a requirement. There’s also the concept of fuzzy searches, which allow for common misspellings in case the document you’re searching over has a typo (or your search user does, for that matter!). Lexicon suggestions are another technique, often used with autocomplete text boxes, where words are suggested that start with what the user is typing. These typically come from words already in the corpus of documents, i.e. the lexicon. MarkLogic supports both of these out of the box too, with just a little configuration required.
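To make the stemming point a little more concrete, here is a minimal sketch that compares a stemmed and an unstemmed word query for ‘run’. It assumes stemmed searches are enabled on the database, and it simply counts matches across every document, which is fine for a quick experiment but not something you would do on a large corpus.

```xquery
xquery version "1.0-ml";

(: Assumption: stemmed searches are enabled on the target database. :)

(: "stemmed" asks for matches on any word sharing a stem with 'run'. :)
let $stemmed := cts:word-query("run", "stemmed")
(: "unstemmed" restricts matches to the exact word. :)
let $exact := cts:word-query("run", "unstemmed")
return (
  (: First number: documents matching the stemmed query. :)
  fn:count(cts:search(fn:doc(), $stemmed)),
  (: Second number: documents matching the exact word only. :)
  fn:count(cts:search(fn:doc(), $exact))
)
```

If your documents contain ‘running’ or ‘runs’ but not ‘run’ itself, you would expect the first count to be higher than the second.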
What I want to spend a little time on today, though, are synonyms, or using a thesaurus within a search. Let’s say you’re searching over a tonne of tweets or Facebook posts for a word like ‘riot’. You want to see if anyone is currently planning some serious disorder in your nation’s capital. (Totally fictitious scenario… I don’t watch the news… honest). You may want synonyms to come back such as anarchy, brawl, disturbance or even phrases and yoof speak like ‘smashing it up’. Happily, there is a quick way to load this information and instantly access it in search.
Firstly, you need a thesaurus. A great place to start is thesaurus.com. You then put this into a format that MarkLogic can easily search over: an XML thesaurus document made up of entries, each holding a term and its synonyms.
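As an illustration of that format, here is a minimal sketch that stores a one-entry thesaurus using thsr:insert from MarkLogic’s thesaurus module. The URI /config/thesaurus.xml is a made-up placeholder, and the synonyms are just the ones mentioned above rather than anything taken from a real thesaurus.

```xquery
xquery version "1.0-ml";

import module namespace thsr = "http://marklogic.com/xdmp/thesaurus"
  at "/MarkLogic/thesaurus.xqy";

(: Hypothetical URI: pick whatever location suits your application. :)
(: Each entry holds one term plus any number of synonyms for it. :)
thsr:insert("/config/thesaurus.xml",
  <thesaurus xmlns="http://marklogic.com/xdmp/thesaurus">
    <entry>
      <term>riot</term>
      <synonym><term>anarchy</term></synonym>
      <synonym><term>brawl</term></synonym>
      <synonym><term>disturbance</term></synonym>
      <synonym><term>smashing it up</term></synonym>
    </entry>
  </thesaurus>)
```

A real thesaurus would have many entries; if you already have the XML in a file on the server, thsr:load is the alternative to building the element inline.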
Now you alter your search:search options to configure a constraint, so that you can search either for just the word ‘riot’ (the default) or for synonyms of riot with something like ‘thes:riot’. Here’s the key part of an example constraint:
<parse apply="parse" ns="http://marklogic.com/xdmp/thesaurus" at="/app/models/lib-thesaurus.xqy" />
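For context, here is a rough sketch of how that parse element might sit inside a complete options node and be passed to search:search. The constraint name ‘thes’ comes from the ‘thes:riot’ example; the facet="false" attribute and the surrounding structure are my assumptions about a typical setup rather than the exact options used here.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: The constraint name becomes the prefix users type, as in thes:riot. :)
(: facet="false" is an assumption: the constraint is used for parsing only. :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="thes">
      <custom facet="false">
        <parse apply="parse"
               ns="http://marklogic.com/xdmp/thesaurus"
               at="/app/models/lib-thesaurus.xqy"/>
      </custom>
    </constraint>
  </options>
return (
  (: Plain search: just the word itself. :)
  search:search("riot", $options),
  (: Constrained search: routed through the custom parse function. :)
  search:search("thes:riot", $options)
)
```

Comparing the two responses side by side is a quick way to check that the constraint is being picked up.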
You’ll notice this is a ‘custom constraint’. You can create these for any complex processing. The function ‘parse’ just needs to return a cts:query element. I’m not using this as a facet, but you can also add start-facet and finish-facet functions for that. (See search:search in the MarkLogic XQuery API documentation). Here are the contents of my custom thesaurus module. Note I’m referring to my specific thesaurus. You could set up a different keyword for each thesaurus, e.g. one for dictionary synonyms, one for ‘drug speak’ or ‘youth speak’, and one for ‘organisation terms’.
xquery version "1.0-ml";

module namespace facet = "http://marklogic.com/xdmp/thesaurus";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";
import module namespace thsr = "http://marklogic.com/xdmp/thesaurus"
  at "/MarkLogic/thesaurus.xqy";

declare function facet:parse(
  $constraint-qtext as xs:string,
  $right as schema-element(cts:query))
  let $term := fn:string($right//cts:text)
  let $th := thsr:expand(
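The listing above breaks off partway through the thsr:expand call. As a rough sketch only, a complete module along the lines described in this post might look like the following. The prolog is copied from the listing; the thesaurus URI /config/thesaurus.xml is a hypothetical placeholder, and the arguments to thsr:expand and thsr:lookup are my reconstruction from the description (synonyms weighted at 0.25, combined with the original word in an or-query), not the original code.

```xquery
xquery version "1.0-ml";

module namespace facet = "http://marklogic.com/xdmp/thesaurus";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";
import module namespace thsr = "http://marklogic.com/xdmp/thesaurus"
  at "/MarkLogic/thesaurus.xqy";

declare function facet:parse(
  $constraint-qtext as xs:string,
  $right as schema-element(cts:query))
as schema-element(cts:query)
{
  (: Pull the word the user typed after the constraint prefix. :)
  let $term := fn:string($right//cts:text)
  (: Hypothetical URI: wherever your thesaurus document is stored. :)
  let $entries := thsr:lookup("/config/thesaurus.xml", $term)
  (: Expand the word with its synonyms, asking for a weight of 0.25. :)
  let $th := thsr:expand(cts:word-query($term), $entries, 0.25, (), ())
  return
    (: Or together the plain word query and the expanded query, then :)
    (: wrap and unwrap so the result is an element, not a query object. :)
    <root>{ cts:or-query((cts:word-query($term), $th)) }</root>/*
};
```

The plain cts:word-query in the or-query keeps its default weight of 1, so documents containing the word itself should still score above those that only contain a synonym.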
Once this is all loaded you’ll get results back. Note I’m weighting synonyms lower than the word on its own. This means a user just has to search for the word with the most precise meaning, and MarkLogic will weight the results according to their request.
UPDATE: I should probably explain why I’m returning a cts or-query. I wanted to ensure my results still included the original weighting for the word itself rather than the 0.25 weighting, hence the or-query. Also, the </root>/* is a quick fix from my colleague Dave Cassel’s blog – the function’s return type has to be schema-element(cts:query). If you just returned the or-query, it would be a cts:query value rather than an XML element, so it wouldn’t match that type. Not sure why this doesn’t auto cast – perhaps a good article for a future blog…