Text Analytics? Be Careful of Your Search Engine
To analyze text, you have to get at unstructured content in its entirety and then eliminate the unnecessary. The current mode of operation for the majority of vendors is to force unstructured content into a database field, where it is available for typical structured analysis. That won’t work, folks.
You could use linguistic techniques, but these tend to identify only a small proportion of the concepts found in unstructured text. Semantic networks can also be a stumbling block. For those of you who remember Convera – is there anyone? – remnants of its technology probably survive in today’s SharePoint search.
Anyway, Convera was attempting to solve an intractable problem. Its manually constructed semantic network attempted to define a synonym ring for every concept used in every vertical sector, translated into every language. Shareholders invested over $1 billion, an investment that was lost when Convera finally stopped trying to ‘boil the ocean’ in August 2007 and was sold to FAST for $23 million. The product was subsequently withdrawn from the market.
Concept Searching introduced compound term processing in 2003. It was then, and remains today, a breakthrough technology that identifies and weights multi-word concepts using purely statistical analysis. The technology understands the relationships between words but is independent of vocabulary, grammatical style, and language. It wasn’t until 2008 that Autonomy introduced IDOL 7, a multi-word concept identification technology that appeared comparable to Concept Searching’s.
What does this mean to end users who are searching, or to professionals trying to make sense of enormous amounts of content to solve a business problem? Well, it is a problem. For example, take the query “Donald Trump the President of the United States.” Many search engines would weight each word individually and return content containing any of the terms – “Donald,” “Trump,” “the,” “President,” “of,” “United,” “States” – which is clearly not what we want.
This happens because most search engines weight each query term independently. Documents that match all the keywords will be ranked significantly higher, but documents matching only a few common words still clutter the results. A linguistic technique can exploit the presence of noun phrases and proper nouns that have been added to the lexicon as multi-word concepts; these can then be used to boost the weighting of a genuinely relevant document relative to an irrelevant one, but only when the query text contains those proper nouns.
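The contrast above can be sketched in a few lines. This is a deliberately minimal illustration, not any vendor’s actual ranking algorithm: the documents, the phrase list, and the phrase boost are all invented for demonstration.

```python
# Why bag-of-words ranking goes wrong: each query term is scored
# independently, so a document full of common words like "the" and
# "of" still matches. A phrase-aware scorer that recognizes known
# multi-word concepts separates the two documents cleanly.

QUERY = "Donald Trump the President of the United States"

DOCS = {
    "doc_relevant": "Donald Trump was elected President of the United States.",
    "doc_noise":    "The history of the states of the union is long.",
}

def bag_of_words_score(query: str, doc: str) -> int:
    """Count how many individual query tokens appear in the document."""
    doc_terms = set(doc.lower().replace(".", "").split())
    return sum(1 for t in query.lower().split() if t in doc_terms)

def phrase_aware_score(query: str, doc: str,
                       phrases=("donald trump", "united states")) -> int:
    """Add an (arbitrary, illustrative) boost for known multi-word concepts."""
    score = bag_of_words_score(query, doc)
    for p in phrases:
        if p in doc.lower():
            score += 5  # phrase boost; the value is illustrative only
    return score

for name, doc in DOCS.items():
    print(name, bag_of_words_score(QUERY, doc), phrase_aware_score(QUERY, doc))
```

Under bag-of-words scoring, the noise document still earns points for “the,” “of,” and “states”; the phrase-aware scorer pulls the relevant document well clear of it.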
Sometimes we are looking for information about a particular topic, but the concept is nebulous and difficult to articulate precisely. The difficulties are compounded if there is uncertainty about the presence of documents, and the exercise is designed to gather information about the selected topic, or to prove its absence. All traditional search techniques struggle with this type of matching. If techniques that favor precision are used, such as Boolean expressions or phrase searching, then recall will suffer. If techniques that favor recall are used, such as stemming or broad keyword matching, then precision will suffer.
If you are using your search engine to identify highly granular content, be careful of your search engine, and take the time to understand the technology behind the interface.
Join us for our What You Don’t Know May Hurt You – Achieving Insight and Knowledge Discovery webinar, on Tuesday, November 14th. This session shows how text analytics and mining can boost the bottom line, through insight and knowledge discovery. Our guest speaker is Russ Stalters, information management strategist and former BP executive, who will explore real-life knowledge discovery scenarios, and discuss the significant return on investment achieved using text analytics.