conceptSearching - Retrieval just got smarter.

The technical team at Concept Searching is keen to support the technical needs of its resellers and customers. The following list of FAQs is intended to assist with first-level enquiries. If you have more specific questions, please email them to one of the following addresses and we will review them promptly.

info-uk@conceptsearching.com
info-usa@conceptsearching.com
info-africa@conceptsearching.com

Why do we need another IR system?
What is Bayesian Inference?
What is the Probabilistic Model?
Why is the Probabilistic Model superior to traditional free text systems?
What is Probabilistic Latent Semantic Indexing?
What is relevance feedback?
What is conceptSearching and how does it compare to simple keyword searching?
What is Shannon's Information Theory?
What is Language Stemming?
What is Dynamic Summarization?
What about Classification and Support for Taxonomies?
Can I call conceptSearching from an ASP/COM+ application?
What types of document can I store?
What languages are supported?
How scalable is conceptSearching?
Why is a SQL database required?

 

Why do we need another IR system?

No other vendor on the market today implements the latest ideas in IR theory, packages its products in a form suited to today's leading web server environments, and offers the system at a price sensible enough for it to be used in all information retrieval related applications. In addition, the leading IR vendors have a reputation for providing programming interfaces that are far from ideal and that give insufficient access to the internal statistical profiles, so application developers have limited ability to tune the system to meet specific end user requirements.

Web Services is a technology platform promoted by all of the major players in the software industry, including Microsoft, IBM, Sun, Intel, Hewlett-Packard, Oracle, BEA, Novell, and many more. Web Services are built on XML standards and offer interoperability between different platforms, especially between Microsoft's .NET platform and any of the J2EE implementations.

conceptSearching is a unique system in that it offers all of the following attributes:
  • Relevance Ranking based on The Probabilistic Model (Bayesian Inference)
  • Concept identification based on Shannon's Information Theory
  • Probabilistic Latent Semantic Indexing
  • Cross platform compatibility via Web Services
  • All Application Programming Interfaces (APIs) based on XML
  • Transparent access to system internals including the statistical profile of terms
  • High Precision and High Recall in both search applications and automatic document classification 


What is Bayesian Inference?

Thomas Bayes was an eighteenth century mathematician who devised a theory for conditional probability:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Conditional probability is the probability of some event given that some other event has already occurred. In the above equation the left-hand term P(A | B) is known as the posterior probability: the probability of event A occurring given that event B has occurred. It is equal to the probability of event B occurring given that event A has occurred, multiplied by the probability of event A occurring and divided by the probability of event B occurring.

The Probabilistic Model interprets Bayes' Theorem in an IR context where the probability that certain query terms are better differentiators between relevant and non-relevant documents than other query terms is evaluated given implicit or explicit relevance feedback.
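
As a rough illustration of the arithmetic in Bayes' Theorem, the sketch below computes a posterior probability in an IR setting. The function name `posterior` and all of the figures are assumptions chosen for illustration; they are not taken from conceptSearching.

```python
def posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be greater than zero")
    return p_b_given_a * p_a / p_b


# Illustrative figures only (assumptions, not measured data):
# A = "the document is relevant", B = "the document contains the query term".
p_relevant = 0.02             # P(A): 2% of the collection is relevant
p_term_given_relevant = 0.60  # P(B | A): 60% of relevant documents contain the term
p_term = 0.05                 # P(B): 5% of all documents contain the term

p_relevant_given_term = posterior(p_term_given_relevant, p_relevant, p_term)
print(round(p_relevant_given_term, 2))
```

With these figures, seeing the term raises the estimated probability of relevance from 2% to roughly 24%, which is the sense in which a term can be a good differentiator.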


 What is the Probabilistic Model?

The Probabilistic Model was pioneered at Cambridge University during the 1970s and 1980s. The model is an application of Bayes' Theorem and defines a system for weighting individual query terms and documents based on:

  • The frequency of terms across the document collection (wcf)
  • The frequency of terms within a given document (wdf)
  • Normalized document length (ndl)
  • Explicit or implicit feedback on document relevance

In 1976 Professor Stephen Robertson and Karen Sparck Jones devised a formula for computing term weights and document weights and subsequently performed extensive evaluations on relevance feedback techniques using standard document collections. In 1994 Robertson introduced an extended model that was no longer based on a binary independence model and this work has strongly influenced the design of conceptSearching.
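
The weighting formula itself is not given here. As a minimal sketch, the widely published Okapi BM25 ranking function, associated with Robertson and colleagues, combines the same three ingredients: collection frequency, within-document frequency and normalized document length. The function names, the constants k1 = 1.2 and b = 0.75, the smoothed logarithm and the toy documents are all assumptions for illustration, not conceptSearching internals.

```python
import math


def term_weight(n_docs: int, docs_with_term: int) -> float:
    """Collection-level weight: terms found in fewer documents score higher.

    The "1 +" inside the logarithm is a common smoothing that keeps the
    weight positive; it is an assumption of this sketch.
    """
    return math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_doc_len, k1=1.2, b=0.75):
    """Score one document (a list of terms) against a query."""
    ndl = len(doc_terms) / avg_doc_len  # normalized document length
    score = 0.0
    for term in set(query_terms):
        wdf = doc_terms.count(term)  # within-document frequency
        if wdf == 0:
            continue
        weight = term_weight(n_docs, doc_freqs.get(term, 0))
        score += weight * (wdf * (k1 + 1)) / (wdf + k1 * (1 - b + b * ndl))
    return score


# Toy collection (hypothetical data).
docs = {
    "d1": "dangerous dog attacks baby".split(),
    "d2": "dog show results and dog grooming tips".split(),
    "d3": "stock market report".split(),
}

# Number of documents containing each term.
doc_freqs = {}
for terms in docs.values():
    for t in set(terms):
        doc_freqs[t] = doc_freqs.get(t, 0) + 1

avg_len = sum(len(terms) for terms in docs.values()) / len(docs)
scores = {
    name: bm25_score(["dangerous", "dog"], terms, doc_freqs, len(docs), avg_len)
    for name, terms in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Document d1 ranks first because it contains both query terms, including the rarer one; d2 mentions "dog" twice but is longer and lacks "dangerous"; d3 matches nothing.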


Why is the Probabilistic Model superior to traditional free text systems?

Traditional free text systems are based on simple keywords and Boolean logic (primarily the AND, OR and NOT operators). Whilst this technique is very precise, it falls down when the number of documents retrieved is too large to examine exhaustively. In this case the ability to rank documents, with the most important ones at the top of the list, is of paramount importance. Over time the traditional systems have introduced various ways to rank results, but these are not based on a sophisticated model of term profiles across the collection of indexed documents and tend to rely too heavily on within-document frequency (wdf) analysis.

The statistical model of term frequency across the document collection is unique to the Probabilistic Model. This model not only allows initial relevance ranking to be more accurate but it also provides a mechanism for iterative searching based on relevance feedback.


What is Probabilistic Latent Semantic Indexing?

Probabilistic Latent Semantic Indexing (PLSI) is the ability to locate documents that are relevant to the user's query even if they do not contain any of the words in the user's query text. It is also about the ability to ignore documents that do contain words from the user's query but which are not relevant.

Probabilistic Latent Semantic Indexing (PLSI) is achieved by:
  • Relevance ranking the documents matched by the initial query
  • Extracting the distinguishing concepts from the most relevant documents
  • Expanding the query to include selected related concepts

Related concepts can be included explicitly (the user decides) or implicitly, where they are added automatically based on an understanding of the application area and/or user personalization.
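
The three steps above can be sketched as a simple query expansion routine. This is a minimal illustration under stated assumptions: documents are plain lists of terms, the initial ranking is supplied by the caller, and the scoring rule (favour terms concentrated in the top-ranked documents) is a stand-in, not the method conceptSearching uses. The function name `expand_query` and the sample data are hypothetical.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, all_docs, top_k=2, n_new_terms=2):
    """Add terms that are common in the top-ranked documents but rare elsewhere.

    ranked_docs must be drawn from all_docs, best match first.
    """
    top_docs = ranked_docs[:top_k]
    top_counts = Counter(t for doc in top_docs for t in set(doc))
    all_counts = Counter(t for doc in all_docs for t in set(doc))

    candidates = {}
    for term, in_top in top_counts.items():
        if term in query_terms:
            continue
        # Reward terms that appear in many top documents and few others.
        candidates[term] = in_top * in_top / all_counts[term]

    # Highest score first; ties broken alphabetically for a stable result.
    best = sorted(candidates, key=lambda t: (-candidates[t], t))[:n_new_terms]
    return list(query_terms) + best


docs = [
    ["portable", "computer", "laptop", "notebook"],
    ["portable", "computer", "notebook", "laptop", "battery"],
    ["laptop", "notebook", "toshiba"],
    ["garden", "furniture"],
]
query = ["portable", "computer"]
ranked = [docs[0], docs[1]]  # documents matched by the initial query

expanded = expand_query(query, ranked, docs)
print(expanded)
```

The third document contains none of the original query words, yet it matches the expanded query through "laptop" and "notebook".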

Imagine searching for "portable computer" and finding documents that were about "laptops", "the Toshiba Tecra" and "notebooks" but where some of the retrieved documents do not contain any words from the original query - that's Latent Semantic Indexing.


 What is relevance feedback?

Traditional IR systems provide a static mechanism to index documents and service retrieval requests. Relevance feedback describes dynamic mechanisms that allow retrievals to be tuned over time based on explicit or implicit feedback from the user(s). An example of explicit feedback would be where a user identifies individual documents that are relevant to their query. An example of implicit feedback would be where the system monitors the user's activity to see what documents they examine, how long they spend looking at individual documents, what documents they author, or perhaps a common pattern in their retrieval activity.

The Probabilistic Model allows this type of explicit or implicit feedback to be injected into the retrieval process so that the weightings applied are modified, or tuned, automatically to suit a particular user's requirements.
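
One published way to inject feedback into term weights is the Robertson/Sparck Jones relevance weight, sketched below. Whether conceptSearching uses exactly this form is an assumption; the function name `rsj_weight` and the figures are illustrative only.

```python
import math


def rsj_weight(N: int, n: int, R: int = 0, r: int = 0) -> float:
    """Robertson/Sparck Jones relevance weight for a single term.

    N: documents in the collection
    n: documents containing the term
    R: documents known to be relevant (from feedback)
    r: relevant documents containing the term
    The 0.5 terms are the usual smoothing to avoid division by zero.
    """
    relevant_odds = (r + 0.5) / (R - r + 0.5)
    non_relevant_odds = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(relevant_odds / non_relevant_odds)


# Hypothetical figures: the term occurs in 50 of 1,000 documents.
before = rsj_weight(N=1000, n=50)             # no feedback yet
after = rsj_weight(N=1000, n=50, R=10, r=8)   # 8 of 10 relevant documents contain it
print(round(before, 3), round(after, 3))
```

With no feedback the weight depends only on how rare the term is. Once the user has marked ten documents as relevant and eight of them contain the term, the weight rises, so the term counts for more in the next retrieval.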


What is conceptSearching and how does it compare to simple keyword searching?

A Probabilistic implementation that worked on the basis that words appear in documents independently of other words would provide a reasonable level of accuracy. However, if the implementation understands that the co-location of words is relevant and should form part of the weighting process, then a significant improvement in the relevance ranking can be achieved.

For example, consider the following query:

"dangerous dog attacks baby"

A human would interpret this phrase as being about a dog attacking an infant. However, a simple IR system that treats words as appearing independently of each other would conclude that any document containing the phrase:

"dangerous virus attacks baby dog"

would be 100% relevant to the above query on the basis that it contains all of the words. Most humans would disagree.

conceptSearching uses Shannon's Information Theory to compute the incremental value of compound terms based on an analysis of the probability of the joint occurrence.


 What is Shannon's Information Theory?

Claude Shannon, a scientist working at Bell Labs, published his Information Theory in 1948 and this had an immediate and lasting impact on data communication technology. Shannon demonstrated that the value, or entropy, of a piece of information is determined by its probability (the less probable an event, the more information its occurrence conveys), and the entropy of a joint event is given by:
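
For reference, the standard textbook definitions are shown below; that the page originally displayed this exact form is an assumption. They give the information carried by an event x with probability p(x), and the joint entropy of two variables X and Y:

```latex
I(x) = -\log_2 p(x), \qquad H(X, Y) = -\sum_{x}\sum_{y} p(x, y)\,\log_2 p(x, y)
```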


conceptSearching interprets this in an IR context to compute the incremental value of a two-word term over its singleton components. Higher order compound terms are evaluated using their lower order compound components.

It is no coincidence that the majority of compound terms are in fact proper nouns, noun phrases and verb phrases and it is these sentence fragments that convey the key concepts in most text.
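
One standard information-theoretic measure of how much a two-word term adds over its separate words is pointwise mutual information (PMI). The sketch below uses PMI as a stand-in; the exact calculation inside conceptSearching is not described here, and the function name `compound_gain` and the sample text are hypothetical.

```python
import math
from collections import Counter


def compound_gain(tokens, first, second):
    """PMI in bits of `first` immediately followed by `second`.

    Positive values mean the pair occurs together more often than the
    individual word frequencies would predict.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(first, second)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs in this order
    p_pair = pair_count / (len(tokens) - 1)
    p_first = unigrams[first] / len(tokens)
    p_second = unigrams[second] / len(tokens)
    return math.log2(p_pair / (p_first * p_second))


tokens = (
    "the dangerous dog bit the postman and "
    "the dangerous dog ran from the park"
).split()

strong = compound_gain(tokens, "dangerous", "dog")
weak = compound_gain(tokens, "the", "postman")
print(strong > weak)
```

In this sample "dangerous dog" scores higher than "the postman" because "the" appears in many other contexts, so its pairing with any one word carries less information.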

However, the concepts are identified without any linguistic analysis and so conceptSearching works with any vocabulary and is language independent. The mathematical approach works because Shannon's theory can be applied to any human language communication.

The ability of an IR system to identify clusters of words that identify specific concepts represents a major advancement over systems that fail to do this. Apart from conceptSearching, we are aware of only one other company that implements Shannon's Information Theory for concept identification.


What is Language Stemming?

Often a user will type in a query with one form of a word but would like to match other forms of what is essentially the same word.

In 1980 Dr Martin Porter, a member of the team working on a Probabilistic Model at Cambridge University, developed a suffix-stripping algorithm that has been very widely adopted for normalizing words in IR systems.
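
The idea of suffix stripping can be sketched with a handful of rules. This is a deliberately simplified toy, not Porter's algorithm: the rule list, the `measure` helper (a rough analogue of Porter's "m" condition) and the name `toy_stem` are assumptions for illustration. A real system would use a full implementation such as the Snowball stemmers.

```python
import re

# (suffix, replacement, minimum measure the remaining stem must exceed)
RULES = [
    ("ies", "y", 0),
    ("ers", "", 1),
    ("ing", "", 1),
    ("ous", "", 1),
    ("er", "", 1),
    ("s", "", 0),
]


def measure(stem: str) -> int:
    """Count vowel-group followed by consonant-group pairs in the stem."""
    return len(re.findall(r"[aeiou]+[^aeiou]+", stem))


def toy_stem(word: str) -> str:
    """Strip the first suffix whose remaining stem is long enough."""
    word = word.lower()
    for suffix, replacement, min_measure in RULES:
        if not word.endswith(suffix) or word.endswith("ss"):
            continue
        stem = word[: -len(suffix)]
        if measure(stem) > min_measure:
            return stem + replacement
    return word


for w in ["dangerous", "dangers", "attackers", "attacking", "babies"]:
    print(w, "->", toy_stem(w))
```

The measure check is what stops "danger" from being cut down to "dang" while still allowing "attacker" to become "attack".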

Using Porter's algorithm the following words can be matched:

"dangerous" with "danger", "dangers" and "dangerous"
  "attacks" with "attack", "attacks", "attacker", "attackers" and "attacking"
  "baby" with "baby" and "babies"

In addition, with our fuzzy stemmer the following words can also be matched:

"misspelt" with "mispelt"
  "commission" with "commision", "comission", "commissioning" and "comisioned"
  "accommodate" with "accomodate" and "acomodation"

conceptSearching uses language stemming as part of its concept matching process, although individual words and phrases may be left unstemmed by enclosing them in double quotes. This means that by default stemming broadens the matching process, but where a particular word should be interpreted verbatim it can easily be excluded from the stemming process.
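
How the fuzzy stemmer works internally is not described here. Several of the misspellings listed above differ from the correct word only in doubled letters, and one simple technique that catches those is to collapse repeated letters before comparing. The sketch below shows that technique only; the function names are hypothetical, and variants that also change the suffix (such as "comisioned") would additionally need stemming.

```python
import re


def collapse_doubles(word: str) -> str:
    """Reduce any run of a repeated character to a single character."""
    return re.sub(r"(.)\1+", r"\1", word.lower())


def fuzzy_match(a: str, b: str) -> bool:
    """True when two words are identical once doubled letters are ignored."""
    return collapse_doubles(a) == collapse_doubles(b)


print(fuzzy_match("misspelt", "mispelt"))
print(fuzzy_match("commission", "comission"))
print(fuzzy_match("accommodate", "accomodate"))
print(fuzzy_match("commission", "omission"))
```

The first three comparisons match and the last does not, since "commission" and "omission" differ by more than a doubled letter.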

For more information about Dr Martin Porter's stemming algorithms please visit the Snowball site at: http://snowball.tartarus.org.


What is Dynamic Summarization?

When a document is retrieved we normally need to display an extract from the document as an aid to the user when reviewing the returned document set. Most systems will display a static summary that is the same regardless of the user's query. conceptSearching can display static summaries. However, it can also apply a modified weighting system to identify short extracts that are most relevant to the user's query. The number, length and relevance threshold for these extracts are all configurable. The extracts will normally comprise whole sentences or short paragraphs.
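
A query-dependent summary can be sketched by scoring each sentence against the query and keeping the best few. This is a minimal sketch, assuming plain term overlap as the score; the modified weighting system mentioned above is not specified, and the function name `dynamic_summary` and its parameters (`max_extracts`, `min_score`) are hypothetical stand-ins for the configurable number and relevance threshold.

```python
import re


def dynamic_summary(text, query_terms, max_extracts=2, min_score=1):
    """Return the sentences sharing the most distinct terms with the query."""
    query = {t.lower() for t in query_terms}
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    scored = []
    for position, sentence in enumerate(sentences):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        score = len(words & query)
        if score >= min_score:  # relevance threshold
            scored.append((score, position, sentence))

    # Pick the best-scoring sentences, then show them in document order.
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_extracts]
    return [sentence for _, _, sentence in sorted(best, key=lambda item: item[1])]


text = (
    "The council met on Tuesday. "
    "A dangerous dog was reported near the school. "
    "Officers said the dog had attacked a baby. "
    "The meeting closed at noon."
)
extracts = dynamic_summary(text, ["dangerous", "dog", "attacks", "baby"])
for line in extracts:
    print(line)
```

A different query against the same document would select different sentences, which is what distinguishes this from a static summary. Combining it with stemming would also let "attacks" match "attacked".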


 What about Classification and Support for Taxonomies?

The conceptClassifier module can be used to classify documents into any predefined categories based on a small number of descriptors. Once classified the documents can then be applied to a corporate taxonomy and used for browsing the database or as a filter when running ad hoc queries.

conceptClassifier can classify around 100,000 documents per hour when using a medium sized taxonomy (such as the IPSV which has 2,700 nodes).


Can I call conceptSearching from an ASP/COM+ application?

New application development on the Microsoft platform is rapidly moving to .NET and this environment makes interfacing to Web Services very simple. However, many excellent products have been developed for the ASP/COM+ environment and migrating these to .NET would be a major undertaking. Fortunately, Microsoft has provided the SOAP Toolkit for ASP/COM+ developers and using this it is fairly straightforward to call Web Services running under .NET (or J2EE).

See our ASP/COM+ demonstration for sample code showing how to call conceptSearching QueryServer from an ASP page.

The SOAP Toolkit v3.0 is available as a free download from Microsoft.


What types of document can I store?

conceptSearching has the following collectors:

  • HTTP collector - for spidering web pages
  • File collector - for documents located on file systems
  • SharePoint collector - for SharePoint portals
  • SQL collector - for documents held in a SQL database (e.g. SQLServer or Oracle)
  • XML collector - for custom document types

 

conceptSearching has native file conversion facilities for the following document types:

  • All HTML and XML formats
  • Adobe Portable Document Format (PDF)
  • Microsoft Word
  • Microsoft Excel
  • Microsoft PowerPoint
  • Microsoft Rich Text Format (RTF)
  • Corel WordPerfect
  • Any other files in text format (e.g. TXT, CSV, etc)

In addition, third party iFilters can be used to convert virtually all other popular document formats (e.g. Microsoft Visio, email file formats, StarOffice documents, etc).


 What languages are supported?

conceptSearching can index any text in the Roman alphabet, including full support for diacritics. The use of diacritics within documents or queries is entirely optional, so that fitchée will match with fitchee and vice versa. All information is exchanged, and managed internally, using UTF-8 and so support for non-Roman scripts (e.g. Kanji or Arabic) should be possible in the future.
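
Matching with or without diacritics is commonly achieved by folding accented characters to their base letters before comparison. The sketch below uses Unicode decomposition from the Python standard library; it is one common approach and not necessarily how conceptSearching handles it. The function name `fold_diacritics` is hypothetical.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_diacritics("fitchée") == fold_diacritics("fitchee"))
```

Applying the same folding to both the indexed text and the query is what makes the match work in either direction.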

The following 14 languages are automatically detected and processed:

- Afrikaans
- Danish
- Dutch
- English
- Finnish
- French
- German
- Hungarian
- Italian
- Norwegian
- Portuguese
- Spanish
- Swedish
- Welsh



How scalable is conceptSearching?

The designers of conceptSearching have many years' experience in implementing proprietary file systems and custom databases. In particular, the proprietary database format has been designed to allow concurrent indexing at full speed whilst allowing simultaneous access for retrievals. This concurrency has been achieved in part by reducing the amount of file restructuring typically found in competing systems, which are often based on B-tree structures. The selected design tends to produce an index database a little larger than some alternatives but with faster retrieval. In general conceptSearching will produce an index database whose size is roughly equal to the volume of text under index (e.g. 10GB of text will typically produce an index database of about 10GB).

For testing and development the entire system can be installed on a single computer. For live implementations the Query Server, Index Server and the Web Application would normally be distributed. A multi-server configuration will be capable of indexing about a million pages per day whilst simultaneously providing retrieval to thousands of concurrent users. For very large implementations multiple Query Servers could be configured with shared access from a pool of application servers.

There is also a Distributed Query Server so that very large indexes can be partitioned over a number of servers to improve indexing performance. 


 Why is a SQL database required?

conceptSearching stores its probabilistic index in a proprietary database. However, the conceptIndexer uses a SQL database to manage the queue of documents to be indexed. The SQL database contains all information necessary to perform indexing, such as the individual filenames and URLs, access criteria, re-indexing frequency, inclusions and exclusions, etc. The SQL database may also be used to store any application specific meta-data. 

The SQL database can be either Microsoft SQL Server (2000 or later) or Oracle (8i or later).
