Why Concept Searching

In this article we take a look at the data discovery and classification capabilities of Concept Searching’s conceptClassifier platform, conceptTaxonomyManager, and conceptTaxonomyWorkflow.

The security landscape is constantly changing, with cybercriminals becoming more sophisticated, to the extent that organizations are often unaware they have been breached. Internal or accidental misuse of data remains one of the most common reasons for data compromise. Savvy organizations have moved away from building an impenetrable on-premises fortress, and are now considering the security of the cloud, cell phones, laptops, and personal technology tools – essentially, every end point in their organizations.

However, the stumbling block is the inability to identify what is contained within data, specifically unstructured data, but also structured and semi-structured data. Surveys repeatedly indicate that data is being viewed as a valuable business asset, yet executives admit they don’t know what their data contains. This significantly increases organizational risk and renders the data of no value.

There has been a huge increase in the availability of security, audit, and risk mitigation software to protect organizations, and manage users and devices, regardless of where they reside. The weak link is the data discovery, classification, and remediation process. Despite advances in cybersecurity solutions and in classification and remediation capabilities, most offerings still rely primarily on regular expressions, for example to find privacy data, and do not provide an understanding of data at the contextual level.
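The limitation of regular expressions can be illustrated with a minimal sketch. The pattern and sample sentences below are assumptions made up for illustration, not rules shipped with any product: a pattern finds text with a particular shape, but it cannot tell what a document is about.

```python
import re

# Illustrative pattern for the NNN-NN-NNNN shape of a US Social Security number.
# This is an assumed example, not a rule taken from any vendor's rule set.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text):
    """Return every substring that has the NNN-NN-NNNN shape."""
    return SSN_PATTERN.findall(text)


# Hypothetical sample sentences.
formatted = "Employee SSN: 123-45-6789"
contextual = "Her social security details were emailed to the vendor."
part_number = "Order part 555-12-3456 before Friday."

print(find_ssn_like(formatted))    # the formatted number is found
print(find_ssn_like(contextual))   # nothing is found, although the topic is sensitive
print(find_ssn_like(part_number))  # a part number is flagged by shape alone
```

The second sentence discusses sensitive information without containing a matching string, and the third contains a matching string that is not sensitive. Both cases call for context rather than pattern shape.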

Keyword and proximity approaches, often used in data discovery applications, are still not able to capture the essence of content, or identify similar content with the same subject or topic but different keywords. Concept Searching’s technology differs in that it can identify ‘intelligent context in content’ – meaning the insight engine is able to generate multi-term metadata that indicates the ‘aboutness’ of the data.

Originally launched as an enterprise search engine in 2002, the Concept Searching system was optimized from the beginning to scale and deliver high performance in terms of recall and precision – the two key performance indicators for search.
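For readers unfamiliar with the two indicators, the following is a minimal sketch of how precision and recall are conventionally computed for a single query. The function name and document identifiers are hypothetical, and the sketch says nothing about how Concept Searching measures its own results.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three documents truly relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(precision, recall)
```

In this example two of the four returned documents are relevant, so precision is one half, and two of the three relevant documents were returned, so recall is two thirds.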

conceptSearch is still sold as an enterprise search engine to organizations that have specific search requirements, such as those in government intelligence. Because of its original use, the insight engine crawls content and enables lightweight collection of terms. The terms are stored in a database, with no impact on the infrastructure. There are no agents to install on end points, so the approach adds no burden to the platform environment and creates no additional attack surface.

Concept Searching technologies deliver automatic intelligent metadata generation, auto-classification, and taxonomy management. Unlike traditional tools, the classification results enable organizations to overcome the seemingly insurmountable challenge of identifying intelligent content in context. The ability to identify security risks using this approach is flexible, and has proved exceptionally useful in addressing business process failures in records management, information security, intelligent migration, text analytics, and secure collaboration.

One of the key advantages is a single source of truth, through centralized information management initiatives. As content is classified, the automatic metadata generation populates an enterprise metadata repository. This serves as a single point of metadata capture and reference, enabling the deployment of processes across the enterprise and eliminating siloed repositories of data. This provides a holistic view of statistically generated metadata, and enables taxonomy administrators to manage it from a single repository. Taxonomy administrators can belong to any organizational group, such as IT or security, and can be any individuals who understand their portion of the corpus of content. The conceptTaxonomyManager component also provides node locking, rollback, and auditing features. Additional taxonomies can be deployed to support business units with specific requirements.

Unique Intellectual Property
The core technology is based on compound term processing. Concept Searching is the only classification vendor with technology that statistically calculates the value of word strings that form a concept and, in turn, generates multi-term metadata. The metadata can be a single word term or multi-word patterns. These patterns can identify concepts based on one, two, or three words, and occasionally four or five words. Utilizing Concept Searching’s compound term processing, the technologies deliver a set of outcomes that are not achieved by any other classification engine.

Compound term processing means that Concept Searching’s statistical engine can understand, out of the box, the incremental value of keywords, multi-word fragments, and compound terms. As a result, it identifies concepts resident within an organization’s own information repositories that are highly correlated to particular topics. The identification of these topics in this way delivers automatically generated, intelligent metadata that is unique to the organization.

The words ‘triple,’ ‘heart,’ and ‘bypass’ all have different meanings. Using compound term processing, a search for ‘survival rates following a triple heart bypass’ will locate documents about this topic, even if this precise phrase is not contained in any document. A concept search using compound term processing can extract the key concepts, in this case ‘survival rates’ and ‘triple heart bypass,’ and use these concepts to select the most relevant documents.
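The bypass example can be sketched in a few lines. This is a deliberately simple stand-in that splits text into multi-word terms at stop words and ranks documents by how many of the query's terms they contain. It is not Concept Searching's statistical engine, and the stop-word list, function names, and sample documents are all assumptions made for illustration.

```python
import re

# Small, assumed stop-word list; a real system would use a far richer model.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and",
              "following", "after", "with", "is", "are", "was", "were"}


def compound_terms(text):
    """Split text into candidate multi-word terms, breaking at stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    terms, current = [], []
    for word in words:
        if word in STOP_WORDS:
            if current:
                terms.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        terms.append(" ".join(current))
    return terms


def rank(query, documents):
    """Return ids of documents containing query terms, best match first."""
    concepts = compound_terms(query)
    scored = []
    for doc_id, text in documents.items():
        # Normalize to space-separated lowercase words, padded for whole-word matching.
        normalized = " " + " ".join(re.findall(r"[a-z]+", text.lower())) + " "
        score = sum(1 for concept in concepts if f" {concept} " in normalized)
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical documents.
documents = {
    "d1": "Five-year survival rates were high after triple heart bypass surgery.",
    "d2": "The highway bypass tripled traffic in the heart of town.",
    "d3": "Patients recover slowly from a triple heart bypass.",
}
query = "survival rates following a triple heart bypass"

print(compound_terms(query))
print(rank(query, documents))
```

The query phrase appears verbatim in none of the documents, yet d1 ranks first because it contains both extracted terms, while d2 is excluded even though it shares the individual words ‘bypass’ and ‘heart’ with the query.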

The Concept Searching technology enables the rapid creation of multi-term metadata, which can be classified to organizationally-defined taxonomies. The tagging and auto-classification of content can be aligned to business goals, and the semantic metadata generated can be easily integrated with any third-party application or platform that can interface through web services. By making these compound terms available to any application that requires metadata, outcomes are highly accurate because the ambiguity inherent in single words is not an issue.

Ambiguity is a key concern and its limitations pose a risk for organizations, often making it necessary for them to adapt vendors’ products in order to support a unique organizational vernacular. The impact of this is increased cost, risk, and manpower due to aligning product and data repositories, typically building rules to support the organizational vocabulary and substantial, reiterative testing. These requirements rely greatly on the human element and can obscure or even derail the fundamental objectives of a project, whether it relates to compliance, security, privacy, or a host of applications where data classification is a necessity.

This unique technology overcomes the limitations that have plagued information retrieval and data classification vendors still reliant on overused and time-worn metadata capture techniques, which depend upon keywords, proximity or language packs. These are unable to produce the essence or meaning from an organization’s unstructured, semi-structured, and structured data, without costly customization.

Adopting recent buzzwords, many vendors offer solutions that claim to use artificial intelligence (AI) or natural language processing (NLP), yet most require rules to be created using Boolean expressions, needing knowledgeable staff and time to develop, test, and deploy.

Taking a fundamentally opposite approach, the dynamic and intelligent insight engine within the conceptClassifier platform substantially reduces the manpower required for customizing organizational vocabulary, and improves the precision of data classification results. Using the conceptTaxonomyManager component within the platform, authorized taxonomy administrators experience a highly interactive and supportive environment, where rules are supplied and new rules can be developed in minutes.

The value of data classification spans a broad set of application uses that fall under the larger umbrella of information governance. The auto classification capabilities identify data that may be hidden, noncompliant, or contain privacy and sensitive information exposures. The taxonomies dynamically adapt as the business data evolves. Aligned to the data repositories, which are in a state of flux, the classification process automatically identifies data that has been changed, moved, or deleted. The taxonomy and automated classification form the foundation for identification of data posing risk, as well as data of unknown value, enabling organizations to realize business benefits by leveraging the capabilities of the conceptClassifier platform.

Since the insight engine can identify concepts, subjects, or topics, the classification results are highly accurate because the ambiguity inherent in single words is no longer a problem. For example, administrators can create or customize rules to be used for the identification of sensitive or confidential information. Authorized users can rapidly create a pattern for detection, using any identifiable verbiage or descriptors. Since the insight engine understands meaning to generate the multi-term metadata from within the data, inter- or intra-related phrases, keywords, or entities are also identified, even if they do not match the pattern specified. This is an additional benefit and provides another layer of detection of potential exposures, in the case of protecting privacy and sensitive information.

Auto-classification scans the contents of a document and automatically assigns categories, keywords, or entities found in the document. It will also use system-generated metadata or metadata assigned by a user. Auto-classification solutions vary in the degree of complexity and time required to achieve optimal classification. Some require very large training sets, multiple iterations, and rules for each term, which are sometimes complex, necessitating repetitive testing and re-indexing the test data, assuming the test data is accurate.

The Concept Searching auto-classification technology has the capability to automatically group unstructured content, based on an understanding of the concepts that share mutual attributes, while separating dissimilar concepts. The conceptTaxonomyManager component provides a hierarchical view of topics that have been grouped because they share the same quality or characteristic. Because of Concept Searching’s compound term processing technology, documents are automatically classified in the taxonomy according to their conceptual relationships and relevance. Documents may exist in multiple categories, as one document may contain multiple concepts.

The automation of the process is done transparently, without user involvement, to handle the appropriate disposition of content. This includes discovering where the content resides, cleansing it through organizationally-defined concepts and descriptors, identifying the relationships within the content, and then applying policies and automating enforcement, using risk and information governance security solutions.

Both automatic and manual classification are supported, and administrators can utilize rich features such as node weighting, seeing the ‘concepts in context,’ searching the corpus, auto-clue suggestion for classification, and instant feedback on the impact of changes. The taxonomy provides the structure for the grouping of like documents, and enables a more targeted, accurate, and efficient management and tuning tool that translates into reduced costs and improved productivity. Many of these features are unique in the industry.

Natural Language Processing
Natural language processing (NLP), which Concept Searching incorporates for entity extraction and stemming, focuses on the interaction between machines and natural human languages. It attempts to understand human input, break down syntax to comprehend meaning, and determine appropriate action. The programming required must be precise, unambiguous, and highly structured. The continual analysis of patterns in data improves the understanding of the indexing engine. NLP is typically used in named entity extraction, where pre-classified categories are clearly defined and do not need to be interpreted.

Preconfigured systems aligned with industries are not able to capture unique vocabulary in a corpus of content, or not without significant effort. Linguistic products are highly dependent on vocabulary, style, and language, making them cumbersome to modify. Most frequently used in responding to the spoken word, the linguistic structure can depend on complex variables such as slang, regional dialects, and social contexts. As with any digital system, it is only as good as the data it receives. It is up to people to discern between fact and fiction. Unfortunately, content is not always predictable.

Taxonomy Management
Taxonomy and metadata have a co-dependent relationship. The structure of the taxonomy and the metadata are reciprocal elements that work together to create the information architecture. Taxonomies provide the visual organization and structure for organizing content, which metadata does not provide. At the same time, metadata provides more descriptive information about the content, to improve access and use. Intelligent metadata generation improves workflows and business applications that use metadata. The intelligent metadata is generated as content is created or ingested, and identifies how data elements are related, as well as the meaning of content, accomplished through Concept Searching’s unique, compound term processing engine.

Eliminating complex Boolean rules and the need for reiterative training sets, taxonomy nodes can be automatically generated from compound terms found in the document corpus. Taxonomy administrators have full control over the terms used and the weighting of terms, based on relevancy. This enables a much more robust taxonomy, as terms are suggested based on an organization’s own content and can offer administrators new terms from relevant documents that may not have been identified.

All the conceptClassifier platform rules can be easily modified and new rules rapidly created, enabling administrators and subject-matter experts to see the impact of changes and adjust as necessary to match specific organizational requirements, all in real time. The base product comes with over 80 security and compliance rules, which are clue-based, enabling multiple terms or phrases to be added and combined, to match certain thresholds. This goes far beyond the limitations of keywords, proximity, or regular expressions.
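The idea of a clue-based rule that combines several terms or phrases against a threshold can be sketched as follows. The rule structure, clue weights, and threshold are hypothetical values chosen for illustration. The actual rule format of the conceptClassifier platform is not described here, so this should be read as a sketch of the general technique only.

```python
def rule_score(text, clues):
    """Sum the weights of every clue phrase that occurs in the text."""
    lowered = text.lower()
    return sum(weight for phrase, weight in clues.items() if phrase in lowered)


def rule_matches(text, clues, threshold):
    """A rule fires when the combined clue weight reaches the threshold."""
    return rule_score(text, clues) >= threshold


# Hypothetical rule for personally identifiable information.
pii_clues = {
    "social security number": 60,
    "date of birth": 30,
    "account number": 30,
}
PII_THRESHOLD = 50

weak = "Your account number is printed on the statement."
strong = "Please confirm your date of birth and account number."

print(rule_score(weak, pii_clues), rule_matches(weak, pii_clues, PII_THRESHOLD))
print(rule_score(strong, pii_clues), rule_matches(strong, pii_clues, PII_THRESHOLD))
```

A single weak clue stays below the threshold, while two weak clues together, or one strong clue alone, cross it. Adjusting a weight or the threshold and re-scoring sample text is one simple way to see the impact of a rule change immediately.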

Both the public and private sectors continue to build complex data flows reaching far beyond the confines of an organization. Information is routinely communicated to clients, vendors, partners, consultants, or government agencies. This makes prevention of data breaches and data compromise complex and challenging. The increase in the use of cloud technology and the rise of data theft bring internal and external security challenges. Risk is different for every organization, be it regulatory, intellectual property protection, cybersecurity, eDiscovery, data retention, or even the use of information in unintended ways. Tolerance for risk must be determined by each organization. As risk applies to data, the formidable challenge has been the inability to identify risk factors, analyze them, put in place proper policies, and take action.

Metadata is the business enabler that can no longer be ignored. The traditional approach of placing the onus on end users to accurately tag data is not an option. The most widely used metadata generation techniques are not viable for data discovery and classification. It is only when statistically generated multi-term metadata is available that value can be consistently achieved. The conceptClassifier platform, with its ability to extract content in context, forms the framework for an easily managed metadata repository, and addresses the issue of mediocre data discovery and classification that leaves organizations vulnerable.

The need for an information governance strategy for unstructured content is a priority. Content is accumulated from a variety of sources, and must be managed in line with organizational information governance strategy and tactics. The result is the ability to manage the quality of information as well as its lifecycle, while reducing organizational risk and improving business performance.
