Healthcare Organization

Case Study


Content Optimization – Reducing Risk, and Ensuring Compliance and Information Governance

“Organizations are beginning to wake up to the fact that legacy content contains risk and value. Proactive organizations address this challenge when planning a migration. Unfortunately, most organizations typically don’t view cleansing of data as a task that should be incorporated into their normal IT activities. As a result, poor search, data privacy violations, and records management problems abound, and repercussions prove costly.”

Concept Searching
Customer Location:
United States

This healthcare organization was migrating over 40 terabytes of legacy content from SharePoint on-premises to SharePoint Online. It needed to ensure that all patient privacy data and sensitive information were protected according to HIPAA guidelines. Although migration was the top priority, compliance, information governance, and cleanup of legacy content were also necessary. The final objective was to classify the cleansed content against the industry-defined MeSH taxonomy.


“Although a cliché, delivering the right information to the right person in the right context still has not come to fruition with search engines. Unproductive workers cost US businesses almost $600 billion per year. Without meaningful metadata and categorization, the cost is going to continue to rise.”

This healthcare organization understandably placed the highest priority on protecting data privacy and sensitive information from exposure. It used the conceptClassifier platform, conceptClassifier for SharePoint Online, and conceptTaxonomyWorkflow both before and after migration. This enabled the organization to evaluate and classify legacy content, eliminate stale information and duplicate copies of the same content, and identify information that was no longer of value.

During this process, dark data of value and unprotected private and sensitive information were identified. That content could then be moved to secure repositories, blocked from download, and referred to appropriate staff for disposition. Cleansing the corpus of content greatly reduced the time and effort needed to migrate. Content was classified against the MeSH taxonomy, search was significantly improved, and real-time vigilance was applied to data privacy and sensitive information.


  • Implement real-time identification and protection of sensitive information such as PII and PHI, ensuring compliance with HIPAA
  • Use one set of technologies to identify, organize and retrieve content assets
  • Create a single source of truth, deploying a programmatic approach that was easily managed by subject-matter experts
  • Cleanse legacy content

This healthcare organization is made up of 2 hospitals, 7 specialty centers, and 19 clinics. It provides health, wellness and acute care services, including specialty care, primary care, home health, a research institute, and community outreach services.

The organization had over 40 terabytes of legacy content residing in file shares and was evaluating options to migrate to SharePoint Online and take advantage of OneDrive for Business. As part of that process, there was a requirement to improve knowledge management for business users, implement concept-based search, and enforce information governance initiatives involving data security and the protection and management of PII, PHI and other sensitive information, and compliance with regulatory guidelines.

The challenges faced included:

  • Inability to determine what should be deleted, saved, or archived from legacy content
  • Inability to identify privacy or sensitive information that needed special processing
  • Requirement to improve search results, information transparency, and knowledge management
  • Requirement for one tool that could be used throughout the organization by subject-matter experts, to manage content

This healthcare organization chose Concept Searching technologies for their ability to address all its requirements and to provide both a short-term and a long-term strategy for managing content. The solution gave the organization one set of technologies to achieve its objectives of data cleanup, migration, protection of privacy information, improved enterprise search, and effective content management. Compliance and information governance were incorporated from the outset, encompassing the identification of a single source of truth, HIPAA adherence, control and protection of critical data, and proactive risk management.

The first challenge was dealing with legacy content. The content optimization solution identifies duplicates, versions, and redundant, outdated or trivial (ROT) data, but it goes far beyond basic cleanup. The process also identifies any data privacy or organizationally defined sensitive information, undeclared or erroneously tagged records, and noncompliance exceptions. These additional capabilities enabled the organization to identify sources of risk, some of which were previously unknown. They also significantly reduced both the amount of content to be migrated and the server footprint required for the move from an on-premises environment to the cloud.
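As a rough illustration of the duplicate-identification step only, exact copies of a file can be found by hashing each file's bytes and grouping files that share a digest. This is a minimal sketch under stated assumptions, not the vendor's method: the function name `find_duplicates` and the in-memory dictionary of documents are hypothetical, and byte-level hashing does not detect near-duplicates, versions, or other ROT data.

```python
import hashlib
from collections import defaultdict


def find_duplicates(documents):
    """Group document names whose byte content is identical.

    `documents` maps a name to its raw bytes (a hypothetical stand-in
    for files read from a share or document library).
    """
    by_digest = defaultdict(list)
    for name, content in documents.items():
        # Identical bytes always produce the same SHA-256 digest.
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    # Keep only digests shared by more than one document.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


if __name__ == "__main__":
    sample = {
        "policy_v1.docx": b"visitor policy",
        "policy_copy.docx": b"visitor policy",
        "menu.docx": b"cafeteria menu",
    }
    print(find_duplicates(sample))
```

In practice each group would be reviewed so that one authoritative copy is kept and the rest are deleted or archived.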

Protected health information and other sensitive information are safeguarded in real time as content is created or ingested. Additional redaction capabilities are also available. The standard product comes with over 80 rules to address compliance requirements, including those related to HIPAA regulations. Content that contains privacy vulnerabilities is automatically moved to a secure repository and blocked from download, and a notification is sent to the appropriate personnel for disposition.
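The rule-driven flow described above (match a rule, secure the content, block download, notify staff) can be sketched in a few lines. Everything here is an assumption made for illustration: the two regular expressions, the rule names, and the `disposition` function are hypothetical and are not the product's rules or API. Real PHI detection needs far broader and more carefully validated rules.

```python
import re

# Hypothetical rules: a US Social Security number pattern and a
# made-up medical record number (MRN) format.
RULES = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}


def matched_rules(text):
    """Return the sorted names of all rules that match the text."""
    return sorted(name for name, pattern in RULES.items() if pattern.search(text))


def disposition(text):
    """Decide how a document should be handled based on rule matches."""
    hits = matched_rules(text)
    if hits:
        # Sensitive content: secure it, block download, alert staff.
        return {"location": "secure", "download_allowed": False,
                "notify": True, "rules": hits}
    return {"location": "default", "download_allowed": True,
            "notify": False, "rules": []}


if __name__ == "__main__":
    print(disposition("Patient MRN: 1234567, SSN 123-45-6789"))
    print(disposition("Cafeteria menu for next week"))
```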

Once the content optimization process was completed, the organization was in an excellent position to dramatically reduce migration efforts and to auto-classify content to the MeSH taxonomy. Content that was irrelevant, of no value, or stale had been removed during the content optimization process, and because the remaining content was tagged and classified, the quality of information retrieved by enterprise search improved significantly. After the migration to SharePoint Online, users were able to search on phrases, concepts, and multi-word terms to retrieve highly accurate content. This insight engine identifies similar concepts, subjects, and topics, even if the search words themselves are never used.

The use of the solution by subject-matter experts was a critical feature for this organization. conceptTaxonomyManager, a component of the conceptClassifier platform, automatically creates an initial set of classification clues (semantic metadata) for the taxonomy. Administrators can tune the classification clues, create new ones on the fly, and quickly run test scenarios against their live content prior to deployment. Each node of the taxonomy represents a type of content and is associated with system-generated clues and those entered by administrators. Clues can consist of a single word or a string of words, entities and custom entities, acronyms, and synonyms, as well as keywords. This enables administrators to test the clues in real time and define a threshold for classification, based on clue and concept matching. This functionality was extremely important to the organization, as it delivered one tool that could be used to manage content enterprise-wide after the migration.
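To make the idea of clues and a classification threshold concrete, the following minimal sketch scores a document against weighted clues for each taxonomy node and assigns every node whose score reaches the threshold. It is an assumption-laden illustration, not the conceptTaxonomyManager algorithm: the node names, clue weights, simple substring matching, and the `classify` function are all hypothetical.

```python
# Hypothetical taxonomy: node name -> {clue: weight}.
TAXONOMY = {
    "Diabetes Mellitus": {"diabetes": 40, "insulin": 30, "blood glucose": 30},
    "Hypertension": {"hypertension": 50, "blood pressure": 40},
}


def score(text, clues):
    """Sum the weights of all clues that appear in the text."""
    lowered = text.lower()
    return sum(weight for clue, weight in clues.items() if clue in lowered)


def classify(text, taxonomy=TAXONOMY, threshold=50):
    """Return the sorted nodes whose clue score meets the threshold."""
    return sorted(
        node for node, clues in taxonomy.items()
        if score(text, clues) >= threshold
    )


if __name__ == "__main__":
    print(classify("Insulin dosing and blood glucose monitoring"))
    print(classify("Monitor blood pressure", threshold=40))
```

Lowering the threshold or raising a clue's weight makes a node easier to assign, which is the kind of adjustment an administrator would test before deployment.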

This healthcare organization was able to achieve all its goals. Through generation of semantic metadata, auto-classification, and the easy-to-use taxonomy component, subject-matter experts are now able to fine-tune the taxonomies. Using the technologies to clean up the corpus of content reduced the time and complexity of migration. Because content had been tagged and classified, better enterprise search and the ability to find highly relevant and accurate information came as by-products of the initial efforts, significantly improving access to content and knowledge assets. Privacy and sensitive information is identified in real time, moved to secure repositories, and blocked from download.

  • Implements real-time identification and protection of sensitive information such as PII or PHI, ensuring compliance with HIPAA
  • Uses one set of technologies to identify, organize and retrieve content assets
  • Creates one source of truth, deploying a programmatic approach that is easily managed by subject-matter experts
  • Significantly improves enterprise search and knowledge access
  • Solves problems in content identification and management
  • Utilizes information governance workflows, transparently to end users
