National US Law Firm

Case Study


Content Optimization – Reduce Risk and Ensure Compliance

“Organizations fail to realize the impact of unmanaged content and how it can derail the achievement of organizational objectives. Over 90 percent of documents are never accessed after creation. Unfortunately, this significantly impacts search. Multiply this by the number of documents you create or ingest, and the majority are being fed to the search engine index and have no value.”

Nicola Barnes
Customer Location:
United States

The firm was migrating to a cloud-based content management legal solution. A component of the migration was an analysis of content to identify documents that should be deleted, archived or protected.


“Proactive management of unstructured and semi-structured content is no longer an option but a necessity. Ignoring the problem, and it is a problem, increases organizational risk and noncompliance, making information governance unattainable. In the legal environment, this can compromise clients’ confidentiality.”

This firm was migrating to a cloud-based content management legal solution. It was proactive enough to realize that its corpus of content, containing terabytes of information, needed to be analyzed and reviewed. Content optimization, a solution provided by Concept Searching, was used to cleanse the corpus of content and provide the path to intelligent migration. The firm approached the project as an opportunity to meet several requirements:


  • Tag and process undeclared or erroneously declared records
  • Identify privacy or organizationally defined sensitive content that was unprotected
  • Determine whether content should be archived or deleted
  • Identify duplicate documents and versions
  • Identify redundant, obsolete, or trivial (ROT) content

One of the largest US law firms, this organization is known for its work with major construction companies in Europe and Asia, multinational pharmaceutical companies in New York and California, key insurance and financial services organizations in Texas, Indiana, and Alabama, and national tire manufacturers in Tennessee and Georgia.

Once the firm decided to migrate to a cloud-based content management legal solution, it realized this was not a simple process. It recognized that its corpus of content had become unmanageable over the years, due to growth and mergers. With terabytes of content, it was not feasible to manually evaluate the contents of each document for disposition. Other tools generate primarily system metadata, such as name, date, and title. That would not work for this firm, as it needed access to what was actually contained in documents, not a summary of information. It also needed a method to offer defensible deletion, with full audit capability. Even though content optimization had to be done, the primary objective was to migrate to the SharePoint environment.

The challenges faced included:

  • No way to determine the content within each document for analysis
  • No ability to identify privacy or sensitive information that needed special processing
  • No facility to determine whether a document should be archived or deleted
  • No ability to identify undeclared records
  • No way to identify and safely delete redundant, obsolete, or trivial (ROT) content

The firm chose Concept Searching’s content optimization solution, using the conceptClassifier platform, conceptClassifier for SharePoint Online, and conceptTaxonomyWorkflow. It had vast amounts of content stored in file shares, with replicas of documents spread across servers, different versions of documents, and content of no value contributing to the terabytes of information. Manual assessment was not possible, and most tools did not provide the type of analysis required by this firm.

The content optimization solution identifies duplicates, versions, and redundant, obsolete, or trivial (ROT) content, but it goes far beyond the basic cleanup of content. The process identifies any data privacy or organizationally-defined sensitive information, undeclared or erroneously tagged records, and noncompliance exceptions. These additional capabilities provided the firm with the ability to identify sources of risk, some of which were unknown, and significantly reduce the amount of content to be migrated, as well as the server footprint required.
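As a rough illustration of the simplest part of this cleanup, exact duplicates can be found by hashing file contents and grouping identical digests. The sketch below is not the vendor's method: the function names and the in-memory corpus are hypothetical, and it only catches byte-for-byte copies, not versions or near-duplicates.

```python
import hashlib
from collections import defaultdict


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def find_exact_duplicates(documents: dict[str, bytes]) -> list[list[str]]:
    """Group document paths whose contents are byte-for-byte identical."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path, data in documents.items():
        groups[content_digest(data)].append(path)
    # Keep only digests shared by more than one path.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


# Hypothetical in-memory corpus standing in for file-share content.
corpus = {
    "serverA/contract.txt": b"Master services agreement, v1",
    "serverB/copy_of_contract.txt": b"Master services agreement, v1",
    "serverA/memo.txt": b"Lunch menu for Friday",
}
duplicate_groups = find_exact_duplicates(corpus)
print(duplicate_groups)
```

Hashing scales to large file shares because each file is read once, but detecting versions and ROT content requires analysis of what the documents say, which this sketch does not attempt.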

There are two ways to initiate this process. The first is to automatically generate semantic metadata and auto-classify content to an enterprise taxonomy. The taxonomy component automatically creates an initial set of classification clues (semantic metadata) for the taxonomy. Administrators can tune the classification clues, create new ones on the fly, and quickly run test scenarios against their live content prior to deployment. This approach has the advantage of identifying content that was not known to exist. Nodes of the taxonomy represent the type of content found, and are associated with system-generated or administrator-entered clues, which can consist of a single word or a string of words, entities and custom entities, acronyms, and synonyms, as well as keywords. With this approach, nodes of the taxonomy would represent the type of information sought, such as privacy or confidential information, records, duplicates, or ROT. The conceptTaxonomyManager component included in the conceptClassifier platform enables administrators to test clues in real time and define a threshold for classification based on clue and concept matching.
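The general idea of clue-based classification with a threshold can be sketched in a few lines. This is a toy model under stated assumptions, not the conceptClassifier implementation: the `Node` structure, the clue weights, and the plain substring matching are all hypothetical, and the real product matches on concepts rather than literal phrases.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A taxonomy node with weighted clues (hypothetical structure)."""
    name: str
    clues: dict[str, float]  # clue phrase -> weight
    threshold: float         # minimum score required to classify


def score(text: str, node: Node) -> float:
    """Sum the weights of the node's clues that appear in the text."""
    lowered = text.lower()
    return sum(w for clue, w in node.clues.items() if clue.lower() in lowered)


def classify(text: str, nodes: list[Node]) -> list[str]:
    """Return the names of all nodes whose score meets their threshold."""
    return [n.name for n in nodes if score(text, n) >= n.threshold]


taxonomy = [
    Node("Records", {"retention schedule": 0.6, "executed agreement": 0.5}, 0.5),
    Node("Privacy", {"social security number": 0.7, "date of birth": 0.4}, 0.7),
]

document = (
    "Executed agreement filed under the retention schedule; "
    "date of birth listed."
)
labels = classify(document, taxonomy)
print(labels)
```

Here the document reaches the Records threshold but not the Privacy threshold, since a single weak clue is not enough. Raising or lowering a node's threshold is the kind of tuning an administrator would test against live content before deployment.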

The second way to approach content optimization is to create separate taxonomies that will identify the type of content required. This approach has the advantage of providing a taxonomy for subject-matter experts that aligns with their functional roles, for example, records managers, legal assistants, and security managers. In this way, they remain familiar with the corpus of content that is directly related to their functional group, for future maintenance and workflow requirements. The two options are not exclusive.

The firm opted for the first option, to automatically tag all content and classify it to an enterprise taxonomy. From there, designated administrators could modify the whole taxonomy or be limited to modifying specific nodes within it. The taxonomy provides auditing, rollback, and node locking, to ensure that modifications are not overwritten.

For law firms, client protection and security are always major concerns. Concept Searching offers many types of classification rules, for a wide range of applications. These include regular expressions and natural language processing (NLP) integration, suitable for detecting many types of PII, PHI (as regulated under HIPAA), and PCI data. The solution is supplied with many of the common PII types, such as credit card numbers and US social security numbers, as well as over 80 additional policies.
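To make the regular-expression approach concrete, the following is a minimal sketch of pattern-based detection for two of the PII types mentioned. The patterns and function names are illustrative assumptions, not the rules shipped with the product. A Luhn checksum is added to reduce false positives on card-like digit runs.

```python
import re

# Illustrative patterns only; production rules need far broader coverage.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Check a candidate card number with the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_pii(text: str) -> dict[str, list[str]]:
    """Return SSN-like and Luhn-valid card-like strings found in text."""
    ssns = SSN_PATTERN.findall(text)
    cards = [
        m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())
    ]
    return {"ssn": ssns, "card": cards}


# Both values below are well-known dummy test numbers, not real data.
sample = "SSN 123-45-6789 on file; card 4111 1111 1111 1111 was charged."
found = find_pii(sample)
print(found)
```

Pattern matching of this kind only finds data that follows a known format, which is why it is typically combined with NLP and semantic metadata for sensitive content that has no fixed shape.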

One of the unique differentiators this client took advantage of was the ability to create or customize the identification of sensitive or confidential information contained within content, through the taxonomy manager interface. Authorized users can rapidly create their own patterns for detection, using any verbiage or descriptors. Since the Concept Searching technologies generate semantic, multi-term metadata from within content, inter- or intra-related content is also identified, even if it does not match the pattern specified. This is extremely important in protecting unique privacy or organizationally defined sensitive data, and it also surfaces relevant content that was previously unknown.

A significant additional benefit was the ability to improve search after performing the content optimization and migration to SharePoint. Since the corpus of content was cleansed, less information was stored than before. This improved search by weeding out duplicates, different versions, and content of no value. The automatically generated semantic metadata was ingested into SharePoint search, providing end users with the ability to perform semantic, concept-based search and improving the relevancy and accuracy of results.

The firm was able to achieve all its objectives, and migration was completed on time and on budget. Using the Concept Searching technologies, the corpus of content was greatly reduced, and migration was swift. In addition, the firm gained a lasting benefit in improved search.

  • Ability to cleanse terabytes of content, saving time and money
  • Defensible deletion, with full audit capability
  • Protection of data privacy and confidential information in real time, supporting client confidentiality and compliance
  • Significantly improved search performance, due to culling content of no value and the new ability to perform semantic searches


Concept Searching