Once the firm decided to migrate to a cloud-based legal content management solution, it realized the process would not be simple. Its corpus of content had become unmanageable over the years because of growth and mergers. With terabytes of content, manually evaluating each document for disposition was not feasible. Most other tools generate primarily system metadata, such as name, date, and title. That would not work for this firm, which needed access to what the documents actually contained, not a summary of information. It also needed a method of defensible deletion, with full audit capability. Although content optimization had to be done, the primary objective was to migrate to the SharePoint environment.
The challenges faced included:
- No way to determine the content within each document for analysis
- No ability to identify privacy or sensitive information that needed special processing
- No facility to determine whether a document should be archived or deleted
- No ability to identify undeclared records
- No way to identify and safely delete redundant, obsolete, or trivial (ROT) content
The firm chose Concept Searching’s content optimization solution, using the conceptClassifier platform, conceptClassifier for SharePoint Online, and conceptTaxonomyWorkflow. The firm had vast amounts of content stored in file shares, with replicas of documents spread across servers, multiple versions of the same documents, and content of no value, all contributing to the terabytes of information. Manual assessment was not possible, and most tools did not provide the type of analysis the firm required.
The content optimization solution identifies duplicates, versions, and ROT content, but it goes well beyond basic cleanup. The process also identifies data privacy or organizationally defined sensitive information, undeclared or erroneously tagged records, and noncompliance exceptions. These additional capabilities enabled the firm to identify sources of risk, some of them previously unknown, and to significantly reduce both the amount of content to be migrated and the server footprint required.
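To make the duplicate-identification step concrete, the following is a minimal sketch of one common technique, grouping files by a hash of their bytes. It is not a description of how the Concept Searching products work internally; the function name `group_exact_duplicates` and the sample paths are hypothetical, and the approach only catches byte-for-byte copies, not near-duplicates or versions.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(documents):
    """Group document names by the SHA-256 digest of their bytes.

    `documents` maps a (hypothetical) document path to its raw content.
    Only groups with more than one member are returned.
    """
    by_digest = defaultdict(list)
    for name, content in documents.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Illustrative sample data, not taken from the case study.
sample = {
    "server1/contract.docx": b"Master services agreement, final",
    "server2/contract_copy.docx": b"Master services agreement, final",
    "server1/memo.txt": b"Lunch menu for Friday",
}
duplicate_groups = group_exact_duplicates(sample)
```

In this sample, the two contract files share a digest and form one duplicate group, while the memo is left alone. Detecting different versions of the same document would need a similarity measure rather than an exact hash.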
There are two ways to initiate this process. The first is to automatically generate semantic metadata from the content and auto-classify the content to an enterprise taxonomy. The taxonomy component automatically creates an initial set of classification clues (semantic metadata) for the taxonomy. Administrators can tune the classification clues, create new ones on the fly, and quickly run test scenarios against their live content prior to deployment. This approach has the advantage of surfacing content that was not previously known to exist. Clues, whether system-generated or entered by an administrator, can consist of a single word or a string of words, entities and custom entities, acronyms, and synonyms, as well as keywords. With this approach, the nodes of the taxonomy represent the type of information sought, such as privacy or confidential information, records, duplicates, or ROT. The conceptTaxonomyManager component included in the conceptClassifier platform enables administrators to test clues in real time and to define a threshold for classification based on clue and concept matching.
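The idea of classifying against a threshold based on clue matching can be illustrated with a small sketch. This is an assumption-laden toy, not the conceptClassifier scoring model: the node names, clue phrases, weights, and the functions `score_document` and `classify` are all hypothetical, and matching is plain substring search rather than concept matching.

```python
def score_document(text, clues):
    """Sum the weights of every clue phrase that occurs in the text."""
    lowered = text.lower()
    return sum(weight for phrase, weight in clues.items() if phrase in lowered)


def classify(text, taxonomy, threshold):
    """Return the taxonomy nodes whose clue score meets the threshold."""
    return sorted(
        node
        for node, clues in taxonomy.items()
        if score_document(text, clues) >= threshold
    )


# Hypothetical nodes and clue weights, for illustration only.
taxonomy = {
    "Privacy": {"social security number": 60, "date of birth": 30, "passport": 30},
    "Records": {"retention schedule": 50, "executed agreement": 50},
}

text = "Client intake form: date of birth, passport and social security number."
matched_nodes = classify(text, taxonomy, threshold=50)
```

Raising or lowering `threshold` trades precision against recall, which is the kind of tuning an administrator would want to test against live content before deployment.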
The second way to approach content optimization is to create separate taxonomies, each identifying a required type of content. This approach has the advantage of giving subject-matter experts, such as records managers, legal assistants, and security managers, a taxonomy that aligns with their functional roles. In this way, they remain familiar with the corpus of content directly related to their functional group, for future maintenance and workflow requirements. The two options are not mutually exclusive.
The firm opted for the first option: automatically tagging all content and classifying it to an enterprise taxonomy. From there, designated administrators could modify the whole taxonomy or be limited to modifying specific nodes within it. The taxonomy provides auditing, rollback, and node locking, to ensure that modifications are not overwritten.
For law firms, client protection and security are always major concerns. Concept Searching offers many types of classification rules for a wide range of applications. These include regular expressions and natural language processing (NLP) integration, suitable for detecting many types of personally identifiable information (PII), protected health information (PHI), and data subject to HIPAA and PCI requirements. The solution is supplied with many of the common PII types, such as credit card numbers and US social security numbers, as well as over 80 additional policies.
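As an illustration of regular-expression-based detection, here is a minimal sketch that flags US social security numbers and credit card numbers in text. The patterns are deliberately simplified and are assumptions for this example, not the rules shipped with the product; the names `SSN_PATTERN`, `CARD_PATTERN`, `luhn_valid`, and `find_pii` are hypothetical. Card candidates are filtered with the standard Luhn checksum to reduce false positives.

```python
import re

# Simplified patterns; assumptions for illustration, not the product's rules.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_pii(text):
    """Return (kind, matched_text) pairs for SSN-like and card-like strings."""
    findings = [("ssn", m.group()) for m in SSN_PATTERN.finditer(text)]
    findings += [
        ("card", m.group())
        for m in CARD_PATTERN.finditer(text)
        if luhn_valid(m.group())
    ]
    return findings


# 4111 1111 1111 1111 is a widely used test card number, not a real account.
findings = find_pii("SSN 123-45-6789, card 4111 1111 1111 1111 on file")
```

A production rule set would add context checks (nearby words such as "SSN" or "card"), range validation, and many more patterns; this sketch shows only the basic pattern-plus-checksum idea.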
One of the differentiators this client took advantage of was the ability to create or customize, through the taxonomy manager interface, the identification of sensitive or confidential information contained within content. Authorized users can rapidly create their own detection patterns, using any verbiage or descriptors. Because the Concept Searching technologies generate semantic, multi-term metadata from within content, inter- or intra-related content is also identified, even if it does not match the pattern specified. This is extremely important in protecting unique privacy data or organizationally defined sensitive data, and it also surfaces relevant content that was previously unknown.
A significant additional benefit was improved search after the content optimization and migration to SharePoint. Because the corpus was cleansed of duplicates, different versions, and content of no value, there was less information to search through, and results improved accordingly. The automatically generated semantic metadata was ingested by SharePoint search, giving end users semantic, concept-based search and improving the relevancy and accuracy of results.