Dealing with Legacy Data and Optimizing Your Content



“Inaccurate and unverified data can jeopardize the future of the organization.” – Accenture

There are really only two approaches to dealing with legacy data: ignore it, or address the problem and proactively manage the laborious cleanup and analysis process from inception onwards. Left unmanaged, legacy content can complicate governance, compliance, and security functions, preventing them from performing effectively.

At the most basic level, the performance and accuracy of enterprise search results are greatly impaired. From a risk mitigation perspective, legacy, dormant, and dark data complicate data discovery, because all content, regardless of its value, must be identified before it can be accurately classified and protected. Legacy data is an active and malignant source of enterprise risk.

Global market intelligence firm International Data Corporation (IDC) states that employees spend 2.5 hours per day duplicating or recreating work that has already been done, at a cost of $5,000 per employee per year. Other estimates suggest that 20 to 35 percent of an organization’s operating revenue is wasted recovering from process failures and information rework.

What do workers do when they can’t find information? They recreate it, use out-of-date content assets, interrupt co-workers to help them find it, start the task without the information they need, or simply don’t start at all.

What is interesting is that this topic represents a hidden cost of doing business, meaning most managers, executives, and IT teams are not aware it is happening. We have a client who was timing the retrieval of content during a proof of concept. In this particular test case, the content was a legal document, so it had to be found. It took the client a day and a half to find the document using its normal search procedures. This one search represented 12 hours of unproductive time spent finding one document, by one user. Another client analyzed its information retrieval accuracy, and determined that its knowledge workers had previously spent 40 percent of their time trying to find the ‘right’ information.

Looking at the broader picture, one wonders how many decisions are made using inaccurate information. Organizations consider information to be an asset, but often content cannot be reused or repurposed, simply because it cannot be found.


According to various surveys, 70 percent of content stored on file shares is redundant, obsolete, or trivial; 25 percent of content is duplicated; 10 percent has no business value; and an enormous 90 percent is never accessed after creation. Legal experts at the Compliance, Governance and Oversight Council (CGOC) Summit estimated that 69 percent of the data most organizations keep can, and should, be deleted. Digital security company Gemalto recently released the results of a global study revealing that 65 percent of organizations are unable to analyze all the data they collect, and only 54 percent of companies know where all their sensitive data is stored. Compounding this uncertainty, 68 percent of organizations admit they do not carry out all procedures in line with data protection laws, such as the General Data Protection Regulation (GDPR).

End users are hoarders, and business units start to panic when they think something is about to be deleted. Migration is known to be fraught with problems, many of them caused by moving unused content of no value that has no logical storage repository. The potential for exposure of sensitive information, migration of multiple copies of the same document, increased costs, inefficient use of resources, and unknown legal exposures are just a few of the issues that arise from legacy content. In most organizations, the IT team will migrate content regardless of its value, because the cleanup task is onerous and considered a low priority. This impacts data discovery and enterprise search, serving up many duplicates and versions of the same document as well as numerous false negatives and positives, resulting in erroneous classifications.

Added to the mix is the more recent problem caused by cloud storage. Seemingly under the impression that cloud storage is limitless, end users have dramatically increased their storage of personal data, such as pictures, music, and games, which the organization must now manage. Shadow IT, a growing issue, is typically dealt with when it is found, but instances of unsanctioned software and data can permeate an organization. New legal directives now require organizations to identify and track all social communications as records and as approved electronically stored information (ESI).


To clean up an organization’s corpus of content, the contextual meaning of each document needs to be evaluated to determine its value. This cannot be done manually: the number of documents is too great, and human review and decision making are inconsistent, unreliable, and costly. If a conventional tool is used, it returns erroneous answers, either because it relies on poor metadata entered by end users or because it cannot identify the context within the content. With an influx of data beyond an organization’s control, there are few alternatives for resolving the issue.

Content optimization is Concept Searching’s approach to providing detailed information on the ‘content in context’ found in file shares, emails, and attachments, or alternatively in a data discovery scenario, while identifying specific subjects or topics, such as privacy or sensitive information vulnerabilities. This unique capability enables administrators to specify verbiage in the form of phrases, rather than keywords, within a taxonomy.

The conceptClassifier platform generates multi-term metadata and classifies it against one or more taxonomies. The platform performs a detailed file analysis and content inventory. Based on classification decisions, action is taken on the content, enabling it to be either managed in place or automatically moved to a more appropriate repository. Content optimization identifies dark data, redundant, obsolete, and trivial (ROT) content, data of no value, copies, multiple versions, privacy and sensitive information violations, undeclared records, and compliance violations. The output provides the exceptions, and administrators determine the future status of the corpus of content and data. It has the added benefit of reducing the server footprint: one client reduced its footprint from 57 servers to 4, representing substantial savings.
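As a rough sketch of what this kind of classification-and-action workflow looks like, the example below matches a document against phrase-based taxonomy topics and then chooses a disposition. The taxonomy entries, age threshold, and action names are hypothetical illustrations made up for the example, not the conceptClassifier API.

```python
# Illustrative sketch only: the taxonomy, matching rule, and action names are
# hypothetical and do not represent the conceptClassifier API.
from dataclasses import dataclass, field

# A taxonomy topic is defined by multi-word phrases rather than single keywords.
TAXONOMY = {
    "Privacy/PII": ["social security number", "date of birth", "credit card number"],
    "HR/Payroll": ["payroll register", "employee compensation", "staff salary"],
    "ROT": ["draft - do not distribute", "superseded version", "out of date"],
}

@dataclass
class ClassificationResult:
    topics: list = field(default_factory=list)
    action: str = "manage_in_place"

def classify(text: str, last_accessed_days: int) -> ClassificationResult:
    """Assign taxonomy topics based on phrase matches, then choose an action."""
    lowered = text.lower()
    result = ClassificationResult()
    for topic, phrases in TAXONOMY.items():
        if any(phrase in lowered for phrase in phrases):
            result.topics.append(topic)

    # Example action rules: sensitive content is moved to a secure repository,
    # ROT or long-untouched content is queued for review, everything else stays put.
    if "Privacy/PII" in result.topics:
        result.action = "move_to_secure_repository"
    elif "ROT" in result.topics or last_accessed_days > 365 * 3:
        result.action = "queue_for_disposition_review"
    return result

print(classify("Payroll register with staff salary details", last_accessed_days=10))
```

The point of the sketch is that the classification decision, rather than an end user, determines whether content stays in place, is quarantined, or is queued for disposition review.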

Because the technology automatically categorizes content when it is created or ingested, extracting its contextual meaning as multi-term metadata that is fed to the search engine index, enterprise search is significantly improved, and data discovery and subsequent classification are more precise and accurate.
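The miniature sketch below illustrates that idea: classifier-generated topic metadata is stored in the search index alongside the document body, so a query can match on topics as well as raw text. The topic phrases, field names, and index structure are assumptions made for the example, not product behavior.

```python
# Minimal sketch, assuming a trivial in-memory index; field names and topic
# phrases are illustrative only.
TOPIC_PHRASES = {
    "HR/Payroll": ["payroll register", "staff salary"],
    "Privacy/PII": ["social security number", "credit card number"],
}

def extract_topics(body: str) -> list:
    """Return taxonomy topics whose phrases appear in the document body."""
    lowered = body.lower()
    return [t for t, phrases in TOPIC_PHRASES.items() if any(p in lowered for p in phrases)]

def build_index(documents: dict) -> dict:
    """Index each document with its body plus its generated multi-term metadata."""
    return {
        doc_id: {"body": body, "topics": extract_topics(body)}
        for doc_id, body in documents.items()
    }

def search(index: dict, query: str) -> list:
    """Match the query against document bodies and generated topic metadata."""
    q = query.lower()
    return [
        doc_id for doc_id, fields in index.items()
        if q in fields["body"].lower() or any(q in t.lower() for t in fields["topics"])
    ]

idx = build_index({"doc-1": "Q3 payroll register for all departments"})
print(search(idx, "payroll"))  # ['doc-1'] - matched via both body text and topic metadata
```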

Cleansing Email
Pundits and analysts predict that all the collaboration options adopted by organizations will reduce the email glut. But it is not happening. Did you know that, on average, US employees have 199 unopened or unread emails at any given time?

Email contains not only useless content but also risky content that should be protected, deleted, or have its access rights changed. Over 90 percent of cybersecurity attacks start with email. Organizations that do not address the issues of email glut, security, and archiving face serious risk. The problem has been identifying, at the contextual level, what is in the content of emails and their attachments, and then applying content lifecycle management for the disposition or archiving of that content.

For example, let us assume you are trying to identify any data privacy exposures in your emails and their attachments. Concept Searching provides over 80 standard descriptors out of the box, meaning items such as PII, PHI, and credit card numbers will be identified. But in this example, we will assume you are trying to find any exposures that have to do with ‘payroll.’ Using the optional component conceptTaxonomyWorkflow, a rule can easily be deployed that instructs the insight engine to find any emails or attachments containing the word payroll. All content containing the word payroll will be returned, and because the insight engine understands the term conceptually, the results may also include references to employees’ wages, compensation, or staff salaries. This eliminates the need to ‘know’ the correct term to retrieve relevant results. It is a thorough and detailed approach that eliminates errors while remaining controlled and managed by an administrator.
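As a rough illustration of how such a rule behaves, the sketch below expands ‘payroll’ to related wording and flags any email or attachment that uses it. The rule structure and synonym list are illustrative assumptions, not conceptTaxonomyWorkflow syntax.

```python
# Hypothetical example: a single 'payroll' rule expanded to related wording,
# applied across email bodies and attachment text. Not product syntax.
PAYROLL_TERMS = {"payroll", "wages", "compensation", "staff salary", "salary"}

def email_matches_payroll(email: dict) -> bool:
    """Return True if the email body or any attachment uses payroll-related wording."""
    texts = [email.get("body", "")] + [a.get("text", "") for a in email.get("attachments", [])]
    return any(term in text.lower() for text in texts for term in PAYROLL_TERMS)

messages = [
    {"id": "msg-001", "body": "Updated staff salary bands attached.", "attachments": []},
    {"id": "msg-002", "body": "Lunch on Friday?", "attachments": []},
]
flagged = [msg["id"] for msg in messages if email_matches_payroll(msg)]
print(flagged)  # ['msg-001']
```

The administrator still controls the rule; the term expansion simply means relevant content is not missed because someone wrote ‘staff salary’ instead of ‘payroll.’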


Information governance is slowly becoming a boardroom issue. Shrinking budgets and rising operating costs, coupled with a new emphasis on data accountability, are becoming the catalysts that will ultimately push organizations to gain the insight needed to proactively manage all digital assets to derive the most value from them.

Although content optimization is rarely considered a priority, it does impact the bottom line. Unoptimized legacy content can mean a costly server footprint that is no longer needed, increased risk in eDiscovery and litigation, significantly diminished enterprise search accuracy, and a greater risk of unidentified and unprotected security and compliance violations requiring remediation.

The traditional approach of making end users responsible for content maintenance is no longer viable. Arbitrarily deleting content by date or other variables disrupts the flow of content lifecycle management and impacts records management. Unmanaged data has far-reaching consequences that are not readily apparent.

Those organizations that address the issues associated with legacy data, and put the processes in place to proactively manage all data assets, will achieve greater information transparency and be better positioned to transition to operational excellence.

