Big Data – Has Its Time Come? Welcome, Big Content
I recently read an article in which Darin Stewart from Gartner substituted ‘content’ for ‘data’ – so we now have big content. I don’t know what’s going to happen to text analytics and its assorted names, but I don’t mind big content. Analyzing and extracting content from what is sometimes a large number of documents does need a better name. So be it: I will also adopt big content.
Where do we begin? What exactly is big content? Gartner defines unstructured data as content that does not conform to a specific, pre-defined data model. Despite the stubborn approach of software vendors, content is typically generated by humans. It can be a valuable asset, but its value is rarely extracted or used to solve business problems.
Coming under this umbrella are documents, spreadsheets, presentations, email, and web content – basically, any output that is text based. In many organizations, unstructured content is a mess; in those where it is not, these content types are managed. For example, email is managed, monitored, and archived – we hope. The same is true of documents, which are processed from inception to archive via an enterprise content management platform. These solutions are mature, but they focus on managing content assets and are not designed for analysis, exploration, or extracting value.
A taxonomy helps immensely when building a repository, because it identifies the subjects and verbiage in use. Once the taxonomy is in place, content should be classified against it automatically, eliminating end-user involvement – we are still only human.
How is this accomplished? Our software, and that of other vendors, can generate metadata. But our differentiator is the generation of multi-term metadata based on the concepts – subjects, topics, and phrases – that users have identified, or that have been found, through the classification process. The added benefit, which is huge, is that this process significantly improves search, as users can now search on concepts.
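To make the idea of multi-term metadata concrete, here is a minimal Python sketch. It is not how our software or any other vendor's product works internally. It assumes that a phrase of two or three words repeated within a document is a candidate concept. The stopword list, the function names, and the sample documents are all hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a vendor resource).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on", "with", "by"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def multi_term_phrases(text, max_len=3):
    """Count phrases of 2..max_len words that neither start nor end with a stopword."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return counts

def tag_document(text, min_count=2):
    """Return phrases seen at least min_count times, as candidate metadata."""
    return sorted(p for p, c in multi_term_phrases(text).items() if c >= min_count)

def concept_search(tagged_docs, concept):
    """Return the names of documents whose metadata contains the concept."""
    return sorted(name for name, tags in tagged_docs.items() if concept in tags)

# Hypothetical sample documents.
docs = {
    "memo.txt": "The supply chain risk review is late. Supply chain risk is rising.",
    "note.txt": "Customer churn went up. We must reduce customer churn.",
}
tagged = {name: tag_document(text) for name, text in docs.items()}
hits = concept_search(tagged, "supply chain risk")
```

In this toy run, memo.txt is tagged with the phrase "supply chain risk", so a search on that concept returns it even though no single keyword was chosen in advance. A real classifier would also handle word variants, sentence boundaries, and terms supplied by business users.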
Once that has been done, and it is a relatively fast process, content can technically be classified straight out of the box. Our software has been designed for business rather than technical users. Why? Because business users are closest to the content, and their job functions may employ terms that the IT team is not aware of.
The taxonomies will reflect the unique verbiage found within an organization’s corpus of content. This is important because pre-defined taxonomies or industry solutions will not know language that is specific to an organization, and changing them is difficult. Once the taxonomy has been refined, it can be exported to any analysis tool of choice, such as Excel, or to any business intelligence or artificial intelligence application. Our clients use Power BI, Birst, Qlik, Azure Machine Learning Center, and Cortana Analytics Suite.
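As a rough illustration of the export step, the sketch below flattens a small taxonomy into CSV rows that Excel or a BI tool could open. The nested-dictionary layout, the column names, and the sample terms are assumptions made for this example. They do not describe the export format of any particular product.

```python
import csv
import io

# Hypothetical taxonomy: each key is a term, each value holds its narrower terms.
taxonomy = {
    "Risk": {
        "Supply chain risk": {},
        "Credit risk": {},
    },
    "Customers": {
        "Customer churn": {},
    },
}

def flatten(tree, parent="", level=1):
    """Yield (parent, term, level) rows in depth-first order."""
    for term, children in tree.items():
        yield parent, term, level
        yield from flatten(children, term, level + 1)

def to_csv(tree):
    """Serialize the taxonomy as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["parent", "term", "level"])
    writer.writerows(flatten(tree))
    return buffer.getvalue()

rows = list(flatten(taxonomy))
csv_text = to_csv(taxonomy)
```

Writing csv_text to a file with a .csv extension gives one row per term, with its parent and depth, which is usually enough for a spreadsheet pivot or a first import into an analytics tool.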
Real-life knowledge discovery scenarios, and the significant return on investment achieved, can be heard in the expert webinar recording ‘What You Don’t Know May Hurt You – Achieving Insight and Knowledge Discovery.’