So You Want to Do Text Analytics? Step 2: Creating the Content Set

Text Mining and Analytics

I am not sure that the term ‘content set’ is ever used, except by me, right now. I am sure everyone will now be talking about it! In our previous blog on text analytics, I talked about content optimization, which means, to be blunt, cleaning up the garbage that you have been saving for years.

The next step becomes more fun and frustrating. What content do we need, and can we find it? This is no easy task. One of our clients has over three million active documents with varying subjects. Obviously, time will not permit a manual review of the content. It's easier with structured big data, where fields are clearly defined and can be swapped in and out as needed if one is not quite right. Unstructured content offers no such fields.

Creating the content set should be greatly simplified, because content optimization has already eliminated the garbage, duplicates, and near duplicates; processed records that were never declared; and identified and protected private and sensitive information.

Whew. That was probably the hardest task to complete. In our case we would recommend the development of one or more taxonomies. A taxonomy component has the capability to group unstructured content, based on an understanding of concepts and ideas that share mutual attributes, while separating dissimilar concepts, regardless of where content is stored. This is part of the discovery process, as content that you never expected will surface and you can connect the dots on the information you are seeking.
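To make the grouping idea concrete, here is a minimal sketch in Python. It is not how our product works; it assumes a toy taxonomy in which each class is just a list of trigger phrases, and it matches phrases literally rather than understanding concepts. The class names, phrases, and document paths are all made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical taxonomy: each class maps to illustrative trigger phrases.
TAXONOMY = {
    "contracts": ["service agreement", "indemnification", "termination clause"],
    "hr": ["performance review", "onboarding", "leave of absence"],
}

def classify(text, taxonomy=TAXONOMY):
    """Return the sorted taxonomy classes whose phrases appear in the text."""
    lowered = text.lower()
    return sorted(
        name for name, phrases in taxonomy.items()
        if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases)
    )

def build_content_set(documents, taxonomy=TAXONOMY):
    """Group document ids by class, regardless of where each file is stored."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for name in classify(text, taxonomy):
            groups[name].append(doc_id)
    return dict(groups)

# Made-up documents from three different (hypothetical) repositories.
DOCUMENTS = {
    "sharepoint/a.docx": "Master Service Agreement with a termination clause.",
    "fileshare/b.txt": "Notes from the annual performance review.",
    "mail/c.msg": "Lunch menu for Friday.",
}

CONTENT_SET = build_content_set(DOCUMENTS)
```

The point of the sketch is that documents are grouped by what they say, not by where they live, and that anything matching no class simply stays out of the content set.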

What’s nice about honing the content set in a taxonomy, at least in ours, is real-time feedback. If you make a change, you can immediately see the impact on the content selected. If you can’t think of all word strings or keywords, then prompting is available that will find similar concepts to be included, which you can accept or reject. And of course there is rollback – not that any of us make mistakes … There is no massive set of training documents, and you can see the cause and effect of changes without reindexing. These features are enormously helpful in creating a usable and inclusive content set that contains the information that you need.
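For the curious, the feedback-and-rollback idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a rule is a set of keywords, matching is a plain case-insensitive substring test, and the class name, method names, and sample documents are hypothetical.

```python
class ContentSetRule:
    """Keyword rule with change history, so every edit can be undone."""

    def __init__(self, terms):
        # Each entry in the history is one immutable version of the rule.
        self._history = [frozenset(t.lower() for t in terms)]

    @property
    def terms(self):
        return self._history[-1]

    def select(self, documents):
        """Return ids of documents containing at least one current term."""
        return {doc_id for doc_id, text in documents.items()
                if any(term in text.lower() for term in self.terms)}

    def add_term(self, term, documents):
        """Add a term and report which documents the change brings in."""
        before = self.select(documents)
        self._history.append(self.terms | {term.lower()})
        return self.select(documents) - before

    def rollback(self):
        """Undo the most recent change, keeping the original rule intact."""
        if len(self._history) > 1:
            self._history.pop()

# Made-up sample documents.
DOCS = {
    "d1": "Quarterly pipeline inspection report",
    "d2": "Pipeline safety audit findings",
    "d3": "Sales pipeline forecast",
    "d4": "Valve corrosion survey",
}

rule = ContentSetRule(["inspection", "audit"])
initial = rule.select(DOCS)               # documents selected by the starting rule
gained = rule.add_term("pipeline", DOCS)  # immediate view of what the change adds
rule.rollback()                           # the sales forecast was not wanted
after_rollback = rule.select(DOCS)
```

Here adding "pipeline" pulls in a sales forecast alongside the engineering documents, which is exactly the kind of side effect you want to see right away and be able to undo.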

Et voilà. Now you can start performing analytics using your favorite tool. What do our clients use? Something as simple as Excel, which I still struggle with, or any business intelligence or artificial intelligence application, such as Power BI, Birst, Qlik, Azure Machine Learning Center, or Cortana Analytics Suite. If you are doing text analytics, which package do you use?

Real-life knowledge discovery scenarios, and the significant return on investment achieved, can be heard in the expert webinar recording ‘What You Don’t Know May Hurt You – Achieving Insight and Knowledge Discovery.’