Auto-classification – where’s the confusion?
Auto-classification is a catch-all term for technology that scans the contents of a document and automatically assigns categories and keywords based on those contents. The issue, of course, is that not all classification technologies are created equal. It remains up to the organization to select the classification system whose underlying technology best supports its business objectives. For example, some automatic classification products do not allow the assigned classifications to be altered. Some require large training sets, and those that rely on rule building need multiple iterations, with rules created and maintained over time to improve accuracy. Others require outside application specialists, learning new languages, and integration work before they can be used to deploy intelligent metadata applications, if they can be integrated at all. Some are restricted to certain platforms, such as being available in SharePoint but not in Office 365. Some capture metadata, some capture semantic metadata, and some use some form of algorithm; some of these are tweakable, and with others what you get is what you get, with no changes possible.
The optimal auto-classification component can be used in real time or on a scheduled basis. Its primary advantage is the ability to auto-classify content regardless of where it lives: because there are no restrictions on the source repository to be classified, information silos and disconnected systems are eliminated. By inheriting the security of the organizational platform, for example SharePoint, it prevents unauthorized access and, if required for security, restricts the portability of content assets when users access classified content. Native integration with enterprise platforms removes integration issues and reduces, if not eliminates, the learning curve typically required for planning and deployment. It also eliminates end-user training, as it becomes a transparent background process.
Content is dynamic, and the taxonomy should be flexible enough to change as business strategies and content structures change. The classification process adapts to the organization as content is changed, moved, or deleted. The taxonomy, coupled with automated classification, forms the foundation for realizing the benefits of information governance; in fact, all content-centric applications will realize business benefits by leveraging the capabilities of the taxonomy.
In our software, the automated classification process identifies, during indexing, the categories that each document belongs to. Each category is identified by a unique descriptor and is associated with key descriptive words and/or phrases held in the database. This approach enables rapid implementation of a corporate taxonomy, with all documents classified to multiple nodes at index time. Ideally, the taxonomy can then be used to browse the document collection or as a filter when running ad hoc searches. Other vendors do it differently.
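To make the idea concrete, here is a minimal sketch of index-time classification against a taxonomy. It assumes plain whole-word phrase matching, which is a stand-in of my own and not a description of how our product or any other vendor's actually matches content. Every name in it (Category, classify, build_index, and the sample descriptors and documents) is hypothetical and exists only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Category:
    descriptor: str      # unique identifier for the taxonomy node (hypothetical)
    phrases: list[str]   # key descriptive words/phrases tied to that node


def _normalize(text: str) -> str:
    """Lowercase the text and reduce it to space-separated word tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def classify(text: str, taxonomy: list[Category]) -> list[str]:
    """Return the descriptor of every category with a phrase found in the text."""
    padded = f" {_normalize(text)} "  # padding forces whole-word matches
    matches = []
    for category in taxonomy:
        for phrase in category.phrases:
            needle = _normalize(phrase)
            if needle and f" {needle} " in padded:
                matches.append(category.descriptor)
                break  # one matching phrase is enough for this category
    return matches


def build_index(documents: dict[str, str],
                taxonomy: list[Category]) -> dict[str, set[str]]:
    """Classify each document once, at index time, into zero or more nodes."""
    index: dict[str, set[str]] = {c.descriptor: set() for c in taxonomy}
    for doc_id, text in documents.items():
        for descriptor in classify(text, taxonomy):
            index[descriptor].add(doc_id)
    return index


# Illustrative data only; not taken from any real taxonomy.
TAXONOMY = [
    Category("HR-001", ["employment contract", "onboarding"]),
    Category("FIN-002", ["invoice", "purchase order"]),
    Category("LEG-003", ["non-disclosure agreement", "contract"]),
]

DOCUMENTS = {
    "doc1": "Signed employment contract for the new hire.",
    "doc2": "Invoice 4471 against purchase order 88.",
    "doc3": "Meeting notes about the cafeteria menu.",
}

INDEX = build_index(DOCUMENTS, TAXONOMY)
```

Once built, the index supports both uses mentioned above: looking up INDEX["LEG-003"] browses the documents under that node, and intersecting that set with ad hoc search results acts as a filter. Note that doc1 lands in two nodes, which is the multiple-node classification described earlier.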
Currently, the marketplace is becoming more savvy about auto-classification, and the term is now being bandied about by the media. I still feel there is much to explore in auto-classification, and more education is needed for organizations to fully leverage unstructured content by deploying an enterprise framework for metadata, auto-classification, and taxonomies.