Auto-classification – Not sure why this is a hard concept to understand?
Last year we did a ‘non-vendor pitch’ webinar on metadata, auto-classification, and taxonomies. The recorded webinar can be accessed here if anyone is interested. We had a huge response, and it was our most attended webinar of the year. What I have found is that when you say ‘taxonomies’, people either only vaguely understand what they are or perceive them as outdated rather than a revolutionary new technology. That perception may have some basis, since taxonomies do go way back to Aristotle. I digress.
To revisit the subject, classification techniques can be content based, request oriented, or policy based.
Content based classification is classification in which the weight given to particular subjects in a document determines the class (the category in the taxonomy) to which the document is assigned. It is, for example, a rule in much library classification that at least 20% of the content of a book should be about the class to which the book is assigned. In automatic classification, the weight could be the number of times given words appear in a document.
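To make the word-count idea concrete, here is a minimal sketch in Python. The taxonomy, its terms, and the function name are made up for illustration. Reusing the 20% library rule as a cut-off for word share is my own assumption, not an established practice.

```python
import re
from collections import Counter

# Hypothetical taxonomy: category -> indicative terms (illustrative only).
TAXONOMY = {
    "finance": {"invoice", "payment", "budget", "tax"},
    "legal": {"contract", "clause", "liability", "court"},
}

def classify_by_content(text, taxonomy=TAXONOMY, min_share=0.20):
    """Return the category whose terms make up the largest share of the
    document's words, or None if no category reaches min_share."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return None
    counts = Counter(words)
    # Share of the document's words that belong to each category.
    shares = {
        category: sum(counts[term] for term in terms) / len(words)
        for category, terms in taxonomy.items()
    }
    best = max(shares, key=shares.get)
    # Assumption: borrow the 20% library rule as the minimum share.
    return best if shares[best] >= min_share else None

print(classify_by_content("The invoice lists the payment and the tax due"))
```

A real system would weight terms rather than count them equally, but the principle is the same: the document goes where most of its content points.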
Request oriented classification (or indexing) is classification in which the anticipated requests from users influence how documents are classified. The classifier asks himself “Under which descriptors should this entity be found?” and tries to “think of all the possible queries and decide for which ones the entity at hand is relevant”.
Request oriented classification may be classification that is targeted towards a particular audience or user group. For example, records may be classified or indexed differently from marketing content. It is probably better, however, to understand request oriented classification as policy based classification: the classification is done according to some ideals and reflects the purpose of the classification. In this way it is not necessarily a kind of classification or indexing based on the audience. Only if empirical data about use or users are applied should request oriented classification be regarded as a user-based approach.
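As a rough illustration of “think of all the possible queries”, the sketch below indexes a document under every anticipated query it could answer. The audiences, the queries, and the all-words-must-appear matching rule are hypothetical simplifications of mine, not a description of any product.

```python
import re

# Hypothetical anticipated queries per audience (illustrative only).
ANTICIPATED_QUERIES = {
    "records": ["retention schedule", "contract expiry"],
    "marketing": ["product launch", "customer story"],
}

def descriptors_for(text, audience, queries=ANTICIPATED_QUERIES):
    """Return the anticipated queries for this audience that the document
    could answer, i.e. those whose every word occurs in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [q for q in queries.get(audience, []) if set(q.split()) <= words]

doc = "The contract reaches expiry in June, ahead of the product launch."
print(descriptors_for(doc, "records"))
print(descriptors_for(doc, "marketing"))
```

The same document ends up under different descriptors depending on who is expected to ask for it, which is the point of the request oriented approach.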
There are three types of methods to manage auto-classification: Supervised, Unsupervised, and Semi-supervised.
In Supervised classification, some external mechanism, such as human feedback, provides information on the correct classification. This is manually intensive: think millions of documents.
In Unsupervised classification, also known as document clustering, the classification is done without reference to external information. It is the approach usually associated with neural networks and artificial intelligence.
In Semi-supervised classification, only part of the documents are labelled by an external mechanism such as human intervention, and the system works out the rest.
Semi-supervised is ideal, as the heavy lifting is done by the system and humans can tweak it.
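One common way to get that division of labour is self-training: humans label a handful of documents, the system labels whatever it is confident about, and the leftovers go back to a person. The sketch below is a toy version under my own assumptions (word-count profiles, an 0.8 confidence threshold, invented function names). It is not how any particular product works.

```python
import re
from collections import Counter, defaultdict

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def train(labelled):
    """Build one word-count profile per label from (text, label) pairs."""
    profiles = defaultdict(Counter)
    for text, label in labelled:
        profiles[label].update(tokens(text))
    return profiles

def predict(profiles, text):
    """Return (label, confidence): confidence is the winning label's share
    of all profile matches, or (None, 0.0) when nothing matches."""
    scores = {label: sum(profile[w] for w in tokens(text))
              for label, profile in profiles.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total

def self_train(labelled, unlabelled, threshold=0.8, max_rounds=5):
    """Grow the labelled set with confident machine labels. Return the
    final labelled set and the documents left for human review."""
    labelled, pending = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        profiles = train(labelled)
        still_pending = []
        for text in pending:
            label, confidence = predict(profiles, text)
            if confidence >= threshold:
                labelled.append((text, label))  # the system does the lifting
            else:
                still_pending.append(text)      # a human will tweak these
        if len(still_pending) == len(pending):
            break  # nothing new was labelled this round
        pending = still_pending
    return labelled, pending

seed = [("invoice payment overdue", "finance"),
        ("contract clause liability", "legal")]
docs = ["payment received for invoice",
        "liability clause in the contract",
        "team lunch on friday"]
labelled, for_review = self_train(seed, docs)
print(for_review)
```

Note that this toy version lets common words such as “the” leak into the profiles as it learns. A real system would filter or down-weight them.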