False Negatives and Positives – Two Ways, and I Can Bet Which One You Will Pick
I will say right up front that I could never attempt to manage the number of false positives or negatives manually; it is simply not possible. Hopefully, you never have to attempt it either. If you struggle with this problem, or are just interested, here are two ways to apply weighting to improve classification.
In the world of data discovery and classification, false negatives and positives are extremely important. According to just about everybody except me, the recommendation is that your hierarchy should be no more than six levels deep. This minimalist approach has partly evolved from the limitations of tools: many products still do not use classification, and it is simply not feasible to manually categorize each discrete piece of content.
For those who attempt a manual approach, classifying unstructured and semi-structured content poses problems of accuracy and thoroughness, and it requires human involvement, which is subjective at best and highly unproductive. The premise of this categorization approach is that restricting classification to fewer categories will decrease false positives and false negatives, producing alerts of higher fidelity.
Harking back to my days with enterprise search tools, the two key measures were, and still are, precision and recall. Precision means that only relevant content is returned for a query; recall means that all documents that might be related to the query are returned.
In the search world, precision and recall should be balanced. In the security world, the two factors control the number of false positives and false negatives. Depending on the data, the optimum solution may not be to equalize them, but to adjust the classification to weight precision (accuracy) more heavily than recall, or vice versa.
One of the shortcomings in data classification is false positives and negatives. A false positive occurs when the classification technology mistakenly predicts the positive class. Assume the classifier inferred that a particular email message was spam (the positive class) when the message was not spam. The counts of false positives and false negatives feed into the calculations for precision and recall.
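To make the arithmetic concrete, here is a minimal Python sketch, using made-up spam labels, that counts the false positives and false negatives and derives precision and recall from them:

```python
# Minimal sketch: counting false positives/negatives for a binary
# spam classifier. The labels and predictions are illustrative only.

actual    = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = spam (positive class)
predicted = [1, 1, 1, 0, 0, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # of everything flagged as spam, how much really was
recall    = tp / (tp + fn)  # of all real spam, how much was flagged

print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.75
```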
For a discovery application, taxonomy administrators can assign a weight to the classification, depending on whether they want to favor precision or recall. Typically, the goal is to keep them balanced for a search application. For the identification of personally identifiable information (PII), protected health information (PHI), or other sensitive information, or for compliance applications, an organization may opt for higher recall.
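To see what opting for higher recall can look like in practice, here is a small sketch; the scores, labels, and thresholds are all invented, and whether a given product exposes a threshold this directly will vary. Lowering the decision threshold flags more borderline documents, which raises recall at the cost of precision:

```python
# Sketch: trading precision for recall by lowering the decision threshold.
# Scores are hypothetical classifier confidences that a document contains PII.

scores = [0.95, 0.80, 0.62, 0.55, 0.40, 0.30, 0.10]
actual = [1,    1,    0,    1,    1,    0,    0]  # 1 = really contains PII

def flag(threshold):
    return [1 if s >= threshold else 0 for s in scores]

for threshold in (0.7, 0.5, 0.3):
    predicted = flag(threshold)
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    print(threshold, "precision:", tp / (tp + fp), "recall:", tp / (tp + fn))
```

At the lowest threshold in this toy example, recall reaches 1.0 while precision drops to about 0.67, which is exactly the trade an organization hunting for PII might accept.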
Although the conceptClassifier platform can do this easily through the conceptTaxonomyManager component, if you are using a different product, security staff can do this manually, or the solution may be semi-automated.
In cases where the objective is to find an optimal blend of precision and recall, the two metrics can be combined using what is called the F1 score. The F1 score is the harmonic mean of precision and recall, taking both metrics into account in the following equation:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
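A short sketch, with assumed precision and recall values, shows why the harmonic mean is used here: it drags F1 toward the weaker of the two metrics, so a classifier cannot hide poor recall behind excellent precision.

```python
# Sketch: F1 as the harmonic mean of precision and recall.
# The harmonic mean drags the score toward the weaker metric,
# unlike an arithmetic average.

def f1(precision, recall):
    return 2 * (precision * recall) / (precision + recall)

print(f1(0.9, 0.9))  # 0.9  -- balanced, F1 matches both
print(f1(0.9, 0.1))  # 0.18 -- a plain average would report 0.5
```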
Although metrics such as recall and precision may seem dated, they remain effective when tuning imbalanced classifications. Statistics provides the formal definitions and equations for recall, precision, F1, and the receiver operating characteristic (ROC) curve.
ROC curves were developed for signal detection in radar returns in the 1950s, and have since been applied to a wide range of problems. The approach is best suited to binary classification, that is, problems with only two possible outcomes, and to classifiers that output a score or probability that can be thresholded. Neural networks and many statistical algorithms qualify; approaches such as plain decision trees, which output only a hard class label rather than a score, are less suited. Any data that can be fed into an appropriate classifier can be subjected to ROC curve analysis.
For a perfect classifier, the ROC curve goes straight up the Y axis and then along the top of the plot, parallel to the X axis. A classifier with no discriminating power sits on the diagonal, while most classifiers fall somewhere in between.
“ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution.” (Wikipedia, “Receiver operating characteristic”)
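To illustrate the shape of the curve, the following sketch, with invented scores and labels, sweeps a threshold across classifier scores, computes the true and false positive rates at each step, and estimates the area under the curve with the trapezoid rule:

```python
# Sketch: tracing a ROC curve from classifier scores.
# Each threshold yields one (FPR, TPR) point; a perfect classifier
# hugs the left and top edges, a powerless one sits on the diagonal.

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
actual = [1,   1,   0,   1,   0,   1,   0,   0]

pos = sum(actual)
neg = len(actual) - pos

points = [(0.0, 0.0)]
for threshold in sorted(set(scores), reverse=True):
    tp = sum(1 for s, a in zip(scores, actual) if s >= threshold and a == 1)
    fp = sum(1 for s, a in zip(scores, actual) if s >= threshold and a == 0)
    points.append((fp / neg, tp / pos))  # (FPR, TPR)

# Area under the curve via the trapezoid rule.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(points)
print(f"AUC = {auc:.3f}")  # 0.812 on this toy data
```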
The second method is using our conceptClassifier platform. Many of our taxonomy features are still unique in the market, but whatever you choose, you should be looking for an automated solution. In the conceptClassifier platform, this is accomplished through automatic document movement feedback, which enables taxonomy administrators to see the cause and effect of changing the clue weightings for a node in the taxonomy.
Users can also search within the refined node and bring back documents from the whole corpus, now classified against the node. The system indicates whether a change has increased or reduced the score, and identifies documents that will no longer be classified as well as new documents that will be. The advantage of a tool such as conceptTaxonomyManager is its real-time capability, which lets taxonomy administrators see the impact on classifications without re-indexing.
This feature is also used with working sets. Working sets enable administrators to tune the rules within conceptTaxonomyManager to include or exclude certain documents, ensuring that false positives are not classified and that the right documents are. Administrators can also add or remove a document from a working set, and each term can have a set of documents associated with it for testing purposes. The ‘show document movement feedback’ feature, used in conjunction with working sets, visually displays the cause and effect of changes without re-indexing.
Many clients have millions of documents, and the ability to see changes without re-indexing is a significant benefit, saving time, increasing productivity, and improving the accuracy of the classification process. Within conceptTaxonomyManager, administrators have access to a calculations link that shows why a document received the score it did, how many times its terms appeared in the document, and other information that explains the classification, with the ability to change the score, test it, and optionally apply the change to the taxonomy.
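As a rough illustration only, and emphatically not conceptTaxonomyManager's actual scoring logic (the clue names, weights, and threshold below are all invented), clue weighting can be pictured as summing weighted term hits and comparing the total against a node's classification threshold:

```python
# Hypothetical illustration of clue weighting for a taxonomy node --
# not the real conceptClassifier scoring algorithm, just the general idea.

node_clues = {"diagnosis": 3.0, "patient": 2.0, "insurance": 1.0}  # invented weights
threshold = 5.0  # invented classification threshold for the node

def score(document: str) -> float:
    """Sum each clue's weight times the number of times it appears."""
    words = document.lower().split()
    return sum(weight * words.count(clue)
               for clue, weight in node_clues.items())

doc = "patient presented with a diagnosis of hypertension, insurance pending"
s = score(doc)
print(s, "-> classified" if s >= threshold else "-> not classified")
```

Raising the weight on a clue or lowering the node's threshold would pull borderline documents into the node, which is precisely the cause and effect that document movement feedback makes visible.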
If you are looking to be on the cutting edge, not the bleeding edge, and would like to tackle your metadata problems, we remain unique in the industry in our ability to generate multi-term metadata. Want a third-party opinion? Read about how we were recently identified as a strong performer, and a provider that matters, in the latest file analytics report by a leading independent research firm.
Our webinars also address the topics explored in our blogs. Access all our webinar recordings and presentation slides at any time, from our website, in the Recorded Webinars area, via the Resources tab.