The 80 Percent Statistic – Fact or Fiction? Just Say, “I Made It Up”
For years, the 80 percent statistic has been used, by me and everyone else in my work orbit. For those of you who have been lucky enough never to have come across it, the concept is, “80 percent of an organization’s data is unstructured.”
The idea seems to have seen a recent resurgence and is popping up all over the place, most probably used by a whole new generation who believe it is a new statistic. It isn’t. In fact, there is no proof that it is a trustworthy statistic at all.
From my tattered old notes, I have it attributed to Gartner, but when Sue Feldman, formerly of IDC, was asked about it, she said that it came from a very old study, produced at least 19 years ago, by IBM. But IBM did not back up the figure with hard facts that would make it true. In other words, they made it up.
Way back in 1991, a Software magazine article made the daring statement that as much as 90 percent of data in an organization is unstructured, or rather, “non-numerical freeform data.” Jumping into the fray in 1996, Oracle agreed and stated that “90 percent of digitally stored data is unstructured information, mostly text.” A 1998 report by Merrill Lynch on enterprise portals, at that time an ‘emerging concept,’ surmised that approximately 80 percent of content in an organization was unstructured, stating, “some estimates run as high as 80 percent.” Again, no proof.
Fast forward to 2006, and The Data Warehousing Institute claimed that structured data made up 47 percent, unstructured 31 percent, and semi-structured 22 percent. Ahem, structured data? Not wishing to imply they were biased. Although unstructured and semi-structured data together, at 53 percent, do actually outweigh structured, they still fall far short of the 80 percent we are looking for to establish the truth. A single source of truth seems to be non-existent here.
So how did this become fact and not fiction? I wonder if it had to do with the Pareto principle, and with some marketer, similar to me, deciding that they needed something substantive to back up the claim that there’s a lot of unstructured content. 80 percent sounds better than “a lot.”
At least it has the right numbers. The principle states that 80 percent of effects come from 20 percent of the causes. It actually has nothing at all to do with unstructured content, but it is repeatedly used in science, sports, software, wealth distribution, engineering, and just about everywhere someone can twist it into making sense. Richard Koch, a British author, went so far as to write a book about it, ‘The 80/20 Principle,’ on how to apply the principle to life and business. Why not unstructured content?
So there you have it. I do not have the answer. Despite the popularity of this statement, it does not appear that there is any truth in it, at least until someone comes up with some facts and figures. Sometimes, when I come across a reader in a bad mood – hardly ever, thank goodness – I get grilled on my information sources. I can honestly say that, typically, I am very careful. From now on, I am just going to say, “I made it up.” I wish analysts would say that. I think we can all agree, there is a lot of unstructured content. If anyone does really know the source of the quote, please let me know.