Big Data: Peeling the Layers Off an Onion
In our most recent blog post about Big Data we looked at one definition, the three V’s: volume, velocity, and variety. There are two more noteworthy definitions. One is VAST, which stands for Variable Attributed Subjects, or People and Time. In this definition, one or more variable attributes number in the thousands, the tens of thousands, or even the millions, which makes it hard to extract meaningful information and exceeds the capabilities of traditional data analysis tools.
The second additional definition was coined at Berkeley: AMP, which stands for Algorithms, Machines, and People. This one makes sense too. It defines Big Data as any data set that is expensive to manage and hard to extract value from. In this definition, the issue is not necessarily speed or velocity but the challenge of understanding the data, typically large amounts of it, for business advantage.
When we start to peel off the layers, other considerations pop up. There is ‘unstructured data’ and there is ‘unstructured content’. For example, Hadoop and MapReduce are technologies that can process unstructured data that doesn’t quite fit the typical data warehouse or analytical tool model. A good example would be clickstream data, which has some structure but doesn’t fit neatly into rows and columns.
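To make the clickstream example a little more concrete, here is a minimal sketch of the MapReduce idea in plain Python. It is not Hadoop code, and the log format, the sample lines, and the function names are all assumptions made up for illustration. The point is only that each line has some structure (a timestamp, a user, a URL) plus free-form extras, and that a map step and a reduce step can still pull a simple count out of it.

```python
from collections import defaultdict

# Hypothetical clickstream lines: timestamp, user, URL, then optional
# free-form key=value extras. This format is an assumption for illustration.
SAMPLE_LINES = [
    "2013-01-05T10:00:01 u1 /home ref=google",
    "2013-01-05T10:00:04 u2 /products ref=email campaign=spring",
    "2013-01-05T10:00:09 u1 /products",
    "malformed line",
    "2013-01-05T10:00:12 u3 /home",
]


def map_line(line):
    """Map step: emit (url, 1) for each parseable record, skip the rest."""
    parts = line.split()
    if len(parts) >= 3 and parts[2].startswith("/"):
        yield parts[2], 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the values collected for one key."""
    return key, sum(values)


def count_page_views(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(count_page_views(SAMPLE_LINES))
```

In a real Hadoop job the map and reduce steps would run in parallel across many machines and the framework would handle the grouping; here everything runs in one process so the shape of the computation is easy to see.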
Unstructured content, which is our focus, is free-form language: emails, documents, collaboration, social networking applications, etc. Statistics suggest that an estimated 80% of business decisions are made from unstructured content, and that this content is growing at an annual rate of 62%. Deriving business advantage from it is a bit trickier because it is created by humans, not machines. The result is highly valuable information, or knowledge capital, that is often overlooked because it can’t be found. Many organizations, if not most, have no plan for their unstructured content and do not manage it. Nor are they using it at the most basic level: to re-use content (50% of documents are duplicates); to improve search (a typical end user will spend 2.5 hours per day searching for information); or to find ‘relevant’ information (85% of relevant documents are never retrieved in search). And the list goes on and on.
Whether it is structured data, semi-structured data, or unstructured content, the bottom line is the need to extract value from it and analyze it to make better business decisions.