Where are these zettabytes coming from?
In my previous post we chatted about what a zettabyte actually is. The next question begging to be answered is: where are all these zettabytes coming from? Some may doubt the accuracy of growth predictions measured in zettabytes, but consider the following:
- Twitter handles 250 million tweets per day, or over 46MB/sec of data created
- Facebook has 950 million monthly active users
- LinkedIn has over 100 million users (mid-2011)
- The data warehouse Hadoop cluster at Facebook is rumored to be the largest in the world, with:
  - 21PB of storage in a single HDFS cluster
  - 2,000 machines
  - 12TB per machine (a few have 24TB each)
  - 1,200 machines with 8 cores each + 800 machines with 16 cores each
  - 32GB of RAM per machine
  - And the list goes on…
- Facebook collects an average of 15TB of data every day, or 5,000+ TB per year, and has more than 30PB in a single cluster (March 2011)
- 107 trillion emails were sent in 2010 (interestingly, around 90% were spam or viruses)
- There were 172 million blogs, with more than 1 million new posts per day
- Google has more than 50 billion pages in its index (December 2011)
- YouTube has more than 800 million unique visitors per month; 4 billion hours of video are watched each month, and 72 hours of video are uploaded every minute
- Amazon’s S3 cloud service had some 262 billion objects at the end of 2010, with approximately 200,000 requests per second
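Taken at face value, these per-day figures are easy to sanity-check. The sketch below (assuming decimal units, i.e. MB = 10^6 bytes) works out what the Twitter and Facebook claims imply:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Twitter: 250 million tweets/day claimed alongside 46MB/sec of data created
tweets_per_day = 250_000_000
bytes_per_sec = 46 * 10**6  # 46MB/sec, assuming MB = 10^6 bytes

tweets_per_sec = tweets_per_day / SECONDS_PER_DAY
implied_bytes_per_tweet = bytes_per_sec * SECONDS_PER_DAY / tweets_per_day

# Facebook: 15TB collected per day, extrapolated to a year
fb_tb_per_year = 15 * 365

print(f"{tweets_per_sec:,.0f} tweets/sec")            # ~2,894 tweets/sec
print(f"{implied_bytes_per_tweet:,.0f} bytes/tweet")  # ~15,898 bytes each
print(f"{fb_tb_per_year:,} TB/year")                  # 5,475 TB/year
```

Note the implied ~16KB per tweet is far more than 140 characters of text: the figure only adds up once you count the metadata (user, timestamps, location, client, and so on) carried alongside each message, and 15TB/day lands comfortably in the "5,000+ TB per year" range.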
Luckily, many of the statistics above reflect ‘personal’ communications outside the realm of the organization; they simply offer an interesting view of the growth of Big Data, specifically unstructured content. Unstructured content includes emails, word processing documents, multimedia, video, PDF files, spreadsheets, messaging content, digital pictures and graphics, mobile phone GPS records, and social media content, which combines many of these elements on a gargantuan scale. Good news for us, as someone has to make sense of it.
The typical organization doesn’t have to deal with such massive amounts of unstructured content, but with Big Data as the buzzword du jour, it will eventually become mainstream. The capture and analysis of unstructured content remains elusive because the industry approach has been to force it into the neat rows and columns of a database. Guess what? It doesn’t work.
Forward-thinking organizations are recognizing the need to manage their unstructured content – even if only internally. Until that happens, analysis from a Big Data standpoint can’t happen: enterprises will continue to watch massive amounts of unstructured content fill up their servers with no one able to make heads or tails of its meaning.
The first step is Building Block 1 in our Smart Content Framework™: an Enterprise Metadata Repository. From there, organizations can begin to harness the power of their unstructured content, make sense of it, and use it to their business advantage.
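To make the idea concrete, here is a deliberately minimal sketch of what a metadata repository does at its core: attach descriptive attributes to otherwise opaque documents so they become searchable. The class and field names are hypothetical illustrations, not part of the Smart Content Framework™; a real repository would capture far richer metadata (authorship, retention class, security markings, extracted entities, and so on).

```python
import os
from datetime import datetime, timezone

def extract_metadata(path, content):
    """Derive basic descriptive metadata for an unstructured document.
    Illustrative only; real systems extract far richer attributes."""
    return {
        "name": os.path.basename(path),
        "extension": os.path.splitext(path)[1].lstrip("."),
        "size_bytes": len(content.encode("utf-8")),
        "word_count": len(content.split()),
        "indexed_at": datetime.now(timezone.utc).isoformat(),
    }

class MetadataRepository:
    """A toy in-memory metadata repository keyed by document path."""

    def __init__(self):
        self._records = {}

    def index(self, path, content):
        self._records[path] = extract_metadata(path, content)

    def find_by_extension(self, ext):
        # Once metadata exists, simple attribute queries become possible.
        return [p for p, m in self._records.items() if m["extension"] == ext]

repo = MetadataRepository()
repo.index("reports/q1.pdf", "Quarterly revenue summary for review")
repo.index("notes/meeting.txt", "Action items from Monday's meeting")
```

The point of the sketch is the shift it represents: once even this small amount of metadata exists, content stops being an undifferentiated pile of files and starts answering questions – which is precisely the capability that downstream Big Data analysis depends on.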