Intel’s Big Data 101 PDF offers an excellent introduction to and definition of big data, one that is very pertinent to this section.
“Data is exploding at an astounding rate. While it took from the dawn of civilization to 2003 to create 5 exabytes of information, we now create that same volume in just two days! By 2012, the digital universe of data will grow to 2.72 zettabytes (ZB) and will double every two years to reach 8 ZB by 2015. For perspective: That’s the equivalent of 18 million Libraries of Congress. Billions of connected devices—ranging from PCs and smartphones to sensor devices such as RFID readers and traffic cams—generate this flood of complex structured and unstructured data.
Big data refers to huge data sets characterized by larger volumes (by orders of magnitude) and greater variety and complexity, generated at a higher velocity than your organization has faced before. These three key characteristics are sometimes described as the three Vs of big data.
Unstructured data is heterogeneous and variable in nature and comes in many formats, including text, document, image, video, and more. Unstructured data is growing faster than structured data. According to a 2011 IDC study, it will account for 90 percent of all data created in the next decade. As a new, relatively untapped source of insight, unstructured data analytics can reveal important interrelationships that were previously difficult or impossible to determine.
Big data analytics is a technology-enabled strategy for gaining richer, deeper, and more accurate insights into customers, partners, and the business—and ultimately gaining competitive advantage. By processing a steady stream of real-time data, organizations can make time-sensitive decisions faster than ever before, monitor emerging trends, course-correct rapidly, and jump on new business opportunities.”
A multitude of industries now face the challenge of Big Data, and a flourishing portfolio of technologies is coming of age to help process it.
An article in Wikipedia describes “Big Science” experiments as an excellent example of Big Data:
“…the Large Hadron Collider experiment, where about 150 million sensors deliver data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.999% of these streams, there are 100 collisions of interest per second.
As a result, working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents a 25-petabyte annual rate before replication (as of 2012). This becomes nearly 200 petabytes after replication.
If all sensor data were recorded at the LHC, the data flow would be extremely hard to work with. It would exceed a 150-million-petabyte annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10²⁰) bytes per day, almost 200 times higher than all the other sources combined in the world.”
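As a rough sanity check on the quoted figures, the conversion from an annual rate to a daily rate can be sketched as follows. This is a back-of-envelope calculation, not part of the source; it shows the order of magnitude is consistent with the quoted “nearly 500 exabytes per day”.

```python
# Back-of-envelope check of the quoted LHC figures (illustrative only).
PB = 10**15  # bytes in a petabyte (decimal convention)
EB = 10**18  # bytes in an exabyte

annual_unfiltered = 150_000_000 * PB  # "150 million petabytes annual rate"
per_day = annual_unfiltered / 365     # bytes per day before replication

print(f"{per_day / EB:.0f} EB/day")   # roughly 400 EB/day, i.e. ~4x10^20 bytes
```

The result lands in the same order of magnitude as the quoted 5×10²⁰ bytes per day; the difference comes down to rounding in the source.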
Other industries handle similarly enormous volumes of information. Internet companies such as Google have made massive headway in addressing the problem with their creation of MapReduce and the Google File System (GFS), the technologies on which the Hadoop platform is based. Other companies and organisations that actively leverage Hadoop, as just one example, for the processing of Big Data include:
- Human Genome Project
- Hewlett Packard
- The Financial Times
- Fox Interactive Media
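The MapReduce model that underpins Hadoop can be illustrated with a minimal, single-machine sketch: a map phase emits key–value pairs, and a reduce phase aggregates them per key. The classic example is word counting; the function names here are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data big insight", "data at scale"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'insight': 1, 'at': 1, 'scale': 1}
```

In a real Hadoop cluster the same two phases run in parallel across many machines, with the framework shuffling all pairs sharing a key to the same reducer; the simplicity of the programming model is what lets it scale to the data volumes described above.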