Big Data Definition and Value Proposition

There is a lot of buzz surrounding Big Data, and I am sure you are just as excited as I am about the opportunities. Indeed, these are exhilarating times for technologists who are looking to transform traditional information consumption methods. The advent of new technologies and paradigms helps us expand our capabilities exponentially.

Brief History

At the start of this millennium, as usage of the internet grew (especially for transactions), the amount of data being generated grew with it. Companies wanted to leverage the growing data to gain a competitive edge. Data Warehousing and Business Intelligence concepts were born. However, companies such as Yahoo soon realized that the growth was exponential. It was becoming difficult to process such large amounts of data (ETL into the BI tools) before a new set arrived the following day. There was also the issue of handling unstructured data (not previously identified as useful), which contained invaluable information. Consequently, a large portion of data was being either archived or discarded.

Google published a paper in 2003 that outlined a distributed file system known as the Google File System (GFS). Two developers, Doug Cutting and Mike Cafarella, built on the GFS concept to create Hadoop, an open-source software framework under the Apache Software Foundation. Hadoop enabled organizations to handle extremely large amounts of data without the overhead of having to know everything about the data before it can be stored, i.e. the fields in a record, their types, their format, etc. All of that could be worked out while extracting or reading information as needed. This significantly increased the value that could be extracted from the data that users were feeding into the internet. Companies have figured out ways to leverage this to enhance their Business Intelligence practice, gain additional business insights and make better decisions.
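To make the "store first, describe later" idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code and uses no Hadoop API. The sample records, the field names and the `read_with_schema` helper are all made up for illustration. The point is only that raw records are kept as they arrive, and each reader decides at read time which fields and types it cares about.

```python
import json

# Hypothetical raw events, stored exactly as they arrived.
# Nothing was validated or typed when they were written.
RAW_LINES = [
    '{"user": "alice", "amount": "19.99", "item": "book"}',
    '{"user": "bob", "amount": "0.99", "item": "song", "device": "phone"}',
    'not even json',
    '{"user": "carol", "item": "video"}',
]


def read_with_schema(lines, schema):
    """Apply a schema while reading (illustrative helper, not a library API).

    `schema` maps field name -> cast function. Records that cannot be
    parsed, or that lack a requested field, are skipped by this reader
    but remain untouched in storage for other readers.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable for this reader; still stored
        if not isinstance(record, dict):
            continue
        try:
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue  # missing or uncastable field for this schema


# Two readers, two schemas, same raw data.
purchases = list(read_with_schema(RAW_LINES, {"user": str, "amount": float}))
viewers = list(read_with_schema(RAW_LINES, {"user": str, "item": str}))

print(len(purchases), "records usable as purchases")
print(len(viewers), "records usable as item views")
```

The same four stored lines yield different result sets depending on the schema the reader brings, which is the flexibility the paragraph above describes. A traditional warehouse load would instead have had to settle the fields and types before storing anything.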

What is Big Data?

Big Data is the term used to describe a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

Google, Facebook, Amazon, Apple and Microsoft know a lot about you, more than you think they know. A small portion of this is evidenced by the recommendations you receive while making a transaction online (such as buying a book or a song, or watching a video) or even while simply browsing the internet. These companies have clearly demonstrated the tremendous business value in taming Big Data.

These companies, and many like them, leverage data to predict consumer behavior. They have successfully applied machine learning in the context of Big Data, that is, prediction based on known properties learned from training data. This is not to be confused with Data Mining, which focuses on the discovery of previously unknown properties in the data. I will cover this in detail in another post.

Going back to the definition, what is considered Big Data varies based on the capabilities of the organization managing the data set, and on the capabilities of the applications that are traditionally used to process and analyze the data set within its domain. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration. Whatever the differences, an organization needs to look at Big Data within the following context.



Volume: How big is the data that needs to be processed, data that may hold key information about various facets of our business?

Variety: How varied is the data (structured, unstructured, semi-structured, linked and dynamic) that we need to process to extract meaningful information from it?

Velocity: How often do we get updates, or at what speed do we receive the raw data that holds the information needed to gain competitive advantage?

Veracity: Is the data trustworthy, authentic and dependable? Is some of the information needed to form meaningful correlations missing, in a way that may affect our judgement or decisions?

Value: What is the expected value that can be derived? How do we determine the value of data based on correlations and its use for predictability (patterns or future actions)?

Why Should I Care?

As you may have read, experiments at the Large Hadron Collider at CERN generate 40 terabytes of data every second, more than can be stored or analyzed. So scientists collect what they can and let the rest go unpreserved. According to a study published by International Data Corp (IDC), a market-research firm, around 1,200 exabytes of digital data were generated in 2010. In 2008, researchers at the University of California, San Diego (UCSD) examined the flow of data to American households. They found that such households were bombarded with 3.6 zettabytes of information (or 34 gigabytes per person per day). In the past, information consumption was largely passive, leaving aside analog and verbal data. Today half of all bytes are received interactively, according to UCSD. Further details about the study can be found in the Economist.
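Numbers at this scale are hard to picture, so here is a quick sanity check on the UCSD figures. It is a back-of-the-envelope sketch resting on two assumptions of mine, not on anything stated in the study: that the 3.6 zettabytes is a total for the whole of 2008, and that the units are decimal (1 zettabyte = 10^21 bytes, 1 gigabyte = 10^9 bytes).

```python
# Assumed decimal (SI) units; the study may define them differently.
ZETTABYTE = 10**21
GIGABYTE = 10**9

total_bytes = 3.6 * ZETTABYTE          # assumed to be the total for 2008
per_person_per_day = 34 * GIGABYTE     # figure quoted alongside it
days_in_2008 = 366                     # 2008 was a leap year

# How many people would it take for the two figures to agree?
implied_people = total_bytes / (per_person_per_day * days_in_2008)

print(f"Implied number of people: {implied_people / 1e6:.0f} million")
```

Under those assumptions the two figures agree if they cover a little under 300 million people, which is in the neighborhood of the US population at the time. So the headline total and the per-person figure appear to be consistent with each other.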

As per the Economist: ‘Only 5% of the information that is created is “structured”, meaning it comes in a standard format of words or numbers that can be read by computers. The rest are things like photos and phone calls which are less easily retrievable and usable. But this is changing as content on the web is increasingly “tagged”, and facial-recognition and voice-recognition software can identify people and words in digital files’.

Here’s an extract from a well written piece by Kenneth Cukier: ‘Wal-Mart, a retail giant, handles more than 1m customer transactions every hour, feeding databases estimated at more than 2.5 petabytes—the equivalent of 167 times the books in America’s Library of Congress (see article for an explanation of how data are quantified). Facebook, a social-networking website, is home to 40 billion photos. And decoding the human genome involves analysing 3 billion base pairs—which took ten years the first time it was done, in 2003, but can now be achieved in one week’.


In recent years Oracle, IBM, Microsoft and SAP have between them spent more than $15 billion on buying software firms specializing in data management and analytics. As the capabilities of digital devices soar, the amount of information digitized grows by the minute in various forms such as images (contracts, checks, invoices, photos, QR codes), videos (research findings, tutorials, etc.), email and text on social media. As per Cisco, 667 exabytes of information will be generated in 2013 alone.

The quantitative change has begun to make a qualitative difference. Information is no longer scarce; it is abundant. As Craig Mundie, head of research and strategy at Microsoft, states: “What we are seeing is the ability to have economies form around the data—and that to me is the big change at a societal and even macroeconomic level”.

Rollin Ford, the CIO of Wal-Mart, has said: “Every day I wake up and ask, ‘how can I flow data better, manage data better, analyse data better?’”. This truly is, as some folks term it, the equivalent of an “Industrial Revolution” for the Information Age.

What is the Value Proposition?

The following are some of the key values derived from implementing a robust Big Data strategy.

Business Performance

Improve business performance by converting enormous volumes of data into actionable information that can be consumed by decision makers at all levels.

Cost Reduction

Find new opportunities to decrease cost and investment; dynamically price products and services based on predicted behavior and additional discovery.

Utilization Improvement

Use improved analytics to gain greater utilization from equipment, facilities, money, personnel, and other tangible and intangible assets.

Productivity Improvement

Enhance customer response and increase resource throughput; decrease delays; find ways to improve processes and technologies.

Additional Facts to Chew On

Here are some additional facts from around the web to think about.

In 2012, the Obama administration announced the Big Data Research and Development Initiative, which explored how big data could be used to address important problems faced by the government. The initiative was composed of 84 different Big Data programs spread across six departments.
The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.
FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide.
Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.
Another company currently uses two data warehouses, at 7.5 petabytes and 40 petabytes, as well as a 40-petabyte Hadoop cluster for search, consumer recommendations, and merchandising.
Amazon handles millions of back-end operations every hour, as well as queries from more than half a million third-party sellers daily. The core technology that keeps Amazon running is Linux-based, and as of 2005 they had the world’s three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
There are over 5 billion mobile-phone subscriptions worldwide, and nearly 2 billion people access the internet.
It is predicted that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2013.

In conclusion, Big Data can be harnessed to address the challenges that arise when information is dispersed across several different systems that are not interconnected by a central system. By aggregating data across systems, a Big Data approach can help improve decision-making capabilities.