Big Data: The Untold Story

7 min read · Sep 17, 2020

How do big MNCs like Google, Facebook, Instagram, etc. store, manage, and manipulate thousands of terabytes of data with high speed and high efficiency?

By Anirudhi Thanvi

What is Big Data? Why is it in such demand?

Over the past 20 years, data stream processing has been one of the most important research areas. Numerous technological innovations are driving the dramatic increase in data and data gathering. The term big data is tossed around in the business and tech world pretty frequently.

“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” ~ Gartner

Well, big data is not a technique or any kind of software. It’s a problem that every big industry is facing, and it occurs because a huge amount of data in different forms has to be handled, processed and analysed properly.

The concept of big data gained momentum in the early 2000s, when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V’s:

Volume: Organizations collect data from a variety of sources, including business transactions, smart (IoT) devices, industrial equipment, videos, social media and more. In the past, storing it would have been a problem, but cheaper storage on platforms like data lakes and Hadoop has eased the burden. The size of the data plays a crucial role in determining how much value can be drawn from it.
Velocity: Velocity refers to the speed at which data is generated. With the growth in the Internet of Things, data streams into businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in near-real time.
Variety: Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, emails, video, audio, stock ticker data and financial transactions.

List of some Big Data Tools Used by Industry:

Apache Hadoop
Apache Spark
Flink
Apache Storm
Apache Cassandra
MongoDB
Kafka
Tableau

These are the technologies and software that different companies use to store and work with billions and trillions of data records in an efficient way.

Who is facing this problem?

Big data may be new for startups and for online firms, but many large firms view it as something they have been wrestling with for a while. This collection of case studies will give you a decent overview of how the FAANG companies and other big MNCs leverage big data to drive business performance.

1. Netflix

Netflix has over 100 million subscribers, and with that comes a wealth of data it can analyze to improve the user experience. Big data has helped Netflix massively in its mission to become the king of streaming. So, how does Netflix use data analytics?

The Netflix data platform is a massive-scale, cloud-only suite of tools and technologies. Netflix started with a more traditional MySQL database for data warehousing, storing more than 10 years of customer data and billions of ratings. However, the growth of data collected by Netflix started to increase exponentially as the service started to shift towards Internet streaming.

Engineers at Netflix created Aegisthus to build a ‘bulk data pipeline’ out of Cassandra, as Cassandra is better suited to individual queries than to bulk analysis. The output from Aegisthus is stored on Amazon S3 once a day, and S3 is what Netflix now uses as its main data warehouse. The data stored in S3 is analyzed with Hadoop. Netflix uses S3 as the backend instead of HDFS for its extremely high durability and versioning capabilities, even though doing so adds a little latency.

Looking at Netflix’s statistics, 85% of US streaming subscribers subscribe to Netflix. At the end of 2019, Netflix had 167.1 million subscribers. Of these, 61 million accounts were registered in the US, with the remaining 106.1 million (63%) spread over the rest of the globe.

Kurt Brown leads the data platform team at Netflix, which architects and manages the technical infrastructure underpinning the company’s analytics, including various big data technologies like Hadoop, Spark, and Presto, machine learning infrastructure for Netflix data scientists, and traditional BI tools including Tableau.

2. Facebook

According to a report, the total number of social media users increased from 2.46 billion to 2.77 billion between 2017 and 2019. Facebook generates 4 petabytes of data per day, which is 4 million gigabytes. All that data is stored in what is known as the Hive, which contains about 300 petabytes of data.

Every minute on Facebook, 510,000 comments are posted, 293,000 statuses are updated (38,000 by another count), and 136,000 photos are uploaded. Facebook users also click the like button on more than 4 million posts every minute, and the like button has been pressed 13 trillion times. Some 4.3 billion Facebook messages are posted daily.

In an earlier disclosure, Facebook revealed that it was generating 500+ terabytes of data every day, including 2.7 billion likes and around 300 million photos per day. Another striking figure is that Facebook was scanning around 105 terabytes of data every half hour.

Source : DIGITAL 2019: GLOBAL DIGITAL OVERVIEW

With such large incoming data rates and the fact that more and more historical data needs to be held in the cluster in order to support historical analysis, space usage is a constant constraint for the ad hoc Hive-Hadoop cluster. Facebook uses Hadoop for its big data store and Hive for parallel map-reduce queries against that store.
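To make the map-reduce idea concrete, here is a minimal single-machine sketch in Python. It is not Facebook's code and it does not use Hive or Hadoop; the event records and function names are hypothetical and only illustrate the three phases (map, shuffle, reduce) that a distributed job would spread across a cluster.

```python
from collections import defaultdict


def map_phase(records):
    """Map step: emit an (action, 1) pair for every record."""
    for record in records:
        yield (record["action"], 1)


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_actions(records):
    """Run the three phases in sequence on an in-memory list."""
    return reduce_phase(shuffle_phase(map_phase(records)))


# Hypothetical event log, standing in for rows stored in the cluster.
events = [
    {"user": "a", "action": "like"},
    {"user": "b", "action": "comment"},
    {"user": "a", "action": "like"},
    {"user": "c", "action": "photo"},
]

counts = count_actions(events)
print(counts)
```

In a real cluster the map and reduce steps run in parallel on many machines and the shuffle moves data over the network; the logic per key stays the same, which is what lets this style of query scale.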

3. Google

We all know Google is the one that can answer any kind of question! We might as well conclude that Google knows everything, and everything means everything! Now you must be wondering how much data Google handles to answer all these questions. Google doesn’t provide numbers on how much data it stores, but over 3.5 billion Google searches are conducted worldwide every day. That is over 40,000 search queries per second, or roughly 1.2 trillion searches per year.
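As a rough sanity check on these search rates, the short sketch below takes the 40,000 queries-per-second figure as given and scales it up. The function names are just for illustration, and a 365-day year is assumed. It works out to about 3.5 billion searches per day and a little over 1.2 trillion per year.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
DAYS_PER_YEAR = 365             # assumption: ignore leap years


def per_day(per_second):
    """Scale a per-second rate to a per-day total."""
    return per_second * SECONDS_PER_DAY


def per_year(per_second):
    """Scale a per-second rate to a per-year total."""
    return per_day(per_second) * DAYS_PER_YEAR


queries_per_second = 40_000  # figure quoted in the text
daily = per_day(queries_per_second)
yearly = per_year(queries_per_second)

print(f"{daily:,} searches per day")
print(f"{yearly:,} searches per year")
```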

Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. With Spanner, Google built a planet-spanning distributed database that allows its global datacentres to keep in sync without suffering huge latencies. For Doug Cutting, the creator of Hadoop, Spanner shows the future possibilities for open source distributed processing platforms like Hadoop.

4. Instagram

Instagram mainly uses two backend database systems: PostgreSQL and Cassandra. Both PostgreSQL and Cassandra have mature replication frameworks that work well as a globally consistent data store.

5. Amazon

The company generates more than $258,751.90 in sales and service fees per minute. Amazon is dominating the marketplace: it processed $373 million in sales every day in 2017, compared to about 120 million in 2014. Each month more than 206 million people around the world get on their devices and visit Amazon.com.

6. Some More General Statistics

This is the sixth edition of DOMO’s report, and according to their research:
“Over 2.5 quintillion bytes of data are created every single day, and it’s only going to grow from there. By 2020, it’s estimated that 1.7MB of data will be created every second for every person on earth.”

The Twitter community generates more than 12 terabytes of data per day, which equals 84 terabytes per week and 4,368 terabytes, or about 4.3 petabytes, per year.
Since 2013, the number of Tweets each minute has increased 58% to more than 474,000 Tweets per minute in 2019.
According to computer giant IBM, 2.5 exabytes — that’s 2.5 billion gigabytes (GB) — of data was generated every day in 2012.
Worldwide over 100 million messages are sent every minute via SMS and in-app messages!
More than 4 million hours of content are uploaded to YouTube every day, with users watching 5.97 billion hours of YouTube videos each day. One count puts uploads at 300 hours of video every minute; YouTube usage more than tripled from 2014 to 2016, with users uploading 400 hours of new video each minute of every day. In 2019, users were watching 4,333,560 videos every minute.
Walmart is the largest retailer in the world and the world’s largest company by revenue, with more than 2 million employees and 20,000 stores in 28 countries. Hadoop and NoSQL technologies are used to provide internal customers with access to real-time data collected from different sources and centralized for effective use.
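The Twitter and IBM figures in the list above are simple unit conversions, and they can be checked with a few lines of Python. This is only a sketch: it assumes decimal (SI) units, where 1 petabyte is 1,000 terabytes, and a 52-week year, which is how the 4,368 figure is reached. With binary units (1,024 per step) the results shift slightly.

```python
# Bytes per unit, using decimal (SI) prefixes as an assumption.
BYTES_PER = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18}


def convert(value, from_unit, to_unit):
    """Convert a data size between the units listed in BYTES_PER."""
    return value * BYTES_PER[from_unit] / BYTES_PER[to_unit]


# Twitter: 12 TB per day, scaled to a week and a 52-week year.
twitter_tb_per_day = 12
tb_per_week = twitter_tb_per_day * 7
tb_per_year = tb_per_week * 52
pb_per_year = convert(tb_per_year, "TB", "PB")

# IBM: 2.5 exabytes per day, expressed in gigabytes.
ibm_gb_per_day = convert(2.5, "EB", "GB")

# DOMO: 2.5 quintillion bytes (2.5 * 10**18) per day, expressed in exabytes.
domo_eb_per_day = 2.5 * 10**18 / BYTES_PER["EB"]

print(f"Twitter: {tb_per_week} TB/week, {tb_per_year} TB/year, {pb_per_year:.3f} PB/year")
print(f"IBM: {ibm_gb_per_day:,.0f} GB/day")
print(f"DOMO: {domo_eb_per_day} EB/day")
```

Under these assumptions the DOMO and IBM numbers describe the same daily volume, 2.5 exabytes, stated in different units.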

“Big data is indeed a buzzword, but it is one that is frankly under-hyped.”

You can have data without information, but you cannot have information without data. For data you require storage, and for storage you need good resources and proper knowledge of how to handle any data.

Thank you for giving your valuable time and reading till the end.
Hope you liked it and found this article helpful.
