Big Data: A Big Challenge

Anish Garg
15 min read · Sep 28, 2020


Have you ever wondered how much data MNCs collect from their users every day🤔🤔? Curious to know the answer??
Let's find out together😅

⚜What is Data?

According to Daniel Keys Moran, “You can have data without information, but you cannot have information without data”.
In computing, data is information that has been translated into a form that is efficient for movement or processing. Relative to today’s computers and transmission media, data is information converted into binary digital form. It is acceptable for data to be used as a singular subject or a plural subject. Raw Data is a term used to describe data in its most basic digital format.
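
As a quick illustration of “binary digital form”, here is a minimal Python sketch (the string itself is just an invented example) showing how a short piece of text becomes bytes and bits:

```python
# A short text string converted to its binary digital form.
text = "data"

raw_bytes = text.encode("utf-8")                # b'data' -- four bytes
bits = " ".join(f"{b:08b}" for b in raw_bytes)  # eight bits per byte

print(list(raw_bytes))  # [100, 97, 116, 97] -- one integer per byte
print(bits)             # 01100100 01100001 01110100 01100001
```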

⚜What is Big Data?

The term “Big Data” refers to data that is so large, fast, or complex that it is difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around for a long time, but the concept of big data gained momentum in the early 2000s.
Big data originates from social media, the public web, archives, documents, media, storage data, machine log data, and sensor data. With big data, we get actionable insights that can be used to engage with customers one-on-one in real time, and it is this high speed that cutting-edge technologies owe much of their value to.

🔰History Of BIG DATA

In 2005, Roger Mougalas from O'Reilly Media coined the term Big Data for the first time, only a year after the company created the term Web 2.0. It refers to a large set of data that is almost impossible to manage and process using traditional business intelligence tools.
As more and more social networks appeared and Web 2.0 took flight, more and more data was created on a daily basis. Innovative startups slowly began to dig into this massive amount of data, and governments started working on Big Data projects too. In 2009, the Indian government decided to take an iris scan, fingerprints, and a photograph of all of its 1.2 billion inhabitants. All this data is stored in the largest biometric database in the world.

In 2010, Eric Schmidt spoke at the Techonomy conference in Lake Tahoe, California, and stated that “there were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days.”

🔰Big Data is categorized into three different types

1. Structured Data:-

Any data that can be stored, accessed, and processed in a fixed format is termed ‘structured’ data. Structured data has a dedicated data model and a well-defined structure; it follows a consistent order and is designed so that it can be easily accessed and used by a person or a computer. Structured data is usually stored in well-defined columns in databases.

2. Unstructured Data:-

Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.

3. Semi-Structured Data:-

Semi-structured data can be considered another form of structured data. It inherits a few properties of structured data, but the major part of this kind of data lacks a definite structure and does not obey the formal structure of data models such as an RDBMS. JSON and XML documents are typical examples.
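
To make the distinction concrete, here is a small Python sketch; the employee records in it are invented purely for illustration. The structured rows fit fixed columns, while the semi-structured JSON documents describe themselves but vary from record to record:

```python
import json

# Structured: every record has the same fixed columns, as in an RDBMS table.
structured = [
    ("E001", "Asha", 34),
    ("E002", "Ravi", 29),
]

# Semi-structured: self-describing JSON, but the fields vary per record,
# so it does not fit a fixed relational schema.
semi_structured = [
    '{"id": "E001", "name": "Asha", "skills": ["SQL", "Spark"]}',
    '{"id": "E002", "name": "Ravi", "phone": {"home": "12345"}}',
]

for doc in semi_structured:
    record = json.loads(doc)
    print(record["id"], sorted(record.keys()))  # keys differ per record
```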

Examples of Big Data

Every day we upload millions of bytes of data. 90% of the world's data has been created in the last two years.

  • Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, posting comments, etc.
  • The New York Stock Exchange generates about one terabyte of new trade data per day.
  • A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
  • YouTube users upload 48 hours of new video every minute of the day.
  • 294 billion emails are sent every day.

✔BIG DATA with 8 V’s

1. Volume — The name Big Data itself relates to an enormous size. The size of data plays a crucial role in determining its value, and whether particular data can actually be considered Big Data depends on its volume.

For example, Walmart processes 2.5 petabytes of data every hour. To put that into layman's terms, one petabyte is equivalent to 13.3 years of HD-TV video. That's a lot of data, and that's only one hour for Walmart.

2. Variety — Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.

3. Velocity — The term ‘velocity’ refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential in the data.
Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

4. Veracity — Veracity refers to the reliability of the data source, its context, and how meaningful it is to the analysis based on it. Not all data that comes in for processing is valuable, so unless the data is cleansed correctly, it is not wise to store or process all of it. If the data is no good, the results are no good.

5. Visualization — Big data processing does not just mean getting a meaningful result; unless the result is visualized in a meaningful way, there is no point in analyzing it. You can't rely on traditional graphs when trying to plot a billion data points, so you need different ways of representing data, such as data clustering, tree maps, parallel coordinates, or circular network diagrams.


6. Value — One of the most important characteristics is value. The primary interest in big data is probably its business value: the ability to convert Big Data information into a monetary reward. There is a lot of Big Data research driven exclusively by a profit motive, such as the research being used to analyze the human genome. The other characteristics of big data are meaningless if you don't derive business value from the data.

7. Viscosity — Viscosity measures the resistance to flow in the volume of data. This resistance can come from different data sources, friction from integration flow rates, and the processing required to turn the data into insight. Technologies to deal with viscosity include improved streaming, agile integration buses, and complex event processing.

8. Virality — Virality describes how quickly information gets dispersed across people-to-people (P2P) networks. It measures how quickly data is spread and shared to each unique node. Time is a determining factor, along with the rate of spread.

How Does Big Data Work?🤔

The principle of big data is very simple: The more knowledge you have about anything or any situation, the more accurate predictions you can make about the future.

Big data projects use cutting-edge analytics involving artificial intelligence and machine learning and tools like Apache Hadoop, Apache Spark, NoSQL, Hive, Sqoop, etc. to process messy data. They process the data generated from various mediums like your social media activities, search engines, sensors, etc. and extract insights from it, which helps in making decisions and predictions for various big data applications.
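
For a flavor of what such processing looks like, here is a minimal word-count sketch using Apache Spark's Python API (PySpark). It assumes PySpark is installed and that a local text file named social_posts.txt exists; both the file and its name are assumptions made for illustration:

```python
# A minimal sketch, assuming PySpark is installed and a local file
# "social_posts.txt" exists (both are assumptions, not from the article).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BigDataSketch").getOrCreate()

# Read messy, unstructured text and count word frequencies in parallel.
lines = spark.read.text("social_posts.txt")
words = (lines.rdd
         .flatMap(lambda row: row.value.lower().split())
         .map(lambda w: (w, 1))
         .reduceByKey(lambda a, b: a + b))

# Print the ten most frequent words.
for word, count in words.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, count)

spark.stop()
```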

There is no doubt that big data is revolutionizing the world of business across almost every industry. As I said earlier, companies can predict what specific segments of customers will want to buy, and when, to an incredibly accurate degree. Big data is also helping companies run their operations in a much more efficient way.

BIG DATA🔢=BIG MONEY💰💰

It’s no secret that big data is essential for business, assuming you can trust your data. It’s the key to maintaining visibility into your business operations so that you can keep your organization lean and mean. Quality data also helps you plan and measure the success of everything from marketing campaigns to new product releases. If your data is trustworthy, it is one of your most strategic assets.

In general, however, it’s clear that there’s big money in big data. Consider the following statistics:

  • The big data market is worth $138.9 billion in 2020 and is expected to top $229.4 billion by 2025. That figure measures how much companies are investing in big data, not how much value they derive from it. Still, it provides a sense of just how much financial capital enterprises are pouring into data operations.
  • According to a 2019 McKinsey report, companies with the greatest overall growth and revenue earnings are three times more likely than other companies to say that their data and analytics initiatives have contributed at least 20 percent to earnings before interest and taxes (EBIT) over the past three years.
  • In its 2019 Big Data and AI Executive Survey, NewVantage Partners notes that 62.2 percent of respondents have achieved measurable results from their big data investments, with general improvements in the areas of advanced analytics (79.8 percent), expense reduction (59.5 percent), customer service (57.1 percent), and speed to market (32.1 percent).

⚜HOW BIG IS BIG DATA?🤔

We have entered the Age of Data for good. Everything we do online, and even offline, leaves traces in data — from cookies to our social media profiles. So how much data is there, really? How much data do we process on a daily basis? Welcome to the Zettabyte Era.


Data is measured in bits and bytes. One bit contains a value of 0 or 1; eight bits make a byte. Then we have kilobytes (1,000 bytes), megabytes (1,000² bytes), gigabytes (1,000³ bytes), terabytes (1,000⁴ bytes), petabytes (1,000⁵ bytes), exabytes (1,000⁶ bytes), and zettabytes (1,000⁷ bytes, or 1,000,000,000,000,000,000,000 bytes). One zettabyte is equal to a thousand exabytes, a billion terabytes, or a trillion gigabytes.
In other words — that’s a lot!
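
A quick Python sanity check of the conversions quoted above (decimal SI units, a factor of 1,000 per step):

```python
# Decimal (SI) byte units: each step up is a factor of 1,000.
KB, MB, GB, TB, PB, EB, ZB = (1000 ** n for n in range(1, 8))

print(f"{ZB:,} bytes in a zettabyte")  # 1,000,000,000,000,000,000,000
print(ZB // EB, "exabytes per zettabyte")   # 1,000 (a thousand)
print(ZB // TB, "terabytes per zettabyte")  # 1,000,000,000 (a billion)
print(ZB // GB, "gigabytes per zettabyte")  # 1,000,000,000,000 (a trillion)
```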

Internet traffic is only one part of total data storage, which also includes all personal and business devices. Estimates for the total data storage capacity we have right now, in 2019, vary, but they are already in the 10–50 zettabyte range. By 2025 this is estimated to grow to the range of 150–200 zettabytes.

Data creation will certainly only accelerate in the coming years, so you might wonder: is there any limit to data storage? Not really; or rather, there are limits, but they are so far away that we won't get anywhere near them anytime soon. For example, just one gram of DNA can store 700 terabytes of data, which means we could store all the data we have right now on 1,500 kg of DNA; packed densely, it would fit into an ordinary room. That, however, is very far from what we can currently manufacture: the largest hard drive being manufactured holds 15 terabytes, and the largest SSD reaches 100 terabytes.

The term Big Data refers to a dataset which is too large or too complex for ordinary computing devices to process. As such, it is relative to the available computing power on the market.

🔰Ever wondered how these MNCs store that much data??

These problems of big data are solved using a concept called distributed storage.

Distributed storage is an attempt to offer the advantages of centralized storage with the scalability and cost base of local storage. A Distributed Storage System (DSS), formed by networking together a large number of inexpensive and unreliable storage devices, provides one such alternative for storing a massive amount of data with high reliability and ubiquitous availability.

For example, suppose there are hundreds of DataNodes, each with the capacity to store only 10GB of data, and all the DataNodes are connected to a NameNode that stores all the information about the DataNodes. Clients connect through the NameNode.
Now a client wants to store a 1000GB file, but each DataNode can store at most 10GB of data. So the NameNode breaks the 1000GB file into blocks of size (1000 / number of DataNodes) and saves each block on a different DataNode. This gives us effectively unlimited storage just by adding DataNodes, and it takes very little time to store the data, because we are breaking the 1000GB file into 100 blocks of 10GB each and writing them in parallel. In this way we can store Big Data by creating a cluster of a NameNode and hundreds or thousands of DataNodes.
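
Here is a toy Python sketch of that bookkeeping. It is illustrative only, not real HDFS code; the capacities come from the example above, and the block and node names are invented:

```python
# Toy model of the NameNode's bookkeeping from the example above.
FILE_SIZE_GB = 1000
NUM_DATA_NODES = 100
NODE_CAPACITY_GB = 10

block_size_gb = FILE_SIZE_GB // NUM_DATA_NODES   # 1000 / 100 = 10 GB per block
assert block_size_gb <= NODE_CAPACITY_GB, "each block must fit on a DataNode"

# The NameNode keeps only metadata: which block of the file lives on which node.
block_map = {f"block-{i:03d}": f"datanode-{i:03d}" for i in range(NUM_DATA_NODES)}

print(block_size_gb, "GB per block")
print(len(block_map), "blocks, written to the DataNodes in parallel")
print("block-000 lives on", block_map["block-000"])
```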


🔰Master-Slave Cluster

Master/slave is a model of communication for hardware devices where one device has a unidirectional control over one or more devices.

In this model, several systems share their content or storage with one system, called the master; the other systems, which share their data with the master via networking, are called slaves. This is called the master-slave model, and it is how the distributed storage concept is implemented. The whole team of master and slaves together is called a cluster.

HADOOP- BIG DATA SAVIOR

Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

It is open-source software developed as a project by the Apache Software Foundation. Doug Cutting created Hadoop, and in 2008 Yahoo gave Hadoop to the Apache Software Foundation. Hadoop comes in various flavors like Cloudera, IBM BigInsights, MapR, and Hortonworks.

In Hadoop, the terminology of the master-slave model changes: the master is called the NameNode (NN) and a slave is called a DataNode (DN), and in a cluster every system is called a node. All nodes are connected via networking, and Hadoop uses a protocol called HDFS (Hadoop Distributed File System).

Distributed storage stores the data in parallel by splitting/striping it, so the data is stored in less time. Data striping is done by the master node (NameNode), which transfers data to all the respective DataNodes (slave nodes) within seconds. Here, each node refers to a single machine, and all the nodes communicate over the network using the HDFS protocol.

Master Node also known as NameNode:

  • The client contacts the HDFS master, the NameNode, to access files in the cluster.
  • The NameNode holds the metadata; it takes care of client authentication, space allocation for the actual data, and details about the actual storage locations.
  • The NameNode also maintains the slave nodes, assigns tasks to them, and keeps track of slave-node performance, failures, etc.
  • It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live.

Slave Node also known as DataNode:

  • These are the slave daemons or processes which run on each slave machine.
  • The actual data is stored on the DataNodes.
  • The DataNodes serve the low-level read and write requests from the file system's clients.
  • They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
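
Purely to illustrate the heartbeat cadence, here is a minimal Python sketch; this is not actual DataNode code, and the node ID is invented. The 3-second interval is the default quoted above:

```python
import time

HEARTBEAT_INTERVAL_SECONDS = 3  # HDFS default mentioned above

def send_heartbeat(node_id):
    # In real HDFS this is an RPC to the NameNode carrying health info;
    # here we just print to show the cadence.
    print(f"{node_id}: heartbeat at {time.strftime('%H:%M:%S')}")

for _ in range(3):  # a real DataNode loops for as long as it runs
    send_heartbeat("datanode-001")
    time.sleep(HEARTBEAT_INTERVAL_SECONDS)
```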

⚜Features of HDFS

  • Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
  • Computing power. Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
  • Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically (see the sketch after this list).
  • Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
  • Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
  • Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
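
To illustrate the fault-tolerance point, here is a toy Python sketch of block replication. Replication factor 3 is HDFS's well-known default; the round-robin placement and node names are simplified inventions, since real HDFS placement is rack-aware:

```python
REPLICATION_FACTOR = 3  # HDFS's default number of copies per block
nodes = [f"datanode-{i}" for i in range(5)]

def place_replicas(block_id):
    # Round-robin placement for illustration; real HDFS is rack-aware.
    return [nodes[(block_id + r) % len(nodes)] for r in range(REPLICATION_FACTOR)]

replicas = place_replicas(block_id=0)
print(replicas)  # ['datanode-0', 'datanode-1', 'datanode-2']

# If one node fails, two live copies remain and a new replica can be created.
surviving = [n for n in replicas if n != "datanode-1"]
print(surviving)  # ['datanode-0', 'datanode-2']
```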

Applications of Big Data

Big Data is considered the most valuable and powerful fuel that can run the massive IT industries of the 21st century. It is among the most widespread technologies, used in almost every business sector. Let us now check out a few applications, as mentioned below.

  1. Communication in IT sector:
    One of the largest users of Big Data, IT companies around the world are using Big Data to optimize their functioning, enhance employee productivity, and minimize risks in business operations. By combining Big Data technologies with ML and AI, the IT sector is continually powering innovation to find solutions even for the most complex of problems.
  2. Travel and Tourism :
    It is one of the biggest users of Big Data Technology. It has enabled us to predict the requirements for travel facilities in many places, improving business through dynamic pricing and many more. In countries across the world, both private and government-run transportation companies use Big Data technologies to optimize route planning, control traffic, manage road congestion, and improve services.
  3. Medical:
    Big Data has already started to create a huge difference in the healthcare sector. With the help of predictive analytics, medical professionals and Health Care Personnel are now able to provide personalized healthcare services to individual patients. Apart from that, fitness wearables, telemedicine, remote monitoring — all powered by Big Data and AI — are helping change lives for the better.
  4. Financial and Banking:
    These sectors use Big Data technology extensively. Big data analytics can aid banks in understanding customer behaviour based on inputs received from their investment patterns, shopping trends, motivation to invest, and personal or financial backgrounds.
  5. IOT:
    Big Data and the Internet of Things work together: vast networks of sensors (IoT) collect a boatload of information (big data) that is then used to improve services and products in various industries, which in turn generate revenue — hundreds of billions of dollars annually — from those products and services. And the cash flow is speeding up.
    By 2025, according to Thingstream, “it's expected that more than 100 billion operational devices will be connected to the Internet of Things (IoT), generating a total revenue of nearly $10 trillion.”

Wrapping Up

I hope I was able to answer the “What is Big Data?” question clearly enough, and that you now understand the types of big data, the characteristics of big data, how big data is stored, distributed storage, etc.
I hope I have done justice to the topic. I tried my best to leave no stone unturned, and I will be glad to welcome your suggestions.

Thank you for reading and being with me until here. I hope that you guys have found it helpful and informative.
So please don't hesitate anymore: press the clap icon and feel free to give your suggestions and support in the comment section below🥳🥳

KEEP READING KEEP GROWING.
