Big Data

Kranthi Kumar Kandula
13/02/2017

Big Data

Big Data refers to very large volumes of data, of various types (structured, semi-structured, and unstructured), that traditional database applications cannot process adequately. The challenges include storing, processing, transferring, searching, analyzing, and querying the data.

The Characteristics of Big Data

Volume (size)

Volume refers to the amount of data generated every second. At this scale, data is measured in zettabytes or even brontobytes. Airplanes, for example, generate approximately 2.5 billion terabytes of data each year from the sensors installed in their engines. Self-driving cars will generate 2 petabytes of data every year. The agricultural industry also generates massive amounts of data from sensors installed in tractors. Shell uses super-sensitive sensors to find additional oil in wells, and if it installs these sensors at all 10,000 wells, it will collect approximately 10 exabytes of data annually. Even that is small compared with the Square Kilometre Array telescope, which is expected to generate 1 exabyte of data per day.
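To put these magnitudes in perspective, the short sketch below converts between storage units and works through two of the figures quoted above. It assumes decimal (SI) units, where each step up is a factor of 1,000; the helper name `convert` and the `UNITS` table are illustrative only.

```python
# Decimal (SI) storage units, expressed as powers of 1,000 bytes.
UNITS = {"TB": 4, "PB": 5, "EB": 6, "ZB": 7}

def convert(value, src, dst):
    """Convert a quantity from one unit to another (for example EB to PB)."""
    return value * 1000 ** UNITS[src] / 1000 ** UNITS[dst]

# Shell example from the text: about 10 EB per year across 10,000 wells.
per_well_pb = convert(10, "EB", "PB") / 10_000

# Square Kilometre Array example: about 1 EB per day, expressed in ZB per year.
ska_per_year_zb = convert(365, "EB", "ZB")

print("Petabytes per well per year:", per_well_pb)
print("Zettabytes per year for the telescope:", ska_per_year_zb)
```

Under these assumptions, the Shell figure works out to roughly one petabyte per well per year, and the telescope alone would produce a sizeable fraction of a zettabyte annually.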

Velocity (speed)

Velocity refers to how fast data is generated and how fast it must be processed to meet demand.

The speed at which data is currently created is almost unimaginable. Every minute, 100 hours of video are uploaded to YouTube. In addition, every minute over 200 million emails are sent, around 20 million photos are viewed and 30,000 are uploaded on Flickr, almost 300,000 tweets are sent, and almost 2.5 million Google queries are performed.
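Per-minute figures are easier to reason about when converted to per-second and per-day rates. The sketch below does that arithmetic for some of the numbers quoted above; the dictionary and function names are illustrative, and the inputs are simply the figures from the text.

```python
# Per-minute figures quoted in the text above.
PER_MINUTE = {
    "emails sent": 200_000_000,
    "tweets sent": 300_000,
    "Google queries": 2_500_000,
    "Flickr photos uploaded": 30_000,
}

def per_second(count_per_minute):
    """Average number of events per second."""
    return count_per_minute / 60

def per_day(count_per_minute):
    """Total number of events in a 24-hour day."""
    return count_per_minute * 60 * 24

for name, count in PER_MINUTE.items():
    print(f"{name}: {per_second(count):,.0f} per second, {per_day(count):,} per day")
```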

Variety

Variety refers to the different types of data. Data is classified into three types:

Structured: data that fits into relational database tables, such as financial records or population census data.

Semi-structured: data that does not conform to the formal structure of the data models associated with relational databases or other data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields. It is therefore also known as a self-describing structure. Examples include XML, HTML, and JSON data.

Unstructured: data such as images, videos, and sound. An estimated 90% of the world's data is unstructured.
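The difference between the three types is easiest to see side by side. The following is a minimal sketch using only the Python standard library; the sample records, names, and values are made up purely for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (like a table).
structured = io.StringIO("id,name,balance\n1,Asha,2500.00\n2,Ravi,1320.50\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing keys, and records may differ in shape.
semi_structured = json.loads(
    '[{"id": 1, "name": "Asha", "tags": ["premium"]},'
    ' {"id": 2, "name": "Ravi", "address": {"city": "Hyderabad"}}]'
)

# Unstructured: raw bytes with no fields at all.
unstructured = bytes([0x89, 0x50, 0x4E, 0x47])  # first four bytes of a PNG file

def field_names(records):
    """Return the sorted field names used by each record."""
    return [sorted(record.keys()) for record in records]

print(field_names(rows))             # identical for every row
print(field_names(semi_structured))  # varies from record to record
print(len(unstructured), "bytes with no field names")
```

Every structured row exposes the same fields, the JSON records describe their own (differing) fields, and the image bytes carry no field information, which is why unstructured data needs specialised processing before it can be queried.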

Real-life Examples of Big Data

Search engines

E-Commerce

Social media

Fraud detection

Crime Prevention

Solutions for Big Data

NoSQL: NoSQL is a class of database environments designed for real-time, interactive processing. It runs on commodity hardware, handles large amounts of data, and provides fast reads and writes. It supports incremental, horizontal scaling and changing data formats, and it provides cluster support. Typical data stored in NoSQL systems includes user transactions, sensor data, and customer profiles.

Hadoop: Hadoop provides batch and large-scale processing on commodity hardware and handles large amounts of data. It supports incremental, horizontal scaling and changing data formats, and it provides cluster support. Typical Hadoop workloads include predictive analytics, fraud detection, and recommendations.

Cassandra: Cassandra handles high volumes of data. It offers quick installation and configuration of multi-node clusters, scales from gigabytes to petabytes of data, and is designed for continuous availability. It is open source and is claimed to cost 80% to 90% less than an RDBMS. Cassandra handles high-velocity data with ease and uses schemas that support a broad variety of data.
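Horizontal scaling, mentioned for all three solutions above, means spreading data across more machines rather than buying a bigger one. The sketch below illustrates the general idea with simple hash-based partitioning. It is a teaching simplification and not the algorithm of any particular product (Cassandra, for instance, uses a more refined consistent-hashing scheme); the node names and keys are hypothetical.

```python
import hashlib

def node_for(key, nodes):
    """Pick a node for a key using a stable hash and modulo partitioning."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def distribute(keys, nodes):
    """Return a mapping of node name to the list of keys stored on it."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement

keys = [f"user-{i}" for i in range(1000)]  # hypothetical record keys

# The same data spread over a three-node cluster, then a four-node cluster.
three_nodes = distribute(keys, ["node-a", "node-b", "node-c"])
four_nodes = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

for node, held in four_nodes.items():
    print(node, "holds", len(held), "keys")
```

Adding a fourth node lowers the average share each machine has to store and serve. A drawback of plain modulo partitioning is that most keys move when the node count changes, which is one reason production systems prefer consistent hashing.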

HADOOP

Hadoop is an open-source framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model. It enables the storage and processing of Big Data in a distributed environment. Hadoop is written in Java and is developed by the Apache Software Foundation. It consists of two core components: HDFS and MapReduce.
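The "simple programming model" is MapReduce: a map step emits key-value pairs, the framework groups the pairs by key, and a reduce step combines each group. The following single-process sketch imitates those three phases for a word count. It does not use the Hadoop API, and the function names are illustrative; in a real cluster the input lines would be blocks stored in HDFS and the phases would run in parallel on many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce over a list of input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Two input lines stand in for two input splits.
splits = ["big data needs big storage", "hadoop stores big data"]
print(word_count(splits))
```

Because each map call looks at only one line and each reduce call looks at only one key, the work can be divided among as many machines as are available, which is what makes the model suitable for commodity clusters.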

History of Hadoop

Hadoop's origins trace back to the Google File System (GFS) paper, published in 2003, and the Google MapReduce paper, published in 2004, which described data processing over large clusters. Hadoop started as a subproject of Apache Nutch and was split out in 2006. It was created by Doug Cutting, who was working at Yahoo at the time. The name Hadoop comes from a toy elephant that belonged to Doug Cutting's son.

 

 

  • October 2003: Google File System paper released
  • December 2004: "MapReduce: Simplified Data Processing on Large Clusters" paper published
  • January 2006: Hadoop subproject created with mailing lists, JIRA, and wiki
  • January 2006: Hadoop is born from Nutch 197
  • February 2006: NDFS + MapReduce moved out of Apache Nutch to create Hadoop
  • February 2006: Owen O'Malley's first patch goes into Hadoop
  • February 2006: Hadoop is named after Cutting's son's yellow plush toy
  • April 2006: Hadoop 0.1.0 released
  • April 2006: Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours
  • May 2006: Yahoo deploys a 300-machine Hadoop cluster
  • October 2006: Yahoo Hadoop cluster reaches 600 machines
  • April 2007: Yahoo runs 2 clusters of 1,000 machines
  • June 2007: Only 3 companies on the "Powered by Hadoop" page
  • October 2007: First release of Hadoop that includes HBase
  • October 2007: Yahoo Labs creates Pig and donates it to the ASF
  • January 2008: YARN JIRA opened
  • January 2008: 20 companies on the "Powered by Hadoop" page
  • February 2008: Yahoo moves its web index onto Hadoop
  • February 2008: Yahoo! production search index generated by a 10,000-core Hadoop cluster
  • March 2008: First Hadoop Summit
  • April 2008: Hadoop sets the world record for the fastest system to sort a terabyte of data; running on a 910-node cluster, it sorted one terabyte in 209 seconds
  • May 2008: Hadoop wins TeraByte Sort (world record, sortbenchmark.org)
  • July 2008: Hadoop wins Terabyte Sort Benchmark
  • October 2008: Loading 10 TB/day in Yahoo clusters
  • October 2008: Cloudera, a Hadoop distributor, is founded
  • November 2008: Google MapReduce implementation sorts one terabyte in 68 seconds
  • March 2009: Yahoo runs 17 clusters with 24,000 machines
  • April 2009: Hadoop sorts a petabyte
  • May 2009: Yahoo! uses Hadoop to sort one terabyte in 62 seconds
  • June 2009: Second Hadoop Summit
  • July 2009: Hadoop Core is renamed Hadoop Common
  • July 2009: MapR, a Hadoop distributor, is founded
  • July 2009: HDFS becomes a separate subproject
  • July 2009: MapReduce becomes a separate subproject
  • January 2010: Kerberos support added to Hadoop
  • May 2010: Apache HBase graduates
  • June 2010: Third Hadoop Summit
  • June 2010: Yahoo: 4,000 nodes/70 petabytes
  • June 2010: Facebook: 2,300 clusters/40 petabytes
  • September 2010: Apache Hive graduates
  • September 2010: Apache Pig graduates
  • January 2011: Apache ZooKeeper graduates
  • January 2011: Facebook, LinkedIn, eBay, and IBM collectively contribute 200,000 lines of code
  • March 2011: Apache Hadoop takes top prize at the Media Guardian Innovation Awards
  • June 2011: Rob Bearden and Eric Baldeschwieler spin Hortonworks out of Yahoo
  • June 2011: Yahoo has 42K Hadoop nodes and hundreds of petabytes of storage
  • June 2011: Third Annual Hadoop Summit (1,700 attendees)
  • October 2011: Debate over which company had contributed more to Hadoop
  • January 2012: Hadoop community moves to separate from MapReduce and replace it with YARN
  • June 2012: San Jose Hadoop Summit (2,100 attendees)
  • November 2012: Apache Hadoop 1.0 available
  • March 2013: Hadoop Summit, Amsterdam (500 attendees)
  • March 2013: YARN deployed in production at Yahoo
  • June 2013: San Jose Hadoop Summit (2,700 attendees)
  • October 2013: Apache Hadoop 2.2 available
  • February 2014: Apache Hadoop 2.3 available
  • February 2014: Apache Spark becomes a top-level Apache project
  • April 2014: Hadoop Summit, Amsterdam (750 attendees)
  • June 2014: Apache Hadoop 2.4 available
  • June 2014: San Jose Hadoop Summit (3,200 attendees)
  • August 2014: Apache Hadoop 2.5 available
  • November 2014: Apache Hadoop 2.6 available
  • April 2015: Hadoop Summit Europe
  • June 2015: Apache Hadoop 2.7 available
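The two sort results in the timeline that state both a data size and a node count (April 2006 and April 2008) can be compared by converting them to throughput. The sketch below does that arithmetic, assuming decimal units (1 TB = 1,000,000 MB); the function names are illustrative, and the inputs are only the figures listed above.

```python
def cluster_mb_per_s(terabytes, seconds):
    """Aggregate sort throughput of the whole cluster in MB per second."""
    return terabytes * 1_000_000 / seconds

def node_mb_per_s(terabytes, seconds, nodes):
    """Average sort throughput contributed by each node in MB per second."""
    return cluster_mb_per_s(terabytes, seconds) / nodes

# (terabytes sorted, elapsed seconds, node count) taken from the timeline.
runs = {
    "April 2006": (1.8, 47.9 * 3600, 188),
    "April 2008": (1.0, 209, 910),
}

for label, (terabytes, seconds, nodes) in runs.items():
    total = cluster_mb_per_s(terabytes, seconds)
    each = node_mb_per_s(terabytes, seconds, nodes)
    print(f"{label}: {total:,.1f} MB/s in total, {each:.2f} MB/s per node")
```

The comparison is only indicative, since the two runs used different hardware and data sizes, but it shows that the improvement came from much higher per-node throughput as well as from a larger cluster.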

 

Highlights of Hadoop

Yahoo runs the world's largest Hadoop deployment, with over 42,000 nodes across 3 data centers.

Facebook follows, with about 2,000 nodes as of 2010.

Hadoop has more than 1,000 users.

Companies Working with Hadoop

1. Facebook
   Business: Social site
   Technical specs: 8 cores and 12 TB of storage
   Uses: Used as a source for reporting and machine learning

2. Twitter
   Business: Social site
   Technical specs: (not given)
   Uses: Hadoop has been used since 2010 to store and process tweets and log files, using the LZO compression technique because it is fast and frees up CPU for other tasks.

3. LinkedIn
   Business: Social site
   Technical specs: 2x4 and 2x6 cores, 6x2 TB SATA; 4,100 nodes
   Uses: LinkedIn's data flows through Hadoop clusters. User activity, server metrics, images, and transaction logs stored in HDFS are used by data analysts for business analytics, such as discovering people you may know.

4. Yahoo!
   Business: Online portal
   Technical specs: 4,500 nodes, 1 TB storage, 16 GB RAM
   Uses: Used for scaling tests

5. AOL
   Business: Online portal
   Technical specs: Machines with dual processors
   Uses: ETL-style processing and statistics generation

6. eBay
   Business: E-commerce
   Technical specs: 4K+ node cluster
   Uses: With 300+ million users browsing more than 350 million products listed on their website, eBay has one of the largest Hadoop clu
