Big Data
Big Data refers to large volumes of data of various types (structured, unstructured, and semi-structured) that traditional database applications cannot process. The challenges include storage, processing, transfer, search, analysis, and querying.
Characteristics of Big Data
Volume (size)
Volume refers to the amount of data generated every second, now measured in zettabytes or even brontobytes. Airplanes generate approximately 2.5 billion terabytes of data each year from the sensors installed in their engines. Self-driving cars will generate 2 petabytes of data every year. The agricultural industry also generates massive amounts of data from sensors installed in tractors. Shell uses super-sensitive sensors to find additional oil in its wells, and if it installed these sensors at all 10,000 wells it would collect approximately 10 exabytes of data annually. Even that is nothing compared to the Square Kilometre Array telescope, which will generate 1 exabyte of data per day.
Velocity (speed)
Velocity refers to the speed at which data is generated and how fast it must be processed to meet demand.
The speed at which data is created today is almost unimaginable: every minute, 100 hours of video are uploaded to YouTube, over 200 million emails are sent, around 20 million photos are viewed and 30,000 uploaded on Flickr, almost 300,000 tweets are sent, and almost 2.5 million search queries are performed on Google.
Variety
Variety refers to the different types of data. Data is classified into three types:
Structured: data that fits into relational database tables, such as financial details or a population census of every individual in the world.
Semi-structured: a form of structured data that does not conform to the formal structure of data models associated with relational databases or other data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is therefore also known as a self-describing structure. Examples include XML, HTML, and JSON data (a small example follows this list).
Unstructured: an estimated 90% of the world's data is unstructured, such as images, videos, and audio.
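To make the distinction concrete, here is a small JSON record of the kind a web application might produce; the field names and values are invented for illustration. Notice how the tags both carry the data and describe its structure, which is exactly what makes the format self-describing even though it fits no relational table:

    {
      "user": {
        "id": 1024,
        "name": "Ravi",
        "interests": ["cricket", "movies"],
        "lastLogin": "2015-06-01T10:30:00Z"
      }
    }

A relational design would need a fixed schema and a separate child table for the interests list; the JSON record carries that hierarchy inline.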
Real-life Examples of Big Data
Search engines
E-Commerce
Social media
Fraud detection
Crime Prevention
Solutions for Big Data
NoSQL: NoSQL is a class of database systems designed for real-time, interactive processing on commodity hardware, with cluster support. It handles large amounts of data, offers fast reads and writes, and supports incremental horizontal scaling and changing data formats. In a NoSQL store we can keep user transactions, sensor data, and customer profiles, as sketched below.
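The lesson names no particular NoSQL product, so here is a minimal sketch of the fast-write, flexible-schema idea assuming a MongoDB document store and its official Java driver; the connection string, database, collection, and field names are all invented for illustration:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class ProfileStore {
      public static void main(String[] args) {
        // Connect to a local MongoDB instance (hypothetical address).
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
          MongoCollection<Document> profiles =
              client.getDatabase("shop").getCollection("customerProfiles");
          // No fixed schema: each document can carry different fields.
          profiles.insertOne(new Document("userId", 42)
              .append("name", "Asha")
              .append("lastLogin", "2015-06-01"));
          // Fast key-based read of the document just written.
          Document found = profiles.find(new Document("userId", 42)).first();
          System.out.println(found.toJson());
        }
      }
    }

The point of the sketch is the absence of a CREATE TABLE step: the customer profile is stored and queried directly, and new fields can be added per document as formats change.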
Hadoop: Hadoop also supports incremental horizontal scaling and changing data formats, but provides batch, large-scale processing on commodity hardware, again with cluster support. It handles large amounts of data; typical workloads include predictive analytics, fraud detection, and recommendations. (A MapReduce example appears in the HADOOP section below.)
Cassandra: Cassandra handles high volumes of data, offers quick installation and configuration of multi-node clusters, and scales from gigabytes to petabytes. It is designed for continuous availability, is open source, and can cost 80-90% less than an RDBMS. Cassandra handles high-velocity data with ease and uses schemas that support a broad variety of data; a brief sketch follows.
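To make the quick-setup and schema points concrete, here is a hedged sketch assuming the DataStax Java driver for Cassandra (4.x-style API); the keyspace, table, and sample reading are invented for illustration:

    import com.datastax.oss.driver.api.core.CqlSession;

    public class SensorStore {
      public static void main(String[] args) {
        // Connects to a local Cassandra node; the datacenter name is
        // illustrative and depends on the cluster configuration.
        try (CqlSession session = CqlSession.builder()
                .withLocalDatacenter("datacenter1")
                .build()) {
          session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
              + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
          // Partitioned by sensor, clustered by time: suited to
          // high-velocity append-style writes.
          session.execute("CREATE TABLE IF NOT EXISTS demo.readings ("
              + "sensor_id int, ts timestamp, value double, "
              + "PRIMARY KEY (sensor_id, ts))");
          session.execute("INSERT INTO demo.readings (sensor_id, ts, value) "
              + "VALUES (1, toTimestamp(now()), 21.5)");
          System.out.println(session.execute(
              "SELECT count(*) FROM demo.readings").one().getLong(0));
        }
      }
    }

The same CQL statements can be run directly in the cqlsh shell; the Java wrapper simply shows how little code a multi-node-ready table needs.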
HADOOP
Hadoop is an open-source framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model, enabling the storage and processing of Big Data in a distributed environment. It is written in Java and developed by the Apache Software Foundation. It consists of two core components: HDFS and MapReduce.
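To give a concrete feel for that programming model, here is the classic word-count job written against the Hadoop MapReduce Java API; the class name and paths are ours, but the Mapper/Reducer/Job structure is standard Hadoop:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum the 1s for each word gathered from all mappers.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

To run it (assuming the input files were first copied into HDFS, e.g. with hdfs dfs -put): hadoop jar wordcount.jar WordCount /input /output. The mappers emit (word, 1) pairs in parallel across the cluster, the framework groups the pairs by key, and the reducers sum the counts; HDFS stores both the input splits and the final output.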
History of Hadoop
Hadoop's origins lie in the Google File System (GFS) paper published in 2003 and Google's MapReduce paper, which described data processing over large clusters. Hadoop began as a subproject of Apache Nutch and became a separate project in 2006. It was founded by Doug Cutting, who was then working at Yahoo!. The name Hadoop comes from the toy elephant of Doug Cutting's son.
Year | Month | Event |
2003 | October | Google File System paper released |
2004 | December | MapReduce: Simplified Data Processing on Large Clusters |
2006 | January | Hadoop subproject created with mailing lists, jira, and wiki |
2006 | January | Hadoop is born from Nutch 197 |
2006 | February | NDFS+ MapReduce moved out of Apache Nutch to create Hadoop |
2006 | February | Owen O'Malley's first patch goes into Hadoop |
2006 | February | Hadoop is named after Cutting's son's yellow plush toy |
2006 | April | Hadoop 0.1.0 released |
2006 | April | Hadoop sorts 1.8TB on 188 nodes in 47.9 hours |
2006 | May | Yahoo deploys 300 machine Hadoop cluster |
2006 | October | Yahoo Hadoop cluster reaches 600 machines |
2007 | April | Yahoo runs 2 clusters of 1,000 machines |
2007 | June | Only 3 companies on "Powered by Hadoop Page" |
2007 | October | First release of Hadoop that includes HBase |
2007 | October | Yahoo Labs creates Pig, and donates it to the ASF |
2008 | January | YARN JIRA opened |
2008 | January | 20 companies on "Powered by Hadoop Page" |
2008 | February | Yahoo moves its web index onto Hadoop |
2008 | February | Yahoo! production search index generated by a 10,000-core Hadoop cluster |
2008 | March | First Hadoop Summit |
2008 | April | Hadoop sets a world record as the fastest system to sort a terabyte of data: running on a 910-node cluster, it sorted one terabyte in 209 seconds |
2008 | May | Hadoop wins TeraByte Sort (World Record sortbenchmark.org) |
2008 | July | Hadoop wins Terabyte Sort Benchmark |
2008 | October | Loading 10TB/day in Yahoo clusters |
2008 | October | Cloudera, Hadoop distributor is founded |
2008 | November | Google MapReduce implementation sorted one terabyte in 68 seconds |
2009 | March | Yahoo runs 17 clusters with 24,000 machines |
2009 | April | Hadoop sorts a petabyte |
2009 | May | Yahoo! used Hadoop to sort one terabyte in 62 seconds |
2009 | June | Second Hadoop Summit |
2009 | July | Hadoop Core is renamed Hadoop Common |
2009 | July | MapR, Hadoop distributor founded |
2009 | July | HDFS now a separate subproject |
2009 | July | MapReduce now a separate subproject |
2010 | January | Kerberos support added to Hadoop |
2010 | May | Apache HBase Graduates |
2010 | June | Third Hadoop Summit |
2010 | June | Yahoo 4,000 nodes/70 petabytes |
2010 | June | Facebook 2,300 clusters/40 petabytes |
2010 | September | Apache Hive Graduates |
2010 | September | Apache Pig Graduates |
2011 | January | Apache Zookeeper Graduates |
2011 | January | Facebook, LinkedIn, eBay and IBM collectively contribute 200,000 lines of code |
2011 | March | Apache Hadoop takes top prize at Media Guardian Innovation Awards |
2011 | June | Rob Bearden and Eric Baldeschwieler spin Hortonworks out of Yahoo! |
2011 | June | Yahoo has 42K Hadoop nodes and hundreds of petabytes of storage |
2011 | June | Fourth Annual Hadoop Summit (1,700 attendees) |
2011 | October | Debate over which company had contributed more to Hadoop. |
2012 | January | Hadoop community moves to split MapReduce out and replace it with YARN |
2012 | June | San Jose Hadoop Summit (2,100 attendees) |
2012 | November | Apache Hadoop 1.0 Available |
2013 | March | Hadoop Summit - Amsterdam (500 attendees) |
2013 | March | YARN deployed in production at Yahoo |
2013 | June | San Jose Hadoop Summit (2,700 attendees) |
2013 | October | Apache Hadoop 2.2 Available |
2014 | February | Apache Hadoop 2.3 Available |
2014 | February | Apache Spark becomes a top-level Apache project |
2014 | April | Hadoop summit Amsterdam (750 attendees) |
2014 | June | Apache Hadoop 2.4 Available |
2014 | June | San Jose Hadoop Summit (3,200 attendees) |
2014 | August | Apache Hadoop 2.5 Available |
2014 | November | Apache Hadoop 2.6 Available |
2015 | April | Hadoop Summit Europe |
2015 | June | Apache Hadoop 2.7 Available |
Highlights of Hadoop
Yahoo! runs the world's largest cluster, with over 42,000 nodes across 3 data centers.
Facebook followed, with 2,000 nodes in 2010.
There are over 1,000 users.
COMPANIES WORKING WITH HADOOP
| # | Company | Business | Technical Specs | Uses |
| 1 | Facebook | Social site | 8 cores and 12 TB of storage | Used as a source for reporting and machine learning |
| 2 | Twitter | Social site | | Hadoop has been used since 2010 to store and process tweets and log files, using LZO compression because it is fast and frees the CPU for other tasks |
| 3 | LinkedIn | Social site | 2x4 and 2x6 cores, 6x2 TB SATA, 4,100 nodes | LinkedIn's data flows through Hadoop clusters: user activity, server metrics, images, and transaction logs stored in HDFS are used by data analysts for business analytics such as discovering people you may know |
| 4 | Yahoo! | Online portal | 4,500 nodes, 1 TB storage and 16 GB RAM per node | Used for scaling tests |
| 5 | AOL | Online portal | Dual-processor machines | ETL-style processing and statistics generation |
| 6 | eBay | E-commerce | 4,000+ node cluster | With 300+ million users browsing more than 350 million products listed on their website, eBay has one of the largest Hadoop clusters in the industry |