1. Hadoop: The Definitive Guide, 4th Edition (2015)
Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.
- Learn fundamental components such as MapReduce, HDFS, and YARN
- Explore MapReduce in depth, including steps for developing applications with it
- Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
- Learn two data formats: Avro for data serialization and Parquet for nested data
- Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
- Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
- Learn the HBase distributed database and the ZooKeeper distributed configuration service
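As a taste of the programming model those chapters explore, here is a minimal sketch of MapReduce's map, shuffle, and reduce phases, simulated in plain Python on a single machine. The function names and the word-count task are illustrative only; they are not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input record
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word
    return {word: sum(values) for word, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # → 3
```

On a real cluster each phase runs in parallel across many machines, with HDFS storing the inputs and YARN scheduling the work; the data flow, however, is exactly this.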
Author(s): Tom White
2. Data Analytics with Hadoop: An Introduction for Data Scientists (2016)
Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the Hadoop ecosystem is perfect for the job. Instead of the deployment, operations, or software development tasks usually associated with distributed computing, you’ll focus on the particular analyses you can build, the data warehousing techniques Hadoop provides, and the higher-order data workflows the framework can produce.
Data scientists and analysts will learn how to perform a wide range of techniques, from writing MapReduce and Spark applications with Python to using advanced modeling and data management with Spark MLlib, Hive, and HBase. You’ll also learn about the analytical processes and data systems available to build and empower data products that can handle—and actually require—huge amounts of data.
- Understand core concepts behind Hadoop and cluster computing
- Use design patterns and parallel analytical algorithms to create distributed data analysis jobs
- Learn about data management, mining, and warehousing in a distributed context using Apache Hive and HBase
- Use Sqoop and Apache Flume to ingest data from relational databases
- Program complex Hadoop and Spark applications with Apache Pig and Spark DataFrames
- Perform machine learning techniques such as classification, clustering, and collaborative filtering with Spark’s MLlib
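The Spark style of programming mentioned above chains transformations over distributed collections. A rough local analogue in plain Python is sketched below; `reduce_by_key` is a hand-rolled stand-in for Spark's `RDD.reduceByKey`, not an import from `pyspark`, and the ratings data is made up for illustration:

```python
from functools import reduce
from itertools import groupby

def reduce_by_key(pairs, fn):
    # Local stand-in for Spark's reduceByKey: sort, group by key, fold each group
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [(k, reduce(fn, (v for _, v in group)))
            for k, group in groupby(ordered, key=lambda kv: kv[0])]

# Average rating per product, Spark-style:
# map each rating to (key, (sum, count)), reduce pairwise, then divide
ratings = [("fox", 4), ("dog", 5), ("fox", 2), ("dog", 3)]
sums = reduce_by_key([(k, (v, 1)) for k, v in ratings],
                     lambda a, b: (a[0] + b[0], a[1] + b[1]))
averages = {k: s / n for k, (s, n) in sums}
print(averages)
```

In real PySpark the same pipeline would be `sc.parallelize(ratings).mapValues(lambda v: (v, 1)).reduceByKey(...)`, with the grouping and folding distributed across the cluster.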
Author(s): Benjamin Bengfort, Jenny Kim
3. Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem (Addison-Wesley Data & Analytics Series) (2015)
Get Started Fast with Apache Hadoop® 2, YARN, and Today’s Hadoop Ecosystem
With Hadoop 2.x and YARN, Hadoop moves beyond MapReduce to become practical for virtually any type of data processing. Hadoop 2.x and the Data Lake concept represent a radical shift away from conventional approaches to data usage and storage. Hadoop 2.x installations offer unmatched scalability and breakthrough extensibility that supports new and existing Big Data analytics processing methods and models.
Hadoop® 2 Quick-Start Guide is the first easy, accessible guide to Apache Hadoop 2.x, YARN, and the modern Hadoop ecosystem. Building on his unsurpassed experience teaching Hadoop and Big Data, author Douglas Eadline covers all the basics you need to know to install and use Hadoop 2 on personal computers or servers, and to navigate the powerful technologies that complement it.
Eadline concisely introduces and explains every key Hadoop 2 concept, tool, and service, illustrating each with a simple “beginning-to-end” example and identifying trustworthy, up-to-date resources for learning more.
This guide is ideal if you want to learn about Hadoop 2 without getting mired in technical details. Douglas Eadline will bring you up to speed quickly, whether you’re a user, admin, devops specialist, programmer, architect, analyst, or data scientist.
- Understanding what Hadoop 2 and YARN do, and how they improve on Hadoop 1 with MapReduce
- Understanding Hadoop-based Data Lakes versus RDBMS Data Warehouses
- Installing Hadoop 2 and core services on Linux machines, virtualized sandboxes, or clusters
- Exploring the Hadoop Distributed File System (HDFS)
- Understanding the essentials of MapReduce and YARN application programming
- Simplifying programming and data movement with Apache Pig, Hive, Sqoop, Flume, Oozie, and HBase
- Observing application progress, controlling jobs, and managing workflows
- Managing Hadoop efficiently with Apache Ambari, including recipes for an HDFS-to-NFSv3 gateway, HDFS snapshots, and YARN configuration
- Learning basic Hadoop 2 troubleshooting, and installing Apache Hue and Apache Spark
Author(s): Douglas Eadline
4. Hadoop in 24 Hours, Sams Teach Yourself (2017)
Apache Hadoop is the technology at the heart of the Big Data revolution, and Hadoop skills are in enormous demand. Now, in just 24 lessons of one hour or less, you can learn all the skills and techniques you'll need to deploy each key component of a Hadoop platform in your local environment or in the cloud, building a fully functional Hadoop cluster and using it with real programs and datasets. Each short, easy lesson builds on all that's come before, helping you master all of Hadoop's essentials and extend them to meet your unique challenges. Apache Hadoop in 24 Hours, Sams Teach Yourself covers all this, and much more:
- Understanding Hadoop and the Hadoop Distributed File System (HDFS)
- Importing data into Hadoop, and processing it there
- Mastering basic MapReduce Java programming, and using advanced MapReduce API concepts
- Making the most of Apache Pig and Apache Hive
- Implementing and administering YARN
- Taking advantage of the full Hadoop ecosystem
- Managing Hadoop clusters with Apache Ambari
- Working with the Hadoop User Environment (HUE)
- Scaling, securing, and troubleshooting Hadoop environments
- Integrating Hadoop into the enterprise
- Deploying Hadoop in the cloud
- Getting started with Apache Spark
Step-by-step instructions walk you through common questions, issues, and tasks; Q-and-As, Quizzes, and Exercises build and test your knowledge; “Did You Know?” tips offer insider advice and shortcuts; and “Watch Out!” alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Hadoop to solve a wide spectrum of Big Data problems.
Author(s): Jeffrey Aven
5. Hadoop Application Architectures (2015)
To reinforce those lessons, the book’s second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether you’re designing a new Hadoop application, or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process.
This book covers:
- Factors to consider when using Hadoop to store and model data
- Best practices for moving data in and out of the system
- Data processing frameworks, including MapReduce, Spark, and Hive
- Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
- Giraph, GraphX, and other tools for large graph processing on Hadoop
- Using workflow orchestration and scheduling tools such as Apache Oozie
- Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
- Architecture examples for clickstream analysis, fraud detection, and data warehousing
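Two of the processing patterns listed above, duplicate removal and windowing analytics, can be sketched locally in plain Python. The record layout and the window size here are illustrative assumptions, not taken from the book:

```python
def deduplicate(records, key=lambda r: r["id"]):
    # Keep the first record seen for each key, dropping later duplicates
    seen, unique = set(), []
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            unique.append(rec)
    return unique

def windowed_sums(values, size):
    # Sliding-window analytics: sum over each run of `size` consecutive values
    return [sum(values[i:i + size]) for i in range(len(values) - size + 1)]

clicks = [{"id": 1, "page": "/a"}, {"id": 2, "page": "/b"}, {"id": 1, "page": "/a"}]
print(len(deduplicate(clicks)))        # duplicate id 1 removed → 2
print(windowed_sums([1, 2, 3, 4], 2))  # → [3, 5, 7]
```

At Hadoop scale the same ideas are typically expressed as a group-by on the dedup key and as window functions in Spark or Hive, so the grouping and windows can be computed in parallel.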
Author(s): Mark Grover, Ted Malaska
6. Hadoop For Dummies (For Dummies Series) (2014)
Let Hadoop For Dummies help you harness the power of your data and rein in the information overload.
Big data has become big business, and companies and organizations of all sizes are struggling to find ways to retrieve valuable information from their massive data sets without becoming overwhelmed. Enter Hadoop and this easy-to-understand For Dummies guide. Hadoop For Dummies helps readers understand the value of big data, make a business case for using Hadoop, navigate the Hadoop ecosystem, and build and manage Hadoop applications and clusters.
- Explains the origins of Hadoop, its economic benefits, and its functionality and practical applications
- Helps you find your way around the Hadoop ecosystem, program MapReduce, utilize design patterns, and get your Hadoop cluster up and running quickly and easily
- Details how to use Hadoop applications for data mining, web analytics and personalization, large-scale text processing, data science, and problem-solving
- Shows you how to improve the value of your Hadoop cluster, maximize your investment in Hadoop, and avoid common pitfalls when building your Hadoop cluster
From programmers challenged with building and maintaining affordable, scalable data systems to administrators who must deal with huge volumes of information effectively and efficiently, this how-to has something to help you with Hadoop.
Author(s): Dirk deRoos
7. Learn Hadoop in 1 Day (2017)
Hadoop has changed the way large data sets are analyzed, stored, transferred, and processed. At low cost, it provides benefits such as tolerance of partial failures, fault tolerance, consistency, scalability, and flexible schemas, and it also supports cloud deployment. More and more individuals are looking to master Hadoop skills.
When starting out, most users are unsure how to proceed with Hadoop: what prerequisites and data structures they should be familiar with, and how to make the most efficient use of Hadoop and its ecosystem. This e-book is designed to answer those questions.
The book gives insights into many Hadoop libraries and packages that are unfamiliar to many Big Data analysts and architects, and it also covers Hadoop MapReduce and HDFS. Its examples are well chosen and demonstrate how to control the Hadoop ecosystem through various shell commands. With this book, users will gain expertise in Hadoop technology and its related components, at a modest price.
After going through this book, you will also acquire the knowledge of Hadoop security required for Hadoop certifications such as CCAH and CCDH.
Chapter 1: What Is Big Data
Examples Of ‘Big Data’
Categories Of ‘Big Data’
Characteristics Of ‘Big Data’
Advantages Of Big Data Processing
Chapter 2: Introduction to Hadoop
Components of Hadoop
Features Of ‘Hadoop’
Network Topology In Hadoop
Chapter 3: Hadoop Installation
Chapter 4: HDFS
Access HDFS using JAVA API
Access HDFS Using COMMAND-LINE INTERFACE
Chapter 5: MapReduce
How MapReduce works
How MapReduce Organizes Work?
Chapter 6: First Program
Understanding MapReducer Code
Explanation of SalesMapper Class
Explanation of SalesCountryReducer Class
Explanation of SalesCountryDriver Class
Chapter 7: Counters & Joins In MapReduce
Two types of counters
Chapter 8: MapReduce Hadoop Program To Join Data
Chapter 9: Flume and Sqoop
What is SQOOP in Hadoop?
What is FLUME in Hadoop?
Some Important features of FLUME
Chapter 10: Pig
Introduction to PIG
Create your First PIG Program
PART 1) Pig Installation
PART 2) Pig Demo
Chapter 11: OOZIE
What is OOZIE?
How does OOZIE work?
Example Workflow Diagram
Oozie workflow application
Why use Oozie?
FEATURES OF OOZIE
Author(s): Krishna Rungta
8. Hadoop Operations (2012)
Rather than run through all possible scenarios, this pragmatic operations guide calls out what works, as demonstrated in critical deployments.
- Get a high-level overview of HDFS and MapReduce: why they exist and how they work
- Plan a Hadoop deployment, from hardware and OS selection to network requirements
- Learn setup and configuration details with a list of critical properties
- Manage resources by sharing a cluster across multiple groups
- Get a runbook of the most common cluster maintenance tasks
- Monitor Hadoop clusters, and learn troubleshooting with the help of real-world war stories
- Use basic tools and techniques to handle backup and catastrophic failure
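For a flavor of the "critical properties" such a guide walks through, here is a hedged sketch of a minimal `hdfs-site.xml` fragment. The property names are standard Hadoop 2 configuration keys, but the values and paths are placeholders; the right settings depend entirely on your cluster:

```xml
<configuration>
  <!-- How many copies of each block HDFS keeps (3 is the common default) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Where the NameNode persists filesystem metadata (placeholder path) -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/1/dfs/nn</value>
  </property>
</configuration>
```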
Author(s): Eric Sammer
9. Hadoop in Action (2010)
Hadoop in Action teaches readers how to use Hadoop and write MapReduce programs. The intended readers are programmers, architects, and project managers who have to process large amounts of data offline. Hadoop in Action will lead the reader from obtaining a copy of Hadoop to setting it up in a cluster and writing data analytic programs.
The book begins by making the basic idea of Hadoop and MapReduce easier to grasp by applying the default Hadoop installation to a few easy-to-follow tasks, such as analyzing changes in word frequency across a body of documents. The book continues through the basic concepts of MapReduce applications developed using Hadoop, including a close look at framework components, use of Hadoop for a variety of data analysis tasks, and numerous examples of Hadoop in action.
Hadoop in Action will explain how to use Hadoop and present design patterns and practices of programming MapReduce. MapReduce is a complex idea both conceptually and in its implementation, and Hadoop users are challenged to learn all the knobs and levers for running Hadoop. This book takes you beyond the mechanics of running Hadoop, teaching you to write meaningful programs in a MapReduce framework.
This book assumes the reader will have a basic familiarity with Java, as most code examples will be written in Java. Familiarity with basic statistical concepts (e.g. histogram, correlation) will help the reader appreciate the more advanced data processing examples.
Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. Also available is all code from the book.
Author(s): Chuck Lam
10. Hadoop BIG DATA Interview Questions You’ll Most Likely Be Asked (Job Interview Questions Series) (Volume 11) (2017)
• 76 HR Interview Questions
• Real life scenario based questions
• Strategies to respond to interview questions
• 2 Aptitude Tests
Hadoop BIG DATA Interview Questions You’ll Most Likely Be Asked is the perfect companion for standing out in today’s competitive job market. Rather than going through comprehensive, textbook-sized reference guides, this book includes only the information required immediately for a job search to build an IT career. This book puts the interviewee in the driver’s seat and helps them steer their way to impress the interviewer.
The following is included in this book: a) 200 Hadoop BIG DATA Interview Questions, Answers and Proven Strategies for getting hired as an IT professional
b) Dozens of examples to respond to interview questions
c) 76 HR Questions with Answers and proven strategies to give specific, impressive answers that help nail the interviews
d) 2 Aptitude Tests download available on www.vibrantpublishers.com
Author(s): Vibrant Publishers
11. Hadoop in Practice: Includes 104 Techniques (2014)
Hadoop in Practice, Second Edition provides over 100 tested, instantly useful techniques that will help you conquer big data, using Hadoop. This revised new edition covers changes and new features in the Hadoop core architecture, including MapReduce 2. Brand new chapters cover YARN and integrating Kafka, Impala, and Spark SQL with Hadoop. You’ll also get new and updated techniques for Flume, Sqoop, and Mahout, all of which have seen major new versions recently. In short, this is the most practical, up-to-date coverage of Hadoop available anywhere.
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the Book
It’s always a good time to upgrade your Hadoop skills! Hadoop in Practice, Second Edition provides a collection of 104 tested, instantly useful techniques for analyzing real-time streams, moving data securely, machine learning, managing large-scale clusters, and taming big data using Hadoop. This completely revised edition covers changes and new features in Hadoop core, including MapReduce 2 and YARN. You’ll pick up hands-on best practices for integrating Spark, Kafka, and Impala with Hadoop, and get new and updated techniques for the latest versions of Flume, Sqoop, and Mahout. In short, this is the most practical, up-to-date coverage of Hadoop available.
Readers need to know a programming language like Java and have basic familiarity with Hadoop.
- Thoroughly updated for Hadoop 2
- How to write YARN applications
- Integrate real-time technologies like Storm, Impala, and Spark
- Predictive analytics using Mahout and R
About the Author
Alex Holmes works on tough big-data problems. He is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects.
Table of Contents
- Hadoop in a heartbeat
- Introduction to YARN
- Data serialization—working with text and beyond
- Organizing and optimizing data in HDFS
- Moving data into and out of Hadoop
- Applying MapReduce patterns to big data
- Utilizing data structures and algorithms at scale
- Tuning, debugging, and testing
- SQL on Hadoop
- Writing a YARN application
PART 1 BACKGROUND AND FUNDAMENTALS
PART 2 DATA LOGISTICS
PART 3 BIG DATA PATTERNS
PART 4 BEYOND MAPREDUCE
Author(s): Alex Holmes
12. Hadoop: The Definitive Guide (2015)
Author(s): Tom White