Course Outline
- Introduction
  - Hadoop history, concepts
  - Ecosystem
  - Distributions
  - High-level architecture
  - Hadoop myths
  - Hadoop challenges (hardware / software)
  - Labs: discuss your Big Data projects and problems
- Planning and installation
  - Selecting software, Hadoop distributions
  - Sizing the cluster, planning for growth
  - Selecting hardware and network
  - Rack topology
  - Installation
  - Multi-tenancy
  - Directory structure, logs
  - Benchmarking
  - Labs: cluster install, run performance benchmarks
- HDFS operations
  - Concepts (horizontal scaling, replication, data locality, rack awareness)
  - Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
  - Health monitoring
  - Command-line and browser-based administration
  - Adding storage, replacing defective drives
  - Labs: getting familiar with the HDFS command line
- Data ingestion
  - Flume for logs and other data ingestion into HDFS
  - Sqoop for importing from SQL databases to HDFS, as well as exporting back to SQL
  - Hadoop data warehousing with Hive
  - Copying data between clusters (distcp)
  - Using S3 as a complement to HDFS
  - Data ingestion best practices and architectures
  - Labs: setting up and using Flume and Sqoop
- MapReduce operations and administration
  - Parallel computing before MapReduce: comparing HPC and Hadoop administration
  - MapReduce cluster loads
  - Nodes and daemons (JobTracker, TaskTracker)
  - MapReduce UI walkthrough
  - MapReduce configuration
  - Job configuration
  - Optimizing MapReduce
  - Fool-proofing MR: what to tell your programmers
  - Labs: running MapReduce examples
- YARN: new architecture and new capabilities
  - YARN design goals and implementation architecture
  - New actors: ResourceManager, NodeManager, ApplicationMaster
  - Installing YARN
  - Job scheduling under YARN
  - Labs: investigate job scheduling
- Advanced topics
  - Hardware monitoring
  - Cluster monitoring
  - Adding and removing servers, upgrading Hadoop
  - Backup, recovery and business continuity planning
  - Oozie job workflows
  - Hadoop high availability (HA)
  - Hadoop Federation
  - Securing your cluster with Kerberos
  - Labs: set up monitoring
- Optional tracks
  - Cloudera Manager for cluster administration, monitoring, and routine tasks; installation and use. In this track, all exercises and labs are performed within the Cloudera distribution environment (CDH5)
  - Ambari for cluster administration, monitoring, and routine tasks; installation and use. In this track, all exercises and labs are performed within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0)
Requirements
- comfortable with basic Linux system administration
- basic scripting skills
Prior knowledge of Hadoop and distributed computing is not required; both are introduced and explained in the course.
Lab environment
Zero install: there is no need to install Hadoop software on students' machines. A working Hadoop cluster will be provided.
Students will need the following:
- an SSH client (Linux and Mac already include one; on Windows, PuTTY is recommended)
- a browser to access the cluster. We recommend Firefox with the FoxyProxy extension installed