Big Data Essentials Bootcamp – 5 Days

Course Description

Big Data requires the right tools and skills, and this workshop takes you from zero to hero: it gives students the necessary working knowledge of Hadoop, Spark, and NoSQL. With these three fundamentals, you will be able to build systems that process massive amounts of data in archival, batch, interactive, and real-time fashion. The workshop also lays the foundation for sound analytics, allowing you to extract insights from data.

What You Will Learn

  • Hadoop: HDFS, MapReduce, Pig, Hive
  • Spark: Spark Core, Spark SQL, the Spark Java API, Spark Streaming
  • NoSQL: Cassandra/HBase architecture, Java API, drivers, data modeling

Intended Audience

Developers

Format

50% lecture, 50% hands-on labs

Lab Environment

Zero install: there is no need to install Hadoop, Spark, or other software on students’ machines. Working clusters and environments are provided for students.

Students will need the following

  • an SSH client (Linux and macOS already include one; for Windows, PuTTY is recommended)
  • a browser to access the cluster

Prerequisites

Students should be familiar with the Java programming language (most programming exercises are in Java) and comfortable in a Linux environment (i.e., able to navigate the Linux command line and edit files with vi or nano).

Outline

  • Hadoop
    • Introduction to Hadoop
      Hadoop history and concepts
      ecosystem
      distributions
      high-level architecture
      Hadoop myths
      Hadoop challenges
      hardware / software
    • HDFS Overview
      concepts (horizontal scaling, replication, data locality, rack awareness)
      architecture (NameNode, Secondary NameNode, DataNode)
      data integrity
      future of HDFS: NameNode HA, Federation
      lab exercises
    • MapReduce Overview
      MapReduce concepts
      phases: driver, mapper, shuffle/sort, reducer
      thinking in MapReduce
      future of MapReduce (YARN)
      lab exercises (a minimal word-count sketch follows)
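
As a taste of the lab material, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The class name and the command-line input/output paths are illustrative, not part of the course materials.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper phase: emits (word, 1) for every token in the input line
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer phase: sums the counts for each word after shuffle/sort
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver phase: configures and submits the job; paths come from the command line
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
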
    • Pig
      Pig vs. Java vs. MapReduce
      the Pig Latin language
      user-defined functions
      understanding Pig job flow
      basic data analysis with Pig
      complex data analysis with Pig
      working with multiple datasets in Pig
      advanced concepts
      lab exercises
    • Hive
      Hive concepts
      architecture
      data types
      Hive data management
      Hive vs. SQL
      lab exercises (a Hive-over-JDBC sketch follows)
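
To preview the “Hive vs. SQL” material: HiveQL reads much like SQL and can be queried from Java over JDBC, as in this minimal sketch. The host, database, credentials, and the page_views table are hypothetical; the hive-jdbc jar is assumed to be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, and database are placeholders
        String url = "jdbc:hive2://hive-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "student", "");
             Statement stmt = conn.createStatement();
             // HiveQL query over a hypothetical 'page_views' table
             ResultSet rs = stmt.executeQuery(
                 "SELECT user_id, COUNT(*) AS views "
                 + "FROM page_views GROUP BY user_id LIMIT 10")) {
          while (rs.next()) {
            System.out.println(rs.getString("user_id") + "\t" + rs.getLong("views"));
          }
        }
      }
    }
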
  • Spark
    • Spark Basics
      background and history
      Spark and Hadoop
      Spark concepts and architecture
      the Spark ecosystem (core, Spark SQL, MLlib, streaming)
    • First Look at Spark
      Spark in local mode
      the Spark web UI
      the Spark shell
      analyzing a dataset – part 1
      inspecting RDDs
    • RDDs in Depth
      partitions
      RDD operations / transformations
      RDD types
      MapReduce on RDDs
      caching and persistence
      sharing cached RDDs
    • Spark API Programming
      introduction to the Spark API / RDD API
      submitting the first program to Spark
      debugging / logging
      configuration properties (a minimal RDD word-count sketch follows)
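
For comparison with the Hadoop version above, a minimal word count using the Spark Java RDD API (Spark 2.x style). The local-mode master setting and the command-line paths are assumptions for illustration; on a cluster the master would be supplied by spark-submit.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
      public static void main(String[] args) {
        // Local mode is handy for a first run; on the class cluster the
        // master is normally set by spark-submit instead
        SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          JavaRDD<String> lines = sc.textFile(args[0]);  // one element per input line
          JavaPairRDD<String, Integer> counts = lines
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
              .mapToPair(word -> new Tuple2<>(word, 1))
              .reduceByKey(Integer::sum);                // same shape as MapReduce
          counts.saveAsTextFile(args[1]);
        }
      }
    }
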

    • Spark Streaming
      streaming overview
      streaming operations
      sliding window operations
      writing Spark Streaming applications (a minimal streaming sketch follows)
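
A minimal sketch of a Spark Streaming word count with a sliding window, using the DStream Java API. The socket source, host/port, and window/slide durations are placeholders (e.g., feed it text with `nc -lk 9999`).

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StreamingWordCount {
      public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming word count").setMaster("local[2]");
        // Micro-batches of 5 seconds
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Text arriving on a socket; host and port are placeholders
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaPairDStream<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            // Sliding window: counts over the last 30s, recomputed every 10s
            .reduceByKeyAndWindow(Integer::sum, Durations.seconds(30), Durations.seconds(10));

        counts.print();
        jssc.start();
        jssc.awaitTermination();
      }
    }
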

  • NoSQL
    • Introduction to Big Data / NoSQL
      NoSQL overview
      the CAP theorem
      when NoSQL is appropriate
      the NoSQL ecosystem
    • Cassandra Basics
      Cassandra nodes, clusters, and datacenters
      keyspaces, tables, rows, and columns
      partitioning, replication, tokens
      quorum and consistency levels
      labs

    • Cassandra Drivers
      introduction to the Java driver
      CRUD (create / read / update / delete) operations using the Java client
      asynchronous queries
      labs (a Java-driver CRUD sketch follows)
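
A minimal CRUD sketch in the style of the DataStax Java driver (4.x). The contact point, datacenter, keyspace, and the users table are hypothetical and assumed to already exist.

    import java.net.InetSocketAddress;
    import java.util.concurrent.CompletionStage;

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.AsyncResultSet;
    import com.datastax.oss.driver.api.core.cql.ResultSet;
    import com.datastax.oss.driver.api.core.cql.Row;

    public class CassandraCrud {
      public static void main(String[] args) {
        // Contact point, datacenter, and keyspace are placeholders for the class cluster
        try (CqlSession session = CqlSession.builder()
            .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
            .withLocalDatacenter("datacenter1")
            .withKeyspace("demo")
            .build()) {

          // Create
          session.execute("INSERT INTO users (user_id, name) VALUES (?, ?)", "u1", "Alice");

          // Read
          ResultSet rs = session.execute("SELECT name FROM users WHERE user_id = ?", "u1");
          Row row = rs.one();
          System.out.println(row == null ? "not found" : row.getString("name"));

          // Update (in Cassandra an UPDATE is an upsert)
          session.execute("UPDATE users SET name = ? WHERE user_id = ?", "Alicia", "u1");

          // Delete
          session.execute("DELETE FROM users WHERE user_id = ?", "u1");

          // Asynchronous query: returns a CompletionStage instead of blocking
          CompletionStage<AsyncResultSet> future =
              session.executeAsync("SELECT name FROM users WHERE user_id = ?", "u1");
          future.thenAccept(ars -> System.out.println("async rows: " + ars.remaining()))
                .toCompletableFuture().join();  // wait so the demo finishes before closing
        }
      }
    }
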

    • Data Modeling – Part 1
      introduction to CQL
      CQL data types
      creating keyspaces and tables
      choosing columns and types
      choosing primary keys
      data layout for rows and columns
      time to live (TTL); CREATE, INSERT, UPDATE
      querying with CQL
      CQL updates
      labs

    • Data Modeling – Part 2
      creating and using secondary indexes
      denormalization and join avoidance
      composite keys (partition keys and clustering keys)
      time-series data
      best practices for time-series data (illustrated in the sketch below)
      counters
      lightweight transactions (LWT)
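
To make the composite-key and time-series ideas concrete, here is a minimal sketch (again in the DataStax Java driver 4.x style) of a hypothetical sensor_readings table and a single-partition range query; all names and the TTL value are illustrative.

    import java.net.InetSocketAddress;
    import java.time.Instant;

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.Row;

    public class SensorTimeSeries {
      public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
            .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
            .withLocalDatacenter("datacenter1")
            .withKeyspace("demo")
            .build()) {

          // Composite key: sensor_id is the partition key (keeps one sensor's
          // data together); reading_time is the clustering key (sorts rows
          // newest-first within the partition) -- a common time-series layout
          session.execute(
              "CREATE TABLE IF NOT EXISTS sensor_readings ("
              + " sensor_id text,"
              + " reading_time timestamp,"
              + " value double,"
              + " PRIMARY KEY (sensor_id, reading_time)"
              + ") WITH CLUSTERING ORDER BY (reading_time DESC)");

          // TTL makes old readings expire automatically (here: one day)
          session.execute(
              "INSERT INTO sensor_readings (sensor_id, reading_time, value)"
              + " VALUES (?, ?, ?) USING TTL 86400",
              "s-42", Instant.now(), 21.5);

          // Range query within a single partition: cheap and idiomatic
          for (Row row : session.execute(
              "SELECT reading_time, value FROM sensor_readings"
              + " WHERE sensor_id = ? AND reading_time > ?",
              "s-42", Instant.now().minusSeconds(3600))) {
            System.out.println(row.getInstant("reading_time") + " -> " + row.getDouble("value"));
          }
        }
      }
    }
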

    • Data Modeling Labs: Group Design Sessions
      multiple use cases from various domains are presented
      students work in groups to come up with designs and models
      groups discuss the designs and analyze the decisions
      lab: implement ‘Netflix’ data models and generate data
