Machine Learning with Spark – 4 Days

Course Description

Spark is a new and very popular Big Data processing engine. Spark MLlib is a de facto standard for machine learning in Big Data.

This course is intended for data scientists and software engineers. It balances theory and practice. For each machine learning concept, we first discuss its foundations, applicability, and limitations. Then we explain its implementation and use, along with specific use cases. The course is about 50% lecture and 50% lab work.

What You Will Learn

  • attain a thorough understanding of popular machine learning algorithms, their applicability, and their limitations
  • practice the application of these methods in the Spark machine learning environment
  • achieve clarity in the real-world use of machine learning through practical use cases that illustrate each method

Intended Audience

Data Scientists and Software Engineers

Lab Environment

A working Spark environment will be provided for students. Students will need only an SSH client and a browser.

Zero Install: There is no need to install software on students’ machines.

Prerequisites

  • familiarity with programming in at least one language
  • ability to navigate the Linux command line
  • basic knowledge of command-line Linux editors (vi / nano)

Outline

Section 1: Introductions and overviews

  • Machine learning: goals, results, supervised/unsupervised
  • Spark as a tool for Big Data
  • Scala as the language of Spark (together with Python, Java and R)
    If students do not have the Spark/Scala prerequisites, this section includes a thorough introduction to both.

Section 2: SVM (Support Vector Machines)

  • Theory
  • Lab
  • Use case: anomaly detection

Section 3: Logistic Regression

  • Theory
  • Lab
  • Use case: healthcare prediction

Section 4: Linear Regression

  • Theory
  • Lab
  • Use case: financial modelling

Section 5: Naive Bayes

  • Theory
  • Lab
  • Use case: spam filtering

Section 6: Decision Trees

  • Theory
  • Lab
  • Use case: vessel shipment planning

Section 7: Clustering (K-Means)

  • Theory
  • Lab
  • Use case: topic grouping

Section 8: LDA (Latent Dirichlet Allocation)

  • Theory
  • Lab
  • Use case: unsupervised topic discovery

Section 9: Principal Component Analysis (PCA)

  • Theory
  • Lab
  • Use case: stock analysis

Section 10: Recommendation (Collaborative filtering)

  • Theory
  • Lab
  • Use case: dating

Section 11: Graphs – graph operations

  • Theory
  • Lab
  • Use case: finding followers

Section 12: Graphs – optimizations with Pregel

  • Theory
  • Lab
  • Use case: shortest routes, PageRank
