Developer Training for Apache Spark


OBJECTIVES

Developer Training for Apache Spark prepares you to analyze and solve real-world problems using Apache Spark and associated tools in the enterprise data stack. Through instructor-led training and interactive hands-on exercises, you will work through the entire process of designing and building scalable solutions: ingesting data, running SQL queries on large-scale datasets, building scalable parallel ETL applications, analyzing data, and training machine learning models on massive datasets, all using the Spark technology stack.

PREREQUISITES

This course is best suited to developers, engineers, data scientists and analysts. No prior knowledge of Hadoop or Spark is required, but participants should be familiar with software development in Java, Scala or Python, or with SQL. Participants are expected to bring their own laptop to the class; everything else needed for the training is provided.



COURSE CONTENT

Fundamentals of Spark for Distributed Computing

  • Challenges in distributed computing and how Spark addresses them
  • Functional programming in Spark
  • Interactive data analysis in the Spark shell and Spark notebooks
  • Spark concepts and applications
  • Spark on cluster: HDFS, YARN, Spark History Server
  • Resilient Distributed Datasets (RDDs)
  • Intro to Spark DataFrames and Spark SQL

ETL with Spark

  • Programming with RDDs: Transformations and Actions
  • Lazy Evaluation, Memory and Persistence
  • Data Sources
    • HDFS, local filesystem, Avro, Parquet data formats
    • Ingesting Data from External Sources with Apache Flume
    • Ingesting Data from Relational Databases with Apache Sqoop
  • Understanding Data Locality, Shuffle, DAG Scheduler
  • Choosing data storage formats for different data usage patterns

Data Science on Spark

  • Ad hoc analysis with Spark SQL
  • DataFrames and Datasets
  • Scalable analysis with SparkR
  • Training and running ML models using MLlib
  • Graph processing in GraphX
  • Best practices for building analytical models on Spark

High-Performance Processing in Spark

  • Performance tuning, RDD caching, data partitioning
  • Optimizing joins, using broadcasts
  • Common patterns in Spark data processing
  • Debugging and optimizing Spark applications
  • Spark Streaming
  • Troubleshooting Spark applications


TUTOR

Instructor: Vladimir Smida

Vladimir Smida is a Big Data Engineer with strong analytical skills that span from statistics to Machine Learning and Data Science. Over the years Vlad has architected and developed enterprise-ready production systems based on Apache Hadoop and Apache Spark, in-memory real-time trading agents, and NoSQL solutions. He has worked as a big data consultant for one of the biggest IT companies in the world, and his experience ranges from proofs of concept to large Hadoop clusters of 600+ nodes. Outside working hours, Vlad runs BigDataDenmark.dk, the biggest community of data scientists in Scandinavia.