Apply for this course

Topic outline

  • General

    https://goo.gl/forms/5GReyJyGwLUPpzXx2

    OBJECTIVES

    The one-day Developer Training for Apache Spark gives you the essentials for processing and analyzing massive datasets using distributed paradigms on Apache Spark. Through instructor-led training and interactive hands-on exercises, you will learn to design and build applications optimized for high performance that scale linearly with increasing data volume and velocity. The training also covers data science topics such as data exploration with Spark SQL and machine learning with Spark MLlib.

    PREREQUISITES

    This course is best suited to developers, engineers, data scientists and analysts. No prior knowledge of Hadoop or Spark is required, but participants should be familiar with software development in Java, Scala or Python, or with SQL. Participants are expected to bring a computer with the required software preinstalled (see software requirements); everything else needed for the training will be provided (information sources, virtual images, code templates, etc.). Participants will also get access to all materials used during the training, which can still be used afterwards (personal use only).


    COURSE CONTENT

    Distributed Computing Fundamentals and Spark

    • Challenges in distributed computing and how Spark addresses them
    • Spark core concepts and applications
    • Spark on cluster: HDFS, YARN, Spark History Server
    • Resilient Distributed Datasets (RDDs)

    ETL with Spark

    • Programming with RDDs: Transformations and Actions
    • Lazy Evaluation, Memory and Persistence
    • Data Sources and Data Formats
    • Understanding Data Locality, Data Shuffle, DAG Scheduler

    Data Science on Spark

    • Interactive data analysis in Spark Shell and Spark Notebooks
    • Ad hoc analysis with Spark SQL
    • DataFrames and Datasets
    • Training and running ML models using MLlib

    Day-to-Day Spark

    • Debugging Spark applications
    • RDD caching, data partitioning
    • Performance tuning: data shuffle, broadcast variables
    • Debugging and optimizing Spark applications
    • Troubleshooting Spark applications

    TUTOR

    Instructor: Vladimir Smida

    Vladimir Smida is a Big Data Engineer with strong analytical skills that span from statistics to machine learning and data science. Over the years, Vlad has architected and developed enterprise-ready production systems based on Apache Hadoop and Apache Spark, in-memory real-time trading agents, and NoSQL solutions. He worked as a big data consultant for one of the biggest IT companies in the world, and his experience ranges from proofs of concept to large (600+ node) Hadoop clusters. Outside working hours, Vlad runs BigDataDenmark.dk, the biggest community of data scientists in Scandinavia.

      What to bring

      You are only required to bring a personal computer. In class we may work in a virtual environment, so make sure your computer has:

      • RAM: 16 GB recommended, 8 GB minimum
      • Disk: at least 20 GB of available space
      • Ethernet and Wi-Fi interfaces

      Please make sure you have the necessary administrative rights on your computer to install software and enable virtualization in the BIOS.

      • Course materials

        Note: the following materials are visible only to course participants.
        To access the course materials, enroll first.

      • Links

        • Before you leave...