End-To-End Pipeline Training

Scala By the Bay and Big Data Scala are joined by a day (8/16) of unique Complete End-to-End Data Pipeline Training, in which hundreds of engineers will build, in one day, a complete analytics startup with:

  • Mesos (Mesosphere DCOS) as a platform by Mesosphere
  • Akka-based API by Typesafe
  • Kafka message bus by Confluent
  • Spark streaming by Databricks
  • Cassandra for persistence by Datastax
  • Spark Notebook by Andy Petrella

Overview

You can get an API in a box (Play), a message bus in a box (Kafka), and even Big Data ML in a box (Spark with MLlib). Why can’t you get the whole thing in a box, e.g. Twitter in a box or Pinterest in a box? Big Fast Data in a Box would be a project you could take off the shelf to build yourself a Twitter, where millions of users interact with your app via an API from smartphones, web clients, etc., and you analyze their behavior in real time, running streaming and batch jobs to uncover and match patterns and to generate reactive actions and notifications. While there is no single off-the-shelf project right now (although framework.foundation is working on one), there is an already established way to put together several OSS components to create data pipelines.

There’s a set of best practices for building data pipelines, well represented in the talks at Big Data Scala (bigdatascala.bythebay.io). In a typical user-facing application such as Twitter or Pinterest, millions of clients access an API that sends data for processing and real-time analytics, with possible follow-up actions, and batch jobs then run on the data as it is persisted. One such pipeline, becoming fairly common, is outlined above. Since the number of clients varies during the day and seasonally, and the computations over the data vary as well, scalable infrastructure is required to accommodate peak demand (such as morning auctions for some online retailers). To allocate resources efficiently, we need to treat the data center as a single computer. We’ll use Mesos to run our stack.

In this full-day training, we’ll share a Git repository with a unified application incorporating all of the components above, and use it to build the whole system in one day.

Goals

  • Official training by the parent company for each step
  • Interoperable by design, developed by the five companies together
  • Co-developed with recognized experts in overall data pipelines
  • Have students build a single application composed of these technologies
  • Students learn each individual step and also their interoperation, at a high level

Steps: Five Key Components from Five Key Companies

Each step will provide a concise and intensive training on a key component of the data pipeline, and will take 90 minutes.

Mesos / Mesosphere

The training will start with the Mesosphere presentation and deployment tutorial. Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively. Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments. Mesos provides scalability to 10,000s of nodes, fault-tolerant replicated master and slaves using ZooKeeper, support for Docker containers, native isolation between tasks with Linux containers, multi-resource scheduling (memory, CPU, disk, and ports), Java, Python and C++ APIs for developing new parallel applications, and more.

Most websites today are distributed systems, made up of app servers, databases, and caches. Managing distributed systems is tricky, and writing distributed systems is even trickier. Mesosphere's mission is to make building and running these systems as easy as building or running an app on your smartphone.

Akka / Typesafe

Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM using actors. Actors are very lightweight concurrent entities. They process messages asynchronously using an event-driven receive loop. Pattern matching against messages is a convenient way to express an actor's behavior. Actors raise the abstraction level and make it much easier to write, test, understand and maintain concurrent and/or distributed systems. You focus on workflow (how the messages flow in the system) instead of low-level primitives like threads, locks and socket I/O. Akka-HTTP (formerly Spray) is one of the fastest REST API platforms in existence.
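
The receive-loop idea described above can be sketched in plain Scala, without any Akka dependency. This is only an illustration of the shape of the idea and not the Akka API: the names CounterActor, tell, processAll, and the three message types are hypothetical, and the mailbox is drained synchronously on the calling thread, whereas a real actor runtime processes messages asynchronously.

```scala
import scala.collection.mutable

// Messages the hypothetical counter understands, modeled as a sealed ADT
// so the compiler can check that the pattern match is exhaustive.
sealed trait Message
case object Increment extends Message
final case class Add(n: Int) extends Message
case object Reset extends Message

// A toy "actor": private state plus a mailbox drained one message at a time.
// NOT the Akka API; synchronous and single-threaded for simplicity.
final class CounterActor {
  private val mailbox = mutable.Queue.empty[Message]
  private var count = 0

  // Senders only enqueue messages; they never touch `count` directly.
  def tell(msg: Message): Unit = mailbox.enqueue(msg)

  // The actor's behavior, expressed as a pattern match over messages.
  private def receive(msg: Message): Unit = msg match {
    case Increment => count += 1
    case Add(n)    => count += n
    case Reset     => count = 0
  }

  // Drain the mailbox in arrival order and return the resulting state.
  def processAll(): Int = {
    while (mailbox.nonEmpty) receive(mailbox.dequeue())
    count
  }
}

object ActorSketch {
  def main(args: Array[String]): Unit = {
    val counter = new CounterActor
    Seq[Message](Increment, Add(4), Reset, Add(2), Increment).foreach(counter.tell)
    println(counter.processAll()) // prints 3
  }
}
```

Because state changes only inside receive, and messages are handled one at a time, no locks are needed around count. That is the property the actor model relies on.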

Typesafe is dedicated to helping developers build Reactive applications on the JVM. With the Typesafe Reactive Platform, including Play Framework, Akka, and Scala, developers can deliver highly responsive user experiences backed by a resilient and message-driven application stack that scales effortlessly on multicore and cloud computing architectures.

Kafka / Confluent

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact. Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
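
The partitioned commit log described above can be modeled in a few lines of plain Scala. This is a minimal in-memory sketch under stated assumptions, not the Kafka client API: TopicSketch, Record, partitionFor, append, and read are hypothetical names, String.hashCode is a stand-in for a real partitioner, and there is no persistence or replication.

```scala
import scala.collection.mutable.ArrayBuffer

// One record in the log; field names are illustrative, not Kafka's classes.
final case class Record(key: String, value: String)

// A toy topic: a fixed number of append-only partitions held in memory.
final class TopicSketch(numPartitions: Int) {
  private val partitions =
    Vector.fill(numPartitions)(ArrayBuffer.empty[Record])

  // Records with the same key always land in the same partition,
  // which preserves ordering per key.
  def partitionFor(key: String): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Append to the end of the chosen partition; returns (partition, offset).
  def append(key: String, value: String): (Int, Int) = {
    val p = partitionFor(key)
    partitions(p) += Record(key, value)
    (p, partitions(p).size - 1)
  }

  // Consumers read forward from an offset they track themselves;
  // reading never removes records from the log.
  def read(partition: Int, fromOffset: Int): Seq[Record] =
    partitions(partition).drop(fromOffset).toList
}

object LogSketch {
  def main(args: Array[String]): Unit = {
    val topic = new TopicSketch(numPartitions = 3)
    topic.append("user-1", "login")
    topic.append("user-2", "login")
    val (p, offset) = topic.append("user-1", "click")
    // Both "user-1" events sit in partition p, in the order they were written.
    println(topic.read(p, 0).filter(_.key == "user-1").map(_.value))
    println(s"partition=$p offset=$offset")
  }
}
```

Splitting the stream by key is what lets a log grow beyond one machine while still giving each consumer an ordered view of the keys it owns.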

Confluent uses Kafka as a hub to sync data between all types of systems, from batch systems that load data infrequently to realtime systems that require low-latency access. Confluent makes this easy, reliable, secure, and auditable.

Spark / Databricks

Apache Spark is an in-memory cluster computing system written in Scala. It provides a Map/Reduce API almost identical to the Scala Collections API, enabling parallel programming in the large. Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. It can access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
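
To make the comparison with the Scala Collections API concrete, here is a word count written with ordinary Scala collections only. It is a sketch and does not use Spark: WordCountSketch and wordCount are hypothetical names, and the data lives in a local Seq rather than a distributed dataset.

```scala
object WordCountSketch {
  // Count words with plain Scala collections:
  // flatMap, map, then a reduce per key.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))      // split each line into words
      .filter(_.nonEmpty)            // drop empty tokens from leading spaces
      .map(word => (word, 1))        // "map" phase: emit (word, 1)
      .groupBy(_._1)                 // group the pairs by word
      .map { case (word, pairs) =>   // "reduce" phase: sum counts per word
        word -> pairs.map(_._2).sum
      }

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to see or not to see")
    // Sort by word so the printed order is stable.
    wordCount(lines).toSeq.sortBy(_._1).foreach(println)
  }
}
```

On a Spark RDD the analogous chain is commonly written with flatMap, map, and reduceByKey. That version is not shown here because it needs a Spark dependency and a running context.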

Databricks was founded out of the UC Berkeley AMPLab by the creators of Apache Spark. It has been working for the past six years on cutting-edge systems to extract value from Big Data. It believes that Big Data is a huge opportunity that is still largely untapped, and is working to revolutionize what you can do with it.

Cassandra / DataStax

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages. Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

DataStax is the industry leader in developing solutions based on commercially supported, enterprise-ready Apache Cassandra, the open source NoSQL database technology widely acknowledged as the best foundation for tackling the most challenging big data problems.