Notes from Hadoop Summit 2016

This was my first time attending a Hadoop Summit (thanks to my employer Yahoo for sponsoring!). I had a very positive impression of the event. Talks covered a range of topics and speakers were available for questions and hallway discussions. There were many vendors showcasing their wares as well.

I mostly attended talks related to streaming / real time systems as that’s directly related to the kind of work I do.

Vendor stalls

  • There was a lot of selling of Machine Learning based analytics tools and dashboards.
  • There were quite a few companies selling cloud based ways to get started using big data tools without having to buy on premise infrastructure and hire associated personnel. Good ones: Qubole, Pentaho
  • There were a few tools for metadata management and discovery. Good one was Waterline Data
  • MapR’s free ebooks are awesome (Hadoop Buyer’s Guide, Streaming Architecture etc —

The vendors stalls brought out the heavy competition among data processing tools that can overwhelm anyone new to the space (see Hadoop Buyer’s Guide mentioned above). There was a talk called “The data ecosystem is too damn big” by Andrew Brust of Datameer that I didn’t attend, but heard people passionately discussing the points made therein. This interview at the conference with the CEO of a new data integration company echoes similar sentiments:

Interesting talks (Chronological)

All the talks are listed here (with a brief description when you click on its name), including links to video and slides: (unfortunately there’s no direct link to each talk).

There is a YouTube playlist as well:

Day 1

  1. End-to-End Processing of 3.7 Million Telemetry Events per Second Using Lambda Architecture (Symantec and Hortonworks)
    This talk was a run through of a huge number of config parameters that these companies tuned to get a system that could process their workload. Apparently it took them 6 months, 20 people from Symantec and many more from Hortonworks to accomplish this. It was a comprehensive ETL pipeline with every framework and querying engine you can think of having a place somewhere in it. Definitely worth going back and looking at again.
  2. File Format Benchmark — Avro, JSON, ORC, and Parquet (Owen O’Malley of Hortonworks)
    This one was a rundown of a bunch of benchmarks the speaker performed across the 4 file formats. It was interesting for the quick background he provided about how these various formats came into being. Two of the data sets he used are publicly available and worth looking at: the NY city taxi rides data set, and the Github activity log data set.
    A brief summary of the results:
    – Compression matters. Apparently, JSON is really bad at getting compressed. Avro puts markers after every 16K. This affects compression window for zlib.
    – His recommendation was to use ORC with zlib compression by default; Avro is good for complex tables with common strings, when used with Snappy compression.
  3. Netflix: Spark on Yarn for ETL at Petabyte scale
    Netflix has decided to let ML people use Spark for their analysis and thus needed to support it as part of their data platform. This talk documented the struggles to get Spark working with Yarn, S3 and Parquet, since those are the other technologies used inside of Netflix.
    – Some basic numbers: They process 700 billion events per day, which comes to 3 Petabytes per day. Their warehouse has 40 Petabytes. Their Yarn cluster is 3000 EC2 nodes on 2 clusters (d2.4xlarge).
     — The talk mentioned all the relevant JIRA’s they filed with various projects as they went through their issues. I will mention a few of them: Coalesce() for RDD based on file size, optimization of UnionRDD for listing parent RDD’s, getting HadoopOutputCommitter to work well with S3, poor performance of broadcast variable seed performance, incorrect locality optimization on S3, Fixing Spark’s job history server, switching to G1 GC… and a few more
    – They mentioned that Spark is 2–3 times faster than Pig. They also built some IntelliJ IDE plugin and other tooling to help their users work with Spark

Day 2

  1. Building Large-Scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka (Jun Rao of Confluent)
    A brilliant talk by one of the authors of Kafka. He gave a systematic overview of issues to keep in mind when your data has to be available in more than one datacenter (in a way that is transparent to producers and/or consumers), and approaches to handle the same.
    The 3 techniques he discussed are: Stretched Cluster, Active/passive and Active/active. For each of them he suggested what to do when the primary DC goes down and when it comes back up. A main deficit of Kafka is that offsets are not preserved across DC’s when replication happens. He proposed to overcome this by adding timestamp based querying of a Kafka topic to get the matching offset (Kafka Improvement proposal — 33)
  2. Real-Time Hadoop: Keys for Success from Streams to Queries (Ted Dunning of Mapr Technologies)
    An opinionated talk from a long timer in the field. Had a lot of thought provoking points but I wish he had elaborated more. His book “Streaming Architecture” is a good read.
     — Microservices go well with streaming applications. It should be trivial to create and setup of streams and associated services.
     — Each microservice should ship with its own private datastore. He gave an analogy that in the olden days you needed sysadmin approval for your app to create a file on disk, and now that’s no longer the case.
     — Kafka as a centralized message bus and microservices for reading and writing to it, with private datastores as needed, is the best architecture. We can live with lack of strict consistency between these private datastores.
     — Use self-describing schemas that can evolve over time
     — Have common API’s that let you talk to multiple systems (for example Kafka Streams). The Linux/Posix API’s gave you lot of functionality and compatibility. But Hadoop has sacrificed that in the name of scalability.
     — This was followed by a pitch for MapR, how it’s compatible with Kafka Streams API. Its C++ implementation helps with perf.
     — API’s matter more than implementation. Building proprietary systems helps you innovate ahead of the community.
  3. The Next Generation of Data Processing & OSS (James Malone of Google)
    and also (from Day 3)
     Apache Beam: A Unified Model for Batch and Streaming Data Processing (Davor Bonaci of Google)
    Both the talks covered similar ground, namely the Google Cloud Dataflow API which has been open sourced as Apache Beam. This API seeks to unify stream and batch processing application logic, and let you choose the mode you want during execution, along with whichever “execution engine” you want it to run on.
     — This is the result of Google building several generation of internal systems over the last decade, and noticing common patterns of usage across them.
     — The API aims to decouple the following 4 concerns that are relevant in any data pipeline:
     i) What is being computed?
     ii) Which events in the stream are appropriate for this computation
     iii) When in processing time is this particular computation over this event set need to be materialized?
     iv) How do late coming data relate to already performed computations (they call this Refinements)
     — I believe this API and approach has a lot of potential to raise the level of abstraction w.r.t data pipelines. Check out either of these talks or the Google Cloud Dataflow paper. See the article series “The world beyond batch” published by these same folks.
  4. Lambda-less Stream Processing @ Scale in LinkedIn (Yi Pan and Kartik Paramasivam of LinkedIn)
    This was a good experience report. It was a self-acknowledged implementation of ideas from the Google Cloud Dataflow model mentioned above. They had to implement this model in the Linkedin Newsfeed and Fraud detection on their Ad Serving engines. The hard issues to tackle were handling late events (how they relate to already processed data), and cross DC replication.

Day 3

  1. Cross-DC Fault-Tolerant ViewFileSystem at Twitter (Gera Shegalov and Ming Ma of Twitter)
    I really liked the idea behind this talk: Abstract away the notion of multiple DC’s by providing a transparent URI scheme that works everywhere, i.e, viewfs://path/to/file will work across all of Twitter clusters.
    The implementation required understanding some core concepts of how HDFS works and tweaking them appropriately. Strong consistency on writes was a hard problem (not sure if they solved it). The HDFS config subsystem is too messy and they decided to create a parallel config system or something, to enable their cross DC configs.
  2. Turning the Stream Processor into a Database: Building Online Applications on Streams (Stephan Ewen, data Artisans)
    This was another great talk that showed how small tweaks to Apache Flink let it perform queries too and not just act as a data store. The Scala API used to demo this part of Flink had a strong resemblance to Google Cloud Dataflow API again.
     — To enable querying, the system has to support storing state associated with a query as it runs (Storm does not have it). They do this in Flink by using a local, embedded RocksDB database in each Flink task manager. He criticized the Yahoo Streaming benchmarks saying they actually showed that the bottleneck was the key-value store and not the querying engines.
  3. Ingest and Stream Processing — What Will You Choose? (Anand Iyer of Cloudera and Pat Patterson of StreamSets)
    This one was Cloudera’s opinionated take on all the various frameworks out there, their pros and cons. It will be a useful reference in case you want to get a bird’s eye view of the space. I’m not adding the specifics here as that will turn out to be just a long list of bullet points, tips and tricks.
     — The talk stoked a minor controversy as he mentioned that Cloudera considers Apache Storm to be a legacy technology at this point(!)
     — He criticized the Google Cloud Dataflow model saying programmatic abstractions tend not to last (as opposed to query languages and UI abstractions I guess?)
     — This set up the last part of the talk, which was a brief demo of StreamSets

Miscellaneous thoughts

  • Facebook was notably absent among presenters (except for a Teradata co-talk on Presto)
  • Intuit presented a good experience report on Day 2 (Intuit Analytics Cloud) that I haven’t written about
  • Occasionally, the talks were bait and switch. You went in for apparent insights from Cloudera/Hortonworks, but got pitched an enterprise vendor product midway through. Not a particularly big deal once you got the hang of it.
  • We needed talks on big data and analytics related project management, skilling and staffing. The vendors pitched their products and the engineers the technology. The project management aspect was not addressed.

Leave a Reply

Your email address will not be published. Required fields are marked *