{"id":21,"date":"2017-12-06T22:32:29","date_gmt":"2017-12-06T22:32:29","guid":{"rendered":"http:\/\/pramodbiligirisblog.home.blog\/2017\/12\/06\/notes-from-hadoop-summit-2016\/"},"modified":"2017-12-06T22:32:29","modified_gmt":"2017-12-06T22:32:29","slug":"notes-from-hadoop-summit-2016","status":"publish","type":"post","link":"https:\/\/www.pramodb.com\/index.php\/2017\/12\/06\/notes-from-hadoop-summit-2016\/","title":{"rendered":"Notes from Hadoop Summit 2016"},"content":{"rendered":"<p>This was my first time attending a Hadoop Summit (thanks to my employer Yahoo for sponsoring!). I had a very positive impression of the event. Talks covered a range of topics and speakers were available for questions and hallway discussions. There were many vendors showcasing their wares as well.<\/p>\n<p>I mostly attended talks related to streaming \/ real-time systems, as that\u2019s directly related to the kind of work I do.<\/p>\n<p><strong>Vendor stalls<\/strong><\/p>\n<ul>\n<li>There was a lot of selling of Machine Learning-based analytics tools and dashboards.<\/li>\n<li>There were quite a few companies selling cloud-based ways to get started using big data tools without having to buy on-premises infrastructure and hire associated personnel. Good ones: <em>Qubole, Pentaho<\/em>\n<\/li>\n<li>There were a few tools for metadata management and discovery. A good one was <em>Waterline Data<\/em>\n<\/li>\n<li>MapR\u2019s free ebooks are awesome (<em>Hadoop Buyer\u2019s Guide<\/em>, <em>Streaming Architecture<\/em> etc.\u200a\u2014\u200a<a href=\"https:\/\/www.mapr.com\/ebooks\/\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.mapr.com\/ebooks\/<\/a>)<\/li>\n<\/ul>\n<p>The vendor stalls brought out the heavy competition among data processing tools, which can overwhelm anyone new to the space (see the <em>Hadoop Buyer\u2019s Guide<\/em> mentioned above). 
There was a talk called \u201c<a href=\"https:\/\/www.youtube.com\/watch?v=ZTZDqFZTr9Y\" target=\"_blank\" rel=\"noopener noreferrer\">The data ecosystem is too damn big<\/a>\u201d by Andrew Brust of Datameer that I didn\u2019t attend, but I heard people passionately discussing the points made therein. This interview at the conference with the CEO of a new data integration company echoes similar sentiments: <a href=\"https:\/\/www.youtube.com\/watch?v=o9R7iQqQDF4\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/watch?v=o9R7iQqQDF4<\/a><\/p>\n<p><strong>Interesting talks (Chronological)<\/strong><\/p>\n<p>All the talks are listed here (with a brief description when you click on a talk\u2019s name), including links to video and slides: <a href=\"http:\/\/hadoopsummit.org\/san-jose\/agenda\/\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/hadoopsummit.org\/san-jose\/agenda\/<\/a> (unfortunately there\u2019s no direct link to each talk).<\/p>\n<p>There is a YouTube playlist as well: <a href=\"https:\/\/www.youtube.com\/playlist?list=PLKnYDs_-dq16K1NH83Bke2dGGUO3YKZ5b\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/playlist?list=PLKnYDs_-dq16K1NH83Bke2dGGUO3YKZ5b<\/a><\/p>\n<p><strong>Day 1<\/strong><\/p>\n<ol>\n<li>\n<strong>End-to-End Processing of 3.7 Million Telemetry Events per Second Using Lambda Architecture<\/strong> (<em>Symantec and Hortonworks<\/em>)<br \/>Video: <a href=\"https:\/\/www.youtube.com\/watch?v=3NyTcbIIvxE\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/watch?v=3NyTcbIIvxE<br \/><\/a>Slides: <a href=\"http:\/\/www.slideshare.net\/HadoopSummit\/end-to-end-processing-of-37-million-telemetry-events-per-second-using-lambda-architecture\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.slideshare.net\/HadoopSummit\/end-to-end-processing-of-37-million-telemetry-events-per-second-using-lambda-architecture<\/a>\u00a0<br \/>This talk was a run-through of a 
<em>huge<\/em> number of config parameters that these companies tuned to get a system that could process their workload. Apparently it took them 6 months, 20 people from Symantec, and many more from Hortonworks to accomplish this. It was a comprehensive ETL pipeline in which every framework and querying engine you can think of had a place somewhere. Definitely worth going back and looking at again.<\/li>\n<li>\n<strong>File Format Benchmark\u200a\u2014\u200aAvro, JSON, ORC, and Parquet<\/strong> (<em>Owen O\u2019Malley of Hortonworks<\/em>)<br \/>Video: <a href=\"https:\/\/www.youtube.com\/watch?v=tB28rPTvRiI\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/watch?v=tB28rPTvRiI<br \/><\/a>Slides: <a href=\"http:\/\/www.slideshare.net\/HadoopSummit\/file-format-benchmark-avro-json-orc-parquet\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.slideshare.net\/HadoopSummit\/file-format-benchmark-avro-json-orc-parquet<\/a>\u00a0<br \/>This one was a rundown of a bunch of benchmarks the speaker performed across the 4 file formats. It was interesting for the quick background he provided about how these various formats came into being. Two of the data sets he used are publicly available and worth looking at: the NY city taxi rides data set, and the Github activity log data set.<br \/>A brief summary of the results:<br \/>&#8211; Compression matters. Apparently, JSON is really bad at getting compressed. Avro puts markers after every 16K of data. 
This limits the compression window for zlib.<br \/>&#8211; His recommendation was to use ORC with zlib compression by default; Avro is good for complex tables with common strings, when used with Snappy compression.<\/li>\n<li>\n<strong>Netflix: Spark on Yarn for ETL at Petabyte scale<\/strong><br \/>Video: <a href=\"https:\/\/www.youtube.com\/watch?v=85sew9OFaYc\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/watch?v=85sew9OFaYc<br \/><\/a>Slides: <a href=\"http:\/\/www.slideshare.net\/HadoopSummit\/producing-spark-on-yarn-for-etl\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.slideshare.net\/HadoopSummit\/producing-spark-on-yarn-for-etl<\/a><br \/>Netflix decided to let ML people use Spark for their analysis and thus needed to support it as part of their data platform. This talk documented their struggles to get Spark working with Yarn, S3, and Parquet, since those are the other technologies used inside Netflix.<br \/>&#8211; Some basic numbers: They process 700 billion events per day, which comes to 3 Petabytes per day. Their warehouse has 40 Petabytes. Their Yarn deployment is 3000 EC2 nodes (d2.4xlarge) across 2 clusters.<br \/>\u200a\u2014\u200aThe talk mentioned all the relevant JIRA\u2019s they filed with various projects as they went through their issues. I will mention a few of them: Coalesce() for RDD based on file size, optimization of UnionRDD for listing parent RDD\u2019s, getting HadoopOutputCommitter to work well with S3, poor broadcast variable seeding performance, incorrect locality optimization on S3, fixing Spark\u2019s job history server, switching to G1 GC\u2026 and a few more<br \/>&#8211; They mentioned that Spark is 2\u20133 times faster than Pig. 
They also built an IntelliJ IDE plugin and other tooling to help their users work with Spark<\/li>\n<\/ol>\n<p><strong>Day 2<\/strong><\/p>\n<ol>\n<li>\n<strong>Building Large-Scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka<\/strong> (<em>Jun Rao of Confluent<\/em>)<br \/>Video: <a href=\"https:\/\/www.youtube.com\/watch?v=Dvk0cwqGgws\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/watch?v=Dvk0cwqGgws<br \/><\/a>Slides: <a href=\"http:\/\/www.slideshare.net\/HadoopSummit\/building-largescale-stream-infrastructures-across-multiple-data-centers-with-apache-kafka\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.slideshare.net\/HadoopSummit\/building-largescale-stream-infrastructures-across-multiple-data-centers-with-apache-kafka<\/a><br \/>A brilliant talk by one of the authors of Kafka. He gave a systematic overview of issues to keep in mind when your data has to be available in more than one datacenter (in a way that is transparent to producers and\/or consumers), and approaches to handle them.<br \/>The 3 techniques he discussed are: Stretched Cluster, Active\/passive and Active\/active. For each of them he suggested what to do when the primary DC goes down and when it comes back up. A major shortcoming of Kafka is that offsets are not preserved across DC\u2019s when replication happens. 
He proposed to overcome this by adding timestamp-based querying of a Kafka topic to get the matching offset (Kafka Improvement Proposal 33)<\/li>\n<li>\n<strong>Real-Time Hadoop: Keys for Success from Streams to Queries<\/strong> (<em>Ted Dunning of MapR Technologies<\/em>)<br \/>Video: <a href=\"https:\/\/www.youtube.com\/watch?v=M1yFgYQ4r6I\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/watch?v=M1yFgYQ4r6I<br \/><\/a>Slides: <a href=\"http:\/\/www.slideshare.net\/HadoopSummit\/realtime-hadoop-the-ideal-messaging-system-for-hadoop\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.slideshare.net\/HadoopSummit\/realtime-hadoop-the-ideal-messaging-system-for-hadoop<\/a><br \/>An opinionated talk from a long-timer in the field. It had a lot of thought-provoking points, but I wish he had elaborated more. His book \u201cStreaming Architecture\u201d is a good read.<br \/>\u200a\u2014\u200aMicroservices go well with streaming applications. It should be trivial to create and set up streams and associated services.<br \/>\u200a\u2014\u200aEach microservice should ship with its own private datastore. He gave an analogy that in the olden days you needed sysadmin approval for your app to create a file on disk, and now that\u2019s no longer the case.<br \/>\u200a\u2014\u200aKafka as a centralized message bus and microservices for reading and writing to it, with private datastores as needed, is the best architecture. We can live with a lack of strict consistency between these private datastores.<br \/>\u200a\u2014\u200aUse self-describing schemas that can evolve over time<br \/>\u200a\u2014\u200aHave common API\u2019s that let you talk to multiple systems (for example Kafka Streams). The Linux\/Posix API\u2019s gave you a lot of functionality and compatibility. But Hadoop has sacrificed that in the name of scalability.<br \/>\u200a\u2014\u200aThis was followed by a pitch for MapR and how it\u2019s compatible with the Kafka Streams API. 
Its C++ implementation helps with performance.<br \/>\u200a\u2014\u200aAPI\u2019s matter more than implementation. Building proprietary systems helps you innovate ahead of the community.<\/li>\n<li>\n<strong>The Next Generation of Data Processing &amp; OSS<\/strong> (<em>James Malone of Google<\/em>)<br \/>Video: <a href=\"https:\/\/www.youtube.com\/watch?v=_mJYI8A0luA\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/watch?v=_mJYI8A0luA<br \/><\/a>Slides: <a href=\"http:\/\/www.slideshare.net\/HadoopSummit\/the-next-generation-of-data-processing-and-open-source\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.slideshare.net\/HadoopSummit\/the-next-generation-of-data-processing-and-open-source<br \/><\/a> and also (from Day 3)<br \/>\u00a0<strong>Apache Beam: A Unified Model for Batch and Streaming Data Processing<\/strong> (<em>Davor Bonaci of Google<\/em>)<br \/>Video: <a href=\"https:\/\/www.youtube.com\/watch?v=7DZ8ONmeP5A\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/watch?v=7DZ8ONmeP5A<br \/><\/a>Slides: <a href=\"http:\/\/www.slideshare.net\/HadoopSummit\/apache-beam-a-unified-model-for-batch-and-stream-processing-data\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.slideshare.net\/HadoopSummit\/apache-beam-a-unified-model-for-batch-and-stream-processing-data<\/a><br \/>Both talks covered similar ground, namely the Google Cloud Dataflow API, which has been open sourced as Apache Beam. 
This API seeks to unify stream and batch processing application logic, and lets you choose the mode you want during execution, along with whichever \u201cexecution engine\u201d you want it to run on.<br \/>\u200a\u2014\u200aThis is the result of Google building several generations of internal systems over the last decade, and noticing common patterns of usage across them.<br \/>\u200a\u2014\u200aThe API aims to decouple the following 4 concerns that are relevant in any data pipeline:<br \/>\u00a0i) What is being computed?<br \/>\u00a0ii) Which events in the stream are appropriate for this computation?<br \/>\u00a0iii) When in processing time does this particular computation over this event set need to be materialized?<br \/>\u00a0iv) How does late-arriving data relate to already performed computations? (they call this <em>Refinements<\/em>)<br \/>\u200a\u2014\u200aI believe this API and approach have a lot of potential to raise the level of abstraction w.r.t. data pipelines. Check out either of these talks or the Google Cloud Dataflow paper. Also see the article series \u201cThe world beyond batch\u201d published by these same folks.<\/li>\n<li>\n<strong>Lambda-less Stream Processing @ Scale in LinkedIn<\/strong> (<em>Yi Pan and Kartik Paramasivam of LinkedIn<\/em>)<br \/>Video: <a href=\"https:\/\/www.youtube.com\/watch?v=Rrr_dy-Uauc\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/watch?v=Rrr_dy-Uauc<br \/><\/a>Slides: <a href=\"http:\/\/www.slideshare.net\/HadoopSummit\/lambdaless-stream-processing-scale-in-linkedin\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.slideshare.net\/HadoopSummit\/lambdaless-stream-processing-scale-in-linkedin<\/a><br \/>This was a good experience report. It was a self-acknowledged implementation of ideas from the Google Cloud Dataflow model mentioned above, which they applied to the LinkedIn newsfeed and to fraud detection in their ad serving engines. 
The hard issues to tackle were handling late events (how they relate to already processed data) and cross-DC replication.<\/li>\n<\/ol>\n<p><strong>Day 3<\/strong><\/p>\n<ol>\n<li>\n<strong>Cross-DC Fault-Tolerant ViewFileSystem at Twitter<\/strong> (<em>Gera Shegalov and Ming Ma of Twitter<\/em>)<br \/>Video: <a href=\"https:\/\/www.youtube.com\/watch?v=cUgfor9vgIM\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/watch?v=cUgfor9vgIM<br \/><\/a>Slides: <a href=\"http:\/\/www.slideshare.net\/HadoopSummit\/crossdc-faulttolerant-viewfilesystem-twitter\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.slideshare.net\/HadoopSummit\/crossdc-faulttolerant-viewfilesystem-twitter<\/a><br \/>I really liked the idea behind this talk: Abstract away the notion of multiple DC\u2019s by providing a transparent URI scheme that works everywhere, i.e., viewfs:\/\/path\/to\/file will work across all of Twitter\u2019s clusters.<br \/>The implementation required understanding some core concepts of how HDFS works and tweaking them appropriately. Strong consistency on writes was a hard problem (not sure if they solved it). 
The HDFS config subsystem was too messy for this, so they created something like a parallel config system to enable their cross-DC configs.<\/li>\n<li>\n<strong>Turning the Stream Processor into a Database: Building Online Applications on Streams<\/strong> (<em>Stephan Ewen, data Artisans<\/em>)<br \/>Video: <a href=\"https:\/\/www.youtube.com\/watch?v=odpRIZVNadQ\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/watch?v=odpRIZVNadQ<br \/><\/a>Slides: <a href=\"http:\/\/www.slideshare.net\/HadoopSummit\/the-stream-processor-as-a-database-apache-flink\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.slideshare.net\/HadoopSummit\/the-stream-processor-as-a-database-apache-flink<\/a><br \/>This was another great talk. It showed how small tweaks to Apache Flink let it also answer queries over its state, so it can act as a database and not just a stream processor. The Scala API used to demo this part of Flink had a strong resemblance to the Google Cloud Dataflow API again.<br \/>\u200a\u2014\u200aTo enable querying, the system has to support storing state associated with a query as it runs (Storm does not have this). They do this in Flink by using a local, embedded RocksDB database in each Flink task manager. 
He criticized the Yahoo Streaming benchmarks, saying they actually showed that the bottleneck was the key-value store and not the querying engines.<\/li>\n<li>\n<strong>Ingest and Stream Processing\u200a\u2014\u200aWhat Will You Choose?<\/strong> (<em>Anand Iyer of Cloudera and Pat Patterson of StreamSets<\/em>)<br \/>Video: <a href=\"https:\/\/www.youtube.com\/watch?v=_A5ugvLZJNM\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/www.youtube.com\/watch?v=_A5ugvLZJNM<br \/><\/a>Slides: <a href=\"http:\/\/www.slideshare.net\/HadoopSummit\/ingest-and-stream-processing-what-will-you-choose\" target=\"_blank\" rel=\"noopener noreferrer\">http:\/\/www.slideshare.net\/HadoopSummit\/ingest-and-stream-processing-what-will-you-choose<\/a><br \/>This one was Cloudera\u2019s opinionated take on all the various frameworks out there, with their pros and cons. It will be a useful reference in case you want a bird\u2019s-eye view of the space. I\u2019m not adding the specifics here, as that would turn into just a long list of bullet points, tips, and tricks.<br \/>\u200a\u2014\u200aThe talk stoked a minor controversy, as the speaker mentioned that Cloudera considers Apache Storm to be a legacy technology at this point(!)<br \/>\u200a\u2014\u200aHe criticized the Google Cloud Dataflow model, saying that programmatic abstractions tend not to last (as opposed to query languages and UI abstractions, I guess?)<br \/>\u200a\u2014\u200aThis set up the last part of the talk, which was a brief demo of StreamSets<\/li>\n<\/ol>\n<p><strong>Miscellaneous thoughts<\/strong><\/p>\n<ul>\n<li>Facebook was notably absent among presenters (except for a Teradata co-talk on Presto)<\/li>\n<li>Intuit presented a good experience report on Day 2 (Intuit Analytics Cloud) that I haven\u2019t written about<\/li>\n<li>Occasionally, the talks were a bait and switch. You went in for apparent insights from Cloudera\/Hortonworks, but got pitched an enterprise vendor product midway through. 
Not a particularly big deal once you got the hang of it.<\/li>\n<li>The summit needed talks on project management, skilling, and staffing for big data and analytics. The vendors pitched their products and the engineers the technology, but the project management aspect went unaddressed.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This was my first time attending a Hadoop Summit (thanks to my employer Yahoo for sponsoring!). I had a very positive impression of the event. Talks covered a range of topics and speakers were available for questions and hallway discussions. There were many vendors showcasing their wares as well. I mostly attended talks related to [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":""},"categories":[1],"tags":[5,14],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/posts\/21"}],"collection":[{"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/comments?post=21"}],"version-history":[{"count":0,"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/posts\/21\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/media?parent=21"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/categories?post=21"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/tags?post=21"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}