Lessons learnt building data pipelines — 1

It has been a few years since I embarked on what was then trendily called “data engineering”. That title, and the associated work, grew out of a tendency for computing systems to accumulate vastly more data than before and, more importantly, to try to extract business value by sifting through it. Nowadays the bar has moved higher in many organizations, and competitive advantage lies more in advanced analytics than in just being able to schlep data to the right place at the right time. But that has reduced the challenges in data engineering only slightly. Analytics demands continue to push the volume, variety and velocity of data being processed, and keep engineers like me on their toes. So here are some lessons these years have taught me about building data pipelines.

Define clear semantics
Above all, it’s essential to define clear semantics for the execution of jobs in a pipeline, particularly if they process information aggregated from multiple systems and cater to different internal customers. A job within a pipeline usually runs at specific times (or on other triggers) and processes certain windows of data events. Beyond this lie the issues of handling late-arriving events, and handling failures of the job itself. When building one MapReduce-based pipeline at Yahoo, we had a fair amount of discussion on how to separate the two notions of job timestamp and event timestamp within our implementation and, equally importantly, how to convey this separation to downstream consumers. Since we were depositing data onto HDFS for consumption through Hive, we went the route of creating a partition for each of the job and event timestamps. In some analytics use cases you want to process all the recent data you have received; in others you want to examine certain time windows in event space. In the specific case of Hive, having partitions for both the job timestamp and the event timestamp caters to both cases. In a traditional database you could expect to just create indexes and be done with it, but I will discuss in the next paragraph why that didn’t suffice.
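The dual-partition idea can be sketched as a directory layout. This is a minimal illustration assuming Hive-style `key=value` partition directories; the names `job_date` and `event_date` (and the table path) are made up for the example, not the actual schema we used.

```python
from datetime import datetime, timezone

def partition_path(table_root, job_ts, event_ts):
    """Build a Hive-style path partitioned by both job date and event date.

    Consumers who want "everything from the latest run" scan one
    job_date partition; consumers replaying an event-time window
    scan event_date partitions instead.
    """
    return (
        f"{table_root}"
        f"/job_date={job_ts.strftime('%Y-%m-%d')}"
        f"/event_date={event_ts.strftime('%Y-%m-%d')}"
    )

# A late-arriving event: processed by today's run, but belonging
# to yesterday's event-time window.
job_ts = datetime(2017, 3, 2, tzinfo=timezone.utc)
event_ts = datetime(2017, 3, 1, 23, 55, tzinfo=timezone.utc)
print(partition_path("/data/clicks", job_ts, event_ts))
```

Because both keys appear in the path, either kind of consumer can prune partitions without scanning the other dimension.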

Pipeline semantics should include failure handling. Jobs can fail for many reasons: there might be infrastructure outages, or plain old errors in the logic. Keep in mind that your program is regularly (or continuously, if you are streaming) churning through millions of records per minute, so “edge cases” are no longer an afterthought. In an ideal world all your jobs would be idempotent and downstream consumers would magically know when you decided to re-run a job for any reason. In reality you’ll sometimes need to communicate failures out-of-band and ask consumers to load some data afresh. At any rate, it helps to have a mechanism by which consumers can know that a new batch of data has landed in the upstream system and can start picking it up. Going back to the Hive example above, Oozie, the job scheduler widely used at Yahoo, lets you listen for new Hive partitions as they arrive in HDFS. This was also the reason we had a separate partition for the event-related fields instead of relying on indexes alone.
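The consumer side of “know when a new batch has landed” can be sketched with the common Hadoop convention of a `_SUCCESS` marker file, which writers drop into a partition directory only after the batch is complete. This is a simplified stand-in for what Oozie’s dataset triggers do for you; the function below just walks a filesystem tree.

```python
import os

def completed_partitions(table_root):
    """Return partition directories that contain a _SUCCESS marker.

    The marker is only written after a batch fully lands, so a
    consumer polling this list never reads half-written data; a
    re-run job can delete the marker, rewrite, and re-create it.
    """
    done = []
    for dirpath, _dirnames, filenames in os.walk(table_root):
        if "_SUCCESS" in filenames:
            done.append(dirpath)
    return sorted(done)
```

A downstream job would diff successive calls to this function (or track a high-water mark) to pick up only newly completed partitions.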

For a thorough discussion of the semantics of data pipelines, I found this paper by Google to be a great resource: “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing”.

Expect failures and operational issues
Since job failures are so common and expensive, here’s a small list intended to motivate the rest of the discussion:

  • Failing because you didn’t do sufficient testing at scale. I once deployed a significant feature that failed instantly in production because I had not inspected all the necessary graphs in the staging environment.
  • Errors in one part of the system having an outsized impact on a totally different part. I’ve seen this in a multi-tenant Kafka cluster, for example, where topics are sized independently and cater to entirely different customers.
  • A production scramble attributable to Tony Hoare’s famous billion-dollar mistake: the null reference.
  • Integration points with other services and platforms, especially during or after an upgrade on their side. Anything from a protocol mismatch to a different type of compression, security settings, and so on. For example, a bug in an iptables config rollout once brought down most of our Kafka cluster.

I could cite many others, but it’s more useful to examine how such experiences inform your overall approach to building systems than to dwell on individual incidents. That brings me to the topic of operational staff. There seems to be no clear consensus on who should perform these duties and what skills they should possess (see SRE vs Devops: A false distinction?), but personally I have found it useful to have a couple of people whose job it is to reduce the operational burden on application developers. When things go south, or when routine blockers arise, people with deep knowledge of how systems interface with each other, up and down the computing stack, will get things back on track faster than specialists who mainly work inside individual system components.

But do ensure that these folks are also capable of writing high-quality software and are empowered by management to do so. You could say I’m partial to Google’s SRE model and its approach to eliminating toil. What will differ from the Googles of the world is your ability to use off-the-shelf software to satisfy a larger chunk of your operational needs.

Focus on software quality
When problems happen, do a root-cause analysis and attempt a resolution. But it’s quite possible to paper over design flaws with operational expense for a while. Repeated patches to the same piece of software can indicate a fundamental problem with it, or with its fit for the current use case, that needs a longer-term, different kind of effort to resolve. If your software is not failing for obscure reasons (as in this epic investigation by PagerDuty) but needs to be regularly brought back into shape through trivial fixes, you likely have a quality problem. Whenever you are doing innovative work it’s tempting to assume that the difficulty or novelty of the problem you are solving, and the impact your software will have, excuse you from old-fashioned software engineering practices. But you, or whoever is saddled with maintaining it, will have to pay those dues later. With that in mind, here are some ways to improve software quality within data pipelines specifically:

  • Insulate business logic from the details of wire formats and marshalling/unmarshalling. Libraries like Thrift, Avro and Protobuf provide language-specific, type-safe bindings, and include versioning of data packets as well! Leverage them to the fullest and avoid “stringly typed” code.
  • Try to hold your deployment and operational code to the same standard as your core business logic. This is harder to pull off, but if you are a developer who just lost an hour debugging a trivial issue caused by some quick “script” that actually contains many functions and conditionals, you will struggle to see what that script sped up in the first place.
  • Think about your testing strategy: are you going to embed some sample data in your codebase, and/or keep a full-scale replica (or some sampling) of production, and/or run your code against all of yesterday’s data to catch regressions, and so on? This aspect of data pipelines is quite different from testing application programs, and much depends on the specific technologies you are using and the kind of operational and management budget you get for testing. But do not assume that developers can just mock, fake and stub their way into building robust pipelines!
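The first point above, insulating business logic from wire formats, can be sketched as follows. The typed record here stands in for what a Thrift/Avro/Protobuf binding would generate for you; the `PageView` schema and field names are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class PageView:
    """A typed record, standing in for a generated Thrift/Avro/Protobuf
    binding. Business logic never touches raw dicts or strings."""
    user_id: int
    url: str
    ts: datetime

def parse_page_view(raw: dict) -> PageView:
    # Marshalling details live here, in one place at the edge.
    return PageView(
        user_id=int(raw["user_id"]),
        url=raw["url"],
        ts=datetime.fromtimestamp(int(raw["ts"]), tz=timezone.utc),
    )

def is_recent(view: PageView, cutoff: datetime) -> bool:
    # Pure business logic: typed fields, no string parsing.
    return view.ts >= cutoff

view = parse_page_view({"user_id": "42", "url": "/home", "ts": "1488412800"})
```

If the wire format changes, only `parse_page_view` needs to change; everything downstream of it keeps its typed interface, which is exactly the insulation the serialization libraries give you for free.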

I’ve avoided talking about specific data processing frameworks, and about what I’ve learnt about scaling to handle large workloads. I might write a bit on both of these later, but there’s a reason I chose to lead with semantics, operational issues and software quality. It’s because I’m slowly coming around to the view that handling scale is just one aspect of building data pipelines, and perhaps not even the most significant one. My view might be shaped by when I entered this field, but as far as I know, for almost a decade the IT industry, from hardware vendors to cloud providers to developers of data processing systems, has been working on scalability and has made significant progress. So as long as you don’t make egregiously bad choices you will do fine for the average use case. As for specific technologies, there is enough material out there from the vendors themselves, and this interactive map shows just how vast the space has become.

What isn’t going away, however, is the need to define clear semantics for your business-related data flows, the need to engineer quality into the custom parts you’ve built, and the need to manage the interactions between the different systems within your organization, keeping in mind that things may not work as intended.

If you have enjoyed reading this and want to work on such problems, do reach out to me (pramodbiligiri at either yahoo or gmail). I’ve recently joined Uber’s engineering team in India and we’re looking for more people to help us with the challenges we are facing!
