This much is true: big data is cheaper and faster, but harder; data scientists have the skills, and they are in great demand. But it’s only partially true. Yes, data scientists make a huge difference, but only if they have the backing of data engineers.
Here’s a one-sentence definition: a data scientist is someone who uses advanced mathematical and statistical problem-solving skills to tackle difficult business problems using data.
These are smart people with lots of supporting skills, principally domain knowledge of the business. But just as science and engineering are distinct, complementary disciplines, data science needs data engineering to succeed.
Can’t data scientists do it all? Yes, they could. The breadth of their statistical skills, mathematical imagination, domain knowledge, academic credentials, intellectual versatility — and, why not, coding chops — is unquestionably applicable to the upstream side of the data supply. And in a small startup with fewer than a dozen people, having your data scientist build and run the data platform might be a good place to start.
Should your data scientist build and run the data platform? That’s a different question. A good look at the division of labor with data engineering helps explain why. Top-to-bottom, end-to-end system ownership might not be the ideal place to apply the lion’s share of your data science talent.
Data: input to the science of solving problems
A data scientist is only as good as her data. With good data, she can solve problems faster, accelerate algorithmic curiosity, or use machine learning and AI for self-directed modeling in the service of problem-solving.
The complexity and heterogeneity of big data lend themselves to systematic, model-oriented thinking. Big data also benefits from heavy automation, which is why data scientists have picked up some pretty significant programming skills along the way.
That’s both good and bad news. Business problems don’t stand still, and neither does their data. The data scientist learns by continuously refining her understanding of the business problem, and that refinement requires plenty of algorithmic sophistication. But no matter what programming techniques she applies to automate advanced analytics, she is primarily engaged in solving business problems.
Of course, not all programming problems are data science problems. This is where the data engineer enters the picture, not as a luxury but as a necessity. Let’s take a closer look at the data engineering role in comparison to the data scientist’s.
Figure 1: Skills comparison for data engineers and data scientists
Data Engineers: Software Engineers + Data
The main role of the data engineer is to build a scalable, resilient, reliable distributed system that processes the data required by the business. That part isn’t a new job. What is new is that the data engineer isn’t just feeding the data directly to the business; it’s his job to feed the data to the data scientist and make it easier for her to rely on it.
A data engineer has specialized, advanced programming skills: the construction and operation of distributed systems that add up to a large-scale, well-architected data infrastructure. The engineering challenge is to put together the dozen or two technology subsystems required to make big data work.
Typically, data engineers have deep programming skills in Java, Scala, or Python. That’s not to say that data engineers rely only on these three languages. What matters is how they combine these tools and technologies to build a large-scale distributed environment that is
- robust in the face of continuous change
- reliant on distributed asynchronous communication, both for apps and for infrastructure like networking
- tolerant of variable latency from third-party services, feeds, and APIs (see the sketch below)
- and composed of many distinct technologies often not designed to work well together.
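To make the latency point concrete: here’s a minimal sketch, assuming Python and the widely used requests library, of the defensive wrapper a data engineer might put around a flaky third-party feed. The function name and retry policy are illustrative, not from any particular system.

```python
import random
import time

import requests  # assumed available: pip install requests


def fetch_feed(url: str, max_retries: int = 5, timeout: float = 10.0) -> dict:
    """Fetch a third-party feed despite variable latency and transient failures."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            # Exponential backoff with jitter so retries don't stampede the service.
            time.sleep((2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError(f"feed unavailable after {max_retries} attempts: {url}")
```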
If that sounds like managing cloud apps and infrastructure, you get the idea. And there’s one more thing.
Engineering for data pipelines
Job 1 for Data Engineering in building and operating a distributed system: Don’t. Screw. Up. The. Data.
Always getting the data right is not a new problem. Back in the day, the complexity of record locking and two-phase commit was hidden beneath the covers of an expensive, monolithic, single-app transaction processing architecture.
Today, data means multiple data types from various sources with different behaviors, many of which you don’t control. The distributed system has to keep data from getting lost and ensure there is never a conflict over which transaction is the correct one, like whether the hotel room you canceled should still be on your credit card. It has to work despite imprecise control over the timing of inbound data, reads vs. writes, hardware outages, network contention, undocumented changes in supporting services, and so on.
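What does getting the data right look like in code? Here’s a minimal sketch, assuming Python and a made-up event shape, of idempotent, last-write-wins ingestion: a duplicated or out-of-order event (that canceled hotel room arriving twice, or late) cannot corrupt the record.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_id: str     # unique per event, so replays can be detected
    entity_id: str    # e.g., the booking this event refers to
    timestamp: float  # when the event actually happened
    payload: dict


class IdempotentStore:
    """Safely apply events that may arrive late, twice, or out of order."""

    def __init__(self) -> None:
        self._seen: set[str] = set()  # event_ids already applied
        self._state: dict[str, tuple[float, dict]] = {}

    def apply(self, event: Event) -> None:
        # Duplicate delivery: applying the same event twice is a no-op.
        if event.event_id in self._seen:
            return
        self._seen.add(event.event_id)
        # Out-of-order delivery: keep only the newest version of the entity.
        current = self._state.get(event.entity_id)
        if current is None or event.timestamp > current[0]:
            self._state[event.entity_id] = (event.timestamp, event.payload)
```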
End-to-end change management is another critical component when change is the one constant in the world of data. It means continuously experimenting with workloads in the distributed public or private cloud, plus test automation, managed promotion to production, and operational monitoring. All of these critical engineering jobs make the difference between happy data scientists and the ones who leave to work elsewhere.
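In practice, much of that change management reduces to automated checks that run before a batch is promoted to production. A minimal sketch, with hypothetical thresholds and a hypothetical user_id column, of the kind of quality gate that catches a bad batch before a data scientist does:

```python
def quality_gate(rows: list[dict], min_rows: int = 1000, max_null_rate: float = 0.05) -> list[dict]:
    """Refuse to promote a batch that looks wrong; thresholds here are illustrative."""
    if len(rows) < min_rows:
        raise ValueError(f"batch too small: {len(rows)} rows (expected >= {min_rows})")
    # Null-rate check on a required field; 'user_id' stands in for any key column.
    nulls = sum(1 for r in rows if r.get("user_id") is None)
    null_rate = nulls / len(rows)
    if null_rate > max_null_rate:
        raise ValueError(f"null rate {null_rate:.1%} exceeds limit {max_null_rate:.0%}")
    return rows  # safe to promote downstream
```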
These problems are not limited to cloud applications like mobile gaming or streaming music. In the big data/data science world, it’s all about building Data Pipelines.
What’s so hard about building data pipelines
Data pipeline? Seems simple enough:
1. Ingest. (“Check.”)
2. Crunch with data science. (“Check.”)
3. Feed output to the business. (“Check.”)
(Extra credit: Feed it to machine learning).
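In code, the naive version really is that short. A sketch with placeholder functions standing in for each stage:

```python
def ingest() -> list[dict]:
    """Stand-in for real connectors pulling raw records from a source."""
    return [{"user_id": i, "value": i * 2} for i in range(10)]


def crunch(records: list[dict]) -> float:
    """Stand-in for the data science step: any model or aggregation."""
    return sum(r["value"] for r in records) / len(records)


def feed(result: float) -> None:
    """Deliver output to the business; here, just print it."""
    print(f"metric for the business: {result}")


feed(crunch(ingest()))  # looks simple; the rest of this section is why it isn't
```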
The reality is this: each data pipeline is a complex, ever-changing distributed application. Here is what makes it complicated:
- Data inputs are always changing. Data from internal applications is as diverse as the problems those applications solve. Third-party sources like APIs and social media feeds change constantly. Reliable ingestion has to adjust to all of that continuously, with constant monitoring of data quality and solid operational procedures for responding to changes (see the sketch after this list).
- “Crunch” is not a single-step algorithm. Processing data does not end with ingest. The application of data across many problem-solving frontiers means constant change and additional steps. Batch processing is the obvious example, and real-time processing adds many steps between raw data and usable output. Add the resource footprint of processing, as well as multiple destinations that change over time (e.g., elastic Dockerized microservices).
- The single source of truth can drown in the data lake. The data pipeline needs to serve multiple consumers. Some of the results go to data scientists in pursuit of algorithmic curiosity; some of their theories don’t work out, and that means changes upstream in the pipeline. And because every business today is data-driven, Data Engineering also supports analytics that aren’t directly the province of data science, such as dashboards for end customers, or even executives.
- Machine learning and artificial intelligence change the game. Automating the modeling process, whether in extracting conclusions (machine learning) or identifying the models in tandem with the recommendations they generate, is even more sensitive to robust distributed data infrastructure, and the operational discipline of managing adaptive processes places even more emphasis on the distributed-systems skills of the data engineer.
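Here’s the first item in miniature: a sketch, assuming Python and a hypothetical field contract, of an ingestion check that notices upstream schema drift instead of passing it silently downstream.

```python
EXPECTED_FIELDS = {"user_id", "event_type", "timestamp"}  # hypothetical contract


def check_schema(record: dict) -> dict:
    """Flag drift in an upstream feed before it poisons the pipeline."""
    missing = EXPECTED_FIELDS - record.keys()
    extra = record.keys() - EXPECTED_FIELDS
    if missing:
        # Hard failure: downstream steps depend on these fields existing.
        raise ValueError(f"upstream schema drift, missing fields: {sorted(missing)}")
    if extra:
        # Soft signal: surface new fields for review rather than dropping them silently.
        print(f"schema change detected, new fields: {sorted(extra)}")
    return record
```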
Building and operating data pipelines is truly the core competence of the data engineer. With data engineering skills, you have the capacity to solve the unique problems of release management, multi-system synchronization, testing and monitoring, deployment architecture, microservices, and real-time recovery from system failures. And if you’re counting on your data scientists to do all that, when exactly are they going to be doing data science?
Beyond data prep to data engineering
In its early stages, any data pipeline can be construed as an exercise in data prep. And of course data scientists are skilled in data preparation. Data prep has always been about enabling the science required to solve problems (well before data science was a thing).
But today, data engineering is a more effective way to do data prep, especially given the scarcity of data science talent. Yes, a data scientist can and should be part of the exercise of writing code for data pipelines. But just because someone can write software doesn’t mean they should be available to code anything at any time.
It makes more sense to give the problem of creating a reliable, resilient, and dynamic data supply its own specialty. Let that rare breed, the data scientist, focus on Job 1: using advanced mathematical and statistical problem-solving skills to tackle big business problems. Backing that effort up with skilled data engineers gets business problems solved faster.