By now, the consensus view of Big Data is that bigger doesn’t translate into better. It’s a more subtle irony that in the face of bigger and faster data, data science is still vexingly difficult. It’s not in the data or in the science; it’s in formulating an approach to data science pipelines.
One of the more challenging applications is using healthcare data to address chronic disease. At CloudGeometry, we worked with Gali Health to create a data science pipeline architecture for the unique challenges of healthcare data. It required integration across different kinds of healthcare business processes and different points of data collection. The most significant challenge is that the data itself is subject to constant change. Let’s take a closer look at where this problem comes from, and how a data science pipeline approach helps.
Beyond BI basics in classic healthcare
Let’s look at a highly simplified healthcare use case:
- A patient checks in to visit a doctor at a hospital
- The doctor examines the patient and puts information into an Electronic Medical Record (EMR)
- Data capture in the record kicks off interactions with third-party systems such as billing, insurance, physician payment, etc.
- The doctor’s prescription gets sent to the pharmacy, where the patient picks it up
At the highest level, this use case spans a well-scoped set of systems. Each solves a separate set of business problems, and the data they capture is generally related. Drawing on that commonality of data makes it possible to do some analysis across those systems. Data is aggregated in a data warehouse or data lake. Making sense of what happens in a particular business process — such as how many patients a doctor sees — is a classic problem for business intelligence (BI).
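To make the BI baseline concrete, here is a minimal sketch of the kind of aggregation such a warehouse query performs; the visit records and field names are hypothetical:

```python
from collections import Counter

# Hypothetical visit records, as they might land in a data warehouse
# after aggregation from the EMR and scheduling systems.
visits = [
    {"visit_id": 1, "doctor": "Dr. Adams", "patient": "p-101"},
    {"visit_id": 2, "doctor": "Dr. Adams", "patient": "p-102"},
    {"visit_id": 3, "doctor": "Dr. Baker", "patient": "p-101"},
]

# The classic BI question: how many patient visits does each doctor handle?
visits_per_doctor = Counter(v["doctor"] for v in visits)
print(visits_per_doctor["Dr. Adams"])  # 2
```

A BI tool would express the same count as a GROUP BY over the warehouse tables; the point is that the question stays inside one well-scoped dataset.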
However, this approach quickly falls apart in the real world. It’s not just because the systems are inherently complex or costly (and they are). The problem is that neither healthcare nor the patients fit this model. Healthcare does not begin and end with one or more visits to the doctor. This is even more challenging with chronic diseases, where care must be managed continuously because the disease can only be controlled, not cured.
The modern health care data landscape
The technology marketplace has addressed this unmet need with a huge array of data-driven technologies. These applications address both healthcare problems that end in recovery and chronic diseases. The landscape of analytics becomes far more daunting.
The first-order problem is that business processes that were once amenable to classic business intelligence have become much more complex, with many specialized sources of instrumentation and interaction. The more challenging element is the pattern of many-to-many connections between all these systems, which now must communicate data to each other.
Within this new ecosystem of bigger, faster healthcare data, the constituent systems must deal with continuous change on three major levels.
- Infrastructure Changes. The configuration of compute, storage, and networking on which applications run is constantly changing. Various services solve for various problems, and they don’t all change at the same time. Side effects of changes to one piece of infrastructure — e.g. the transition from IPv4 to IPv6 — can propagate in unforeseen ways.
- Application Changes. Improvements in the logic with which each application solves its isolated problems — as simple as automating a process like checking in to the doctor’s office with a smartphone app — can also propagate in unforeseen ways.
- Data Changes. Very few, if any, healthcare systems are self-contained. There is always a set of third-party data flows where changes are invisible — until the systems that rely on those data flows are blindsided. Cloud applications have accelerated this phenomenon. Changes to data structure and semantics materialize in unexpected ways, often uncovered only when data quality breaks down.
The infrastructure management world thinks of the first two as DevOps: application code written by developers is tested in concert with the compute infrastructure managed by operators prior to release, and validated continuously.
When one does not control the infrastructure, as with ingest from third-party applications, it adds a third dimension that has come to be called DataOps. Dataflow management tools like StreamSets and Talend provide an environment in which these three layers of accelerating complexity can be focused on the flow of analytic data.
Advanced analytics & healthcare data for chronic disease management
And then there are chronic diseases, which don’t fit neatly into the healthcare delivery paradigm. The line between illness and wellness doesn’t divide neatly between visits to the doctor. This is the thesis behind Gali Health, an innovative startup targeting patient-centered care with a data-driven model.
Gali Health aspires to organize data from a wide range of sources it does not own into what amounts to data pipelines focused on the information patients, physicians, and researchers need to consistently improve outcomes.
- Process data aggregation. The FHIR standard describes data formats and elements for exchanging EMR data, using a modern web-based suite of API technology to facilitate interoperation between legacy healthcare systems. FHIR provides an alternative to document-centric approaches by directly exposing discrete data elements as services. At Gali Health, we used the FHIR protocols to drive dataflows that integrate basic elements of healthcare like patients, admissions, diagnostic reports, and medications. These can be retrieved and manipulated via their own resource URLs originating on clinical systems, and are ingested through dataflows into depersonalized databases to support research.
- Machine Learning. Health information traditionally has been based on published literature, with observations documented by clinicians. Gali Health turns the process on its head, using updates by each patient, from basic health and lifestyle information to more specific clinical information (treatments, diagnosed conditions, recent tests, etc.). Machine learning creates models from the data, rather than pre-programmed business processes suited to a clinical-only setting. Those user-originated dataflows, even though they are depersonalized, can create models that describe key parameters associated with a specific health context. For example, for individuals affected by Crohn’s disease, it is important to understand the stages of the disease and its subtypes, responses to known treatments, procedures, and tests, and the types of symptoms. This data gives researchers an ever more granular approach to making sense of disease dynamics outside of the doctor’s office setting.
- Artificial Intelligence. Gali Health’s greatest ambition is to move beyond depersonalized modeling to provide personalized support for each person in their health journey. The Gali AI is based on an in-depth analysis of the experiences of individuals who are dealing with prevalent chronic conditions. It renders a behavioral model including clinical, emotional, psychological, and educational components designed to provide users with comprehensive and personalized support. All of this information will enable Gali to intelligently support the patient in the context of their day-to-day experience.
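As an illustration of the depersonalization step that feeds these research dataflows, the sketch below parses a simplified FHIR Patient resource and strips direct identifiers; the JSON here is abridged for illustration rather than a complete R4 example, and in practice the resource would be fetched from its resource URL on the clinical system:

```python
import json

# A simplified FHIR Patient resource (illustrative, not a full R4 example).
patient_json = """{
  "resourceType": "Patient",
  "id": "example",
  "name": [{"family": "Chalmers", "given": ["Peter"]}],
  "birthDate": "1974-12-25",
  "gender": "male"
}"""

def depersonalize(resource: dict) -> dict:
    """Strip direct identifiers before the record enters a research database,
    keeping only the coarse attributes needed for modeling."""
    keep = {"resourceType", "gender"}  # hypothetical keep-list for this sketch
    out = {k: v for k, v in resource.items() if k in keep}
    # Generalize birth date to birth year to reduce re-identification risk.
    if "birthDate" in resource:
        out["birthYear"] = resource["birthDate"][:4]
    return out

record = depersonalize(json.loads(patient_json))
print(record)  # {'resourceType': 'Patient', 'gender': 'male', 'birthYear': '1974'}
```

A production pipeline would apply a vetted de-identification policy rather than this hand-rolled keep-list, but the shape of the transform is the same: drop identifiers, generalize quasi-identifiers, keep the clinical signal.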
In each of these scenarios, a key goal is to facilitate the acquisition of data for exploratory analysis, in effect doubling down on data science for scientific goals.
Data pipeline strategy for bigger/faster medical data
The rich many-to-many relationships of the modern medical data ecosystem require a twofold approach to dataflow management, centered on automated workflows. It’s a virtuous cycle of pipeline development and dataflow monitoring. Pipeline automation can be applied to data integration in the following ways:
- Pipeline Development. Configurable connectors and data transformations facilitate reuse of common or standardized pipeline logic. They also speed dataflow design and minimize the need for specialized coding skills.
- Pipeline Deployment. New pipeline templates and logic are created and deployed into production together with application and configuration changes, subject to the same build and validation constraints.
- Pipeline Scaling. Variations in volume, periodic or one-time, call for a data pipeline approach that goes hand-in-hand with cloud infrastructure. Its elastic compute, networking, and storage can respond to temporary ebbs and flows without introducing unnecessary costs.
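The reuse of configurable connectors and transformations described above can be sketched as composable functions; the transform names and record shapes here are hypothetical stand-ins for what a dataflow tool provides out of the box:

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Transform = Callable[[Record], Record]

def rename(old: str, new: str) -> Transform:
    """A configurable, reusable transformation: rename one field."""
    def apply(rec: Record) -> Record:
        rec = dict(rec)
        rec[new] = rec.pop(old)
        return rec
    return apply

def lowercase(field: str) -> Transform:
    """Another reusable building block: normalize a field's case."""
    def apply(rec: Record) -> Record:
        rec = dict(rec)
        rec[field] = rec[field].lower()
        return rec
    return apply

def pipeline(transforms: List[Transform], records: Iterable[Record]) -> List[Record]:
    """Run every record through the configured chain of transforms."""
    out = []
    for rec in records:
        for t in transforms:
            rec = t(rec)
        out.append(rec)
    return out

# Configuring a pipeline means choosing and parameterizing building blocks,
# not writing specialized code for each dataflow.
cleaned = pipeline([rename("DX", "diagnosis"), lowercase("diagnosis")],
                   [{"DX": "Crohn's Disease"}])
print(cleaned)  # [{'diagnosis': "crohn's disease"}]
```

Because each transform is parameterized rather than hard-coded, the same logic can be redeployed against a new dataflow by changing configuration, which is what speeds design and reduces the need for specialized coding skills.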
The criticality of healthcare data requires a higher level of reliability in the dataflows. At Gali Health, we built this out by continually monitoring the performance of data movement, which creates the feedback loop required for both agility and high reliability.
- Data movement performance. Instrument and monitor the entire data movement architecture for throughput, latency, and error rates, end-to-end or point-to-point.
- Data characteristics. Logic that inspects data in-flight can flag changes in the structure of the data (added/deleted fields), the semantics of the data (changed type or meaning), and data values (skew and noise), any of which can serve as event triggers. Remediation can range from retries to taking the flow offline for a development fix.
- Dataflow utilization. Making sense of performance and problems, either in data characteristics or resources, can provide valuable feedback for logic in downstream analytic systems. For example, pre-processing in lower-cost data storage can save runtime compute cycles.
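The in-flight inspection described above can be sketched as a schema check against each record; the expected schema and field names here are hypothetical, and a dataflow tool would express the same checks declaratively:

```python
# Hypothetical expected schema for an incoming dataflow: field name -> type.
EXPECTED = {"patient_id": str, "heart_rate": int}

def inspect(record: dict) -> list:
    """Flag structural and semantic drift in an in-flight record.
    The returned flags can serve as event triggers for retries or
    for taking the flow offline for a development fix."""
    flags = []
    for field in record.keys() - EXPECTED.keys():
        flags.append(f"added field: {field}")
    for field in EXPECTED.keys() - record.keys():
        flags.append(f"deleted field: {field}")
    for field, expected_type in EXPECTED.items():
        if field in record and not isinstance(record[field], expected_type):
            flags.append(f"type change: {field}")
    return flags

# A third-party feed silently changed heart_rate to a string and added a unit.
flags = inspect({"patient_id": "p-101", "heart_rate": "72", "unit": "bpm"})
print(sorted(flags))  # ['added field: unit', 'type change: heart_rate']
```

This is exactly the third-party-data scenario from earlier: the upstream change is invisible until inspection logic like this surfaces it, before downstream analytics are blindsided.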
In the context of data science, data pipelines need an even greater degree of agility. In more advanced cases, this entails pointing the pipeline efforts at more sophisticated analytic approaches, i.e., ML and AI. But even with more straightforward modeling and transformation, data science as practiced across a broad spectrum of sources must be able to adjust to iterative data manipulation and exploration. The science is as much in the connections across bigger and faster data as it is in the data itself.