Building a data analytics platform in an event-driven world

ClearBank largely follows a microservices-based approach where we rely on the transfer of data between various domains using events in an “Event-Carried State Transfer” model. Using these ‘events’ to build our data analytics platform was a good starting point and enabled us to quickly start ingesting data into our platform. However, using events directly introduces some challenges related to re-processability and data accuracy that needed to be addressed. 

Simple architecture 

The simple architecture to build an analytics model on our event data looks like this: 

Diagram 1: Simple architecture layout

Here’s an explanation of each section of the simple architecture layout: 

  • Domain - Responsible for mastering data and for publishing events every time the state of the managed data changes 
  • Eventing service - Responsible for providing a mechanism to exchange data as messages with guarantees of “at least once” delivery 
  • Streaming pipelines - Responsible for transforming and persisting the incoming events into business data models for analytics, machine learning etc. 
  • Model store - Responsible for holding the business models for analytics. 

This simple architecture helps us read, process and shape data for analytics with near real-time performance. But with this architecture, it is difficult to reprocess events and validate data to make sure it’s accurate enough and meets our high data quality standards at ClearBank.

Reprocessing events 

Challenge 

Our simple architecture lacks any ability to reprocess historical data and generate a business model from scratch. Some  reprocessability features we need our architecture to support are:

Backfilling of a model - At times, a new analytical business model is needed which requires not just new incoming events, but also historic events. This is not possible by only creating a new “streaming pipeline” for that model. 

Re-populate a model - In the unlikely event a “buggy” code deployment causes a model to be corrupted, we need the ability to correct the data after the bug is fixed. Again, this is not possible by only having the “streaming pipelinesfor that model. 

Solution 

The solution to the above challenge had two parts: 

Immutable store for raw events - We stop relying on the “eventing service” for the historical view of events and build our own immutable store. This is achieved by adding a new data streaming processing pipeline called the “ingestion streaming pipeline” that is responsible for off-loading events from the “eventing service” and storing it in an immutable (append only) store. Any further processing of data by the “streaming pipelines” is done by streaming from this immutable store – not from the “eventing service” directly. 

Repopulate with pipelines - Now that we have all our data being persisted in our immutable store, we want the ability to reprocess large amounts of historic data quickly and in a fault-tolerant fashion. This is achieved by creating new short-lived, on-demand processes called “repopulate pipelines”. These processes are created for each of the models. And rather than duplicating logic, these processes are created by configuring the existing “streaming pipelines” to run in batch mode, ensuring maintainability. Since the repopulate, jobs are short-lived processes, we could allocate more resources to them, ensuring speed while keeping the overall cost in check (as seen in diagram 2). 

Diagram 2: Adding the repopulate pipelines and immutable store

Accuracy 

Challenge 

Using events to exchange data can lead to systems getting out of sync, both due to the inherent asynchronous nature of this communication model and due to the addition of another layer of abstraction above the data. Our “simple architecture” not only lacks the ability to validate the data ingested is correct, but it also lacks the ability to ensure that the models generated via the pipelines are accurate. Some of the reasons why the systems could get out of sync are: 

Publishing failures at the domain - In the event a domain fails to publish an event, the data platform would be missing data and be out of sync with the source system (see diagram 3). While this would only happen in exceptional circumstances (e.g. events are wrongly published outside of an atomic transaction), it is critical that we identify any issues before presenting poor quality data to the Data Platform users.  

Diagram 3: Publishing failures at the domain lead to data being inaccurate

Incorrect processing in the pipelines - We must also be prepared for unexpected situations where data is processed incorrectly and written to the “model store”. For example, the“streaming pipelines” might incorrectly ignore events due to their late arrival (see diagram 4).

Diagram 4: Streaming pipelines incorrect processing led to data being inaccurate

Solution 

The solution to the above problem has two parts: 

Aggregated data provided by domains - We need every domain to provide aggregated data based on the state of data within their domain. This is achieved by setting up new processes within each domain (see diagram 5), these processes are responsible for publishing events containing counts and checksum for the data the domain is responsible for. 

Diagram 5: domains publishing aggregate data

Reconciliation pipelines - We then compare the aggregated data from domains against the model data generated by the “streaming pipelines”. This is achieved by creating a “reconciliation pipeline” for every “streaming pipeline”. Since each “streaming pipeline” was responsible for generating its own model, we decided to give it the ability to validate its model for correctness as well (see diagram 6).

Diagram 6: reconciliation check pipelines to validate aggregate domain data

Putting it all together 

The diagram below illustrates the final architecture implemented at ClearBank. When compared to the “simple architecture” (diagram 1), we have introduced an additional event store as well as three new types of pipelines; the ingest, repopulate and reconciliation pipelines.  

Diagram 7: Final architecture with event store, repopulate pipelines and reconciliation pipelines

The below explains each section of the final architecture layout.

  • Domain - Responsible for mastering data and for publishing events every time the state of the managed data changes.
  • Aggregate data publisher process - Responsible for generating an aggregated (counts, checksum) view of the data mastered in the domain.
  • Eventing service - Responsible for providing a mechanism to exchange data as messages with guarantees of at least once delivery.
  • Ingestion pipeline - Responsible for reading events from the eventing service and persisting it to a store.
  • Immutable store - Responsible for holding all the events generated.
  • Streaming pipelines - Responsible for transforming and persisting the incoming events into business data models for analytics, machine learning etc.
  • Repopulate pipelines - Responsible for re-creating an analytical business model from scratch based on the ingested events.
  • Reconciliation pipeline - Responsible for validating a generated analytical model for correctness.
  • Model store - Responsible for holding the business models for analytics.

These additions allow us to continue to leverage the eventing data being published by ClearBank’s domains, while also ensuring we can reprocess data as well as validate the accuracy of the analytics data we were providing.

Conclusion 

If you have any questions, please get in touch with Yusuf Jamali.  

Yusuf Jamali

Senior Software Engineer, ClearBank