Building a data processing pipeline is one of the most common problem statements, for which you would have written small scripts or built a full-fledged scalable system based on the amount, and frequency of data. In this article, we will talk about the idea of event-driven scalability, the backbone that will be cost-optimized, and requires a minimum amount of development and operations.
Why build an event-driven and scalable data processing pipeline?
While working with startups or building a team project or a personal project, that requires a pipeline for data processing, there is always a constraint of cost. There are 2 ways to tackle this:
Problem Statement Examples:
We will go with a simple example for building a metadata extraction system for e-commerce products. The metadata then is used to provide e-commerce solutions offered by Streamoid. The problem statement is that the client will provide their data. Metadata needs to be extracted, in order for internal services to work. Just for easy understanding, the internal systems need standard data to function, while the data received from clients is raw and may not adhere to the standards. So this calls for a data processing pipeline to convert the raw data to standard metadata that is understood well by the internal systems.
This metadata will be used by various systems across the organization to build several products such as (similar and outfitter recommendations), provide marketplace equivalent data (Amazon, Myntra, Tatacliq, etc.), train multiple classifiers, understand trends and behavior.
Let’s see an example of the extraction of metadata for an e-commerce product:
Looks simple? Yes, the task is simple:
The challenges come when we combine this data processing pipeline with a business use case:
Before we move ahead, let’s see the GIF of the architecture’s backbone, which we will be covering in this article. This is for the 2 processes (Text and Image Classification) in the data processing pipeline.
The below diagram shows 2 iterations, one is for scaling when processing 10 products are there, and the other when 100,000 products are there.
In this architecture we have used:
The diagram can be divided into 4 major components. Let’s discuss each component in detail.
A no-SQL shared database is used, to handle easy scalability for a huge number of read and write operations. As you can observe the database here is just like a plugin, if read and write need to happen to separate or multiple databases, that can be easily plugged in with Push and Pull services which also is deployed on Cloud Run.
An open-source message broker. There are other message brokers as well such as Kafka, or cloud-bound message brokers such as pub-sub, etc. But the reason to choose rabbitmq:
Other message brokers could also be used. The choice of message broker depends on the use case.
Google Kubernetes with KEDA
Google Kubernetes cluster is used here. Here, we have used an autopilot cluster, which means that there is no need to assign nodes to a cluster for autoscaling. The resources scaling is managed by the cluster itself that will create and remove new pods.
We have used KEDA (Kubernetes Event-Driven Scaling). The reason to do this is rabbitmq. If scaling needs to happen on memory/CPU usage or HTTP hits, then scaling is easy. But here scaling needs to happen based on the number of messages in the queue. Also, rabbitmq will not send an invocation, it will send messages to already connected consumers. To tackle this challenge we used KEDA, which will keep track of the number of messages in the queue and accordingly bring up new pods (rabbitmq consumers) in the Google Autopilot Clusters.
Google Cloud Run
Cloud run is used here to handle all the heavy computing such as classifiers. This scales down to 0 and scales based on HTTP invocations. In any pipeline, the major cost incurs from hiring the servers for heavy compute processing and keeping it up even while there is no usage for it. With Cloud Run, you can deploy your heavy compute services as a containerized solution, but only pay when you are using it. There are zero amount of operations involved once you deploy it. Here, we can use an Amazon Elastic Container Service (ECS) which is comparable to Cloud Run or if the GPU requirement is there, we can build our service on a GPU node. Basically, it is dependent on the service compute requirements. We use all three GCR, ECS, and GPU nodes based on requirements.
I would also like to mention that Cloud Run is probably the best cloud solution among all the cloud scaling solutions provided. The primary reason is it has all the functionality and is developer-friendly with the deployment of literally 2 lines of code.
Also, the limitations of this kind of design are on scaling the heavy computing resources.
Why build this kind of architecture?
This architecture represents the backbone and the idea of having event-driven scaling. The actual systems are much more complex than this.
Let us know your thoughts on this.