Build event-driven data quality pipelines with AWS Glue DataBrew

As businesses collect more and more data to drive core processes like decision making, reporting and machine learning (ML), they continue to be met with difficult hurdles. Ensuring data is fit for use, with no missing, malformed or incorrect content, is a top priority for many of these companies – and that’s where AWS Glue DataBrew steps in!

Let’s build a fully automated, end-to-end, event-driven pipeline for data quality validation.

Now that’s a subtitle that gets us excited! As with any data journey, the road to structured, intelligent and readily queryable data can be a rocky one.

AWS Glue DataBrew is a visual data preparation tool that makes it easy to find data quality statistics such as duplicate values, missing values and outliers in your data. You can also set up data quality rules in DataBrew to perform conditional checks based on unique business needs.
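
To make that concrete, below is a minimal sketch using the AWS SDK for Python (boto3) that defines a ruleset with two conditional checks, then a profile job configured to validate against it. All names, ARNs and thresholds here are hypothetical rather than values from the solution, and the check-expression syntax should be confirmed against the DataBrew documentation:

```python
import boto3

databrew = boto3.client("databrew")

# Define a ruleset holding the conditional checks. The names, dataset
# ARN and thresholds below are illustrative assumptions.
databrew.create_ruleset(
    Name="orders-quality-ruleset",
    TargetArn="arn:aws:databrew:eu-west-1:111122223333:dataset/orders",
    Rules=[
        {
            # Fail if any column contains missing values.
            "Name": "no-missing-values",
            "CheckExpression": "AGG(MISSING_VALUES_PERCENTAGE) == :val1",
            "SubstitutionMap": {":val1": "0"},
            "ColumnSelectors": [{"Regex": ".*"}],
        },
        {
            # Fail if the dataset contains duplicate rows.
            "Name": "no-duplicate-rows",
            "CheckExpression": "AGG(DUPLICATE_ROWS_COUNT) == :val1",
            "SubstitutionMap": {":val1": "0"},
        },
    ],
)

# Create a profile job that validates the dataset against the ruleset.
# The output bucket is deliberately distinct from the source data
# bucket, so the profile output cannot re-trigger the pipeline.
databrew.create_profile_job(
    Name="orders-profile-job",
    DatasetName="orders",
    RoleArn="arn:aws:iam::111122223333:role/DataBrewServiceRole",
    OutputLocation={"Bucket": "orders-profile-output"},  # not the source bucket
    ValidationConfigurations=[
        {
            "RulesetArn": "arn:aws:databrew:eu-west-1:111122223333:ruleset/orders-quality-ruleset",
            "ValidationMode": "CHECK_ALL",
        }
    ],
)
```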

High-level architecture of the event-driven pipeline, highlighting the AWS Step Functions workflow and the surrounding AWS services.
 
Step by step
 
The solution workflow contains the following steps:
 
1. When you upload new data to your Amazon Simple Storage Service (Amazon S3) bucket, events are sent to Amazon EventBridge.
 
2. An EventBridge rule triggers an AWS Step Functions state machine to run (a sketch of this wiring follows this list).
 
3. The state machine starts a DataBrew profile job, configured with a data quality ruleset and rules (the sketch above shows a profile job configured this way). If you’re building a similar solution, make sure the DataBrew profile job’s output location is a different S3 bucket from the source data bucket; otherwise the profile output would itself trigger the pipeline and cause recursive job runs. We deploy our resources with an AWS CloudFormation template, which creates separate, unique S3 buckets.
 
4. An AWS Lambda function reads the data quality validation results from Amazon S3 and returns a Boolean response to the state machine: false if one or more rules in the ruleset fail, and true if all rules succeed (a sketch of such a function follows this list).
 
5. If the Boolean response is false, the state machine sends an email notification through Amazon Simple Notification Service (Amazon SNS) and ends in a failed status. If the Boolean response is true, the state machine ends in a succeeded status. You can also extend this step to run other tasks on success or failure. For example, if all the rules succeed, you can send an EventBridge event to trigger another transformation job in DataBrew.
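
Here’s a minimal boto3 sketch of the wiring from steps 1 and 2: enabling S3-to-EventBridge notifications on the source bucket, then creating a rule that starts the state machine. The bucket, rule, role and state machine names are hypothetical:

```python
import json
import boto3

events = boto3.client("events")
s3 = boto3.client("s3")

SOURCE_BUCKET = "my-source-data-bucket"  # hypothetical bucket name

# S3 only sends events to EventBridge once this is enabled on the bucket.
s3.put_bucket_notification_configuration(
    Bucket=SOURCE_BUCKET,
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)

# Match "Object Created" events from the source bucket only.
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": [SOURCE_BUCKET]}},
}

events.put_rule(
    Name="start-data-quality-pipeline",  # hypothetical rule name
    EventPattern=json.dumps(pattern),
)

# Point the rule at the state machine. EventBridge needs an IAM role
# that is allowed to call states:StartExecution on this state machine.
events.put_targets(
    Rule="start-data-quality-pipeline",
    Targets=[
        {
            "Id": "data-quality-state-machine",
            "Arn": "arn:aws:states:eu-west-1:111122223333:stateMachine:DataQuality",  # hypothetical
            "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeStartExecution",  # hypothetical
        }
    ],
)
```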
 
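And a minimal sketch of the step 4 Lambda handler. The event shape (where the report lives) and the validation report’s JSON schema are assumptions here; inspect the actual report file your profile job writes before relying on specific keys:

```python
import json
import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # Assumed input from the state machine: the location of the data
    # quality validation report written by the DataBrew profile job.
    bucket = event["reportBucket"]  # hypothetical input field
    key = event["reportKey"]        # hypothetical input field

    obj = s3.get_object(Bucket=bucket, Key=key)
    report = json.loads(obj["Body"].read())

    # Assumed report schema: a list of per-rule results, each with a
    # status field. The run passes only if every rule passed; an empty
    # result list is treated as a failure.
    rule_results = report.get("rulesetResults", [])
    return bool(rule_results) and all(
        r.get("status") == "SUCCEEDED" for r in rule_results
    )
```

A Choice state in the state machine then branches on this Boolean, routing to the SNS notification on failure or to the succeed state otherwise.
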
You can follow the exact technical steps by visiting the AWS blog post here. AWS has put together a thorough and detailed approach to testing this solution, even providing an AWS Serverless Application Model (AWS SAM) template and example code.
 

Metin Alisho, Data Scientist at Firemind, says: “Event-driven pipelines make up the ‘bread and butter’ of many of our customer solutions. This AWS Glue DataBrew-driven solution makes finding quality data among the outliers and duplicate values simple.”

Transforming your data, one column at a time!

Solutions like this one, built with tools such as AWS Glue DataBrew, mark just the beginning of what’s possible for data management and interpretation. Get in touch with us today to see how we can work with your data.

Get in touch

Want to learn more?

Seen a specific case study or insight and want to learn more? Or thinking about your next project? Drop us a message!