Using TypeScript AWS CDK to Backup Snowflake Data in S3
Introduction to AWS CDK and Snowflake
If you're like me, you've probably heard the buzz around terms like AWS CDK and Snowflake and wondered what all the hype is about. Well, wonder no more! Let's dive into what makes these tools so powerful and why they're making waves in the tech world.
First off, AWS Cloud Development Kit (CDK) is an open-source software development framework by AWS. The AWS CDK lets you define cloud infrastructure in code and provision it using AWS CloudFormation. Imagine building your cloud infrastructure with the same elegance and power that you use to build your applications. Instead of clicking through a web console, you can write straightforward code in programming languages like Python, JavaScript, or TypeScript to define AWS resources and create complex architectures. It's a bit like having a magic wand, but for cloud development!
On the other hand, Snowflake is a cloud-based data warehousing platform. This isn't just any data warehouse; it's designed for speed and scale. It operates entirely on cloud infrastructure, meaning it can store and analyze vast amounts of data quickly and efficiently. Snowflake’s architecture – which separates storage and compute – allows you to scale these resources independently, making it highly cost-effective. Think of Snowflake as your go-to platform for all things data, from storage to analytics. It’s perfect for data-driven applications and enterprise-level data management.
Now, why are these tools important? Well, data is the backbone of modern business, and ensuring data security through robust backup plans is crucial. AWS CDK and Snowflake make it easier to manage, deploy, and back up your data infrastructure. With CDK, you can automate the deployment of data management resources, and with Snowflake, you can scale and manage data effortlessly.
What to Expect
In this blog, I'll cover how to get started with AWS CDK, including setting up your development environment and writing your first stack. We’ll also explore Snowflake, showcasing how to load data, run queries, and ensure data backup. Whether you're a developer just getting started or a seasoned professional looking to upgrade your skills, there's something here for you.
Stay tuned, and let's make cloud development and data management a breeze! And don’t worry, if I can do this, anyone can – I once forgot to save a document for hours, and we know how that ends. Trust me, you’re in good company here.
Setting Up Your Environment
Getting your development environment set up can feel like a daunting task, but trust me, it's simpler than it seems. Let's walk through everything you need to get started with AWS CDK and Snowflake. By the end of this, you'll feel like a wizard who has just crafted their first spell. Ready? Let's go!
Installing TypeScript
First things first, we need to install TypeScript. If you haven't used TypeScript before, don't worry—it's just a superset of JavaScript that adds static types. It's particularly useful for catching errors early.
Steps to Install TypeScript:
- Install Node.js: Before you can install TypeScript, you need Node.js. You can download it from the official Node.js website (https://nodejs.org).
- Open a terminal: Once Node.js is installed, open a terminal or command prompt.
- Run the following command:
npm install -g typescript
Boom! You've got TypeScript installed. To confirm, run tsc --version to check the installed version.
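If you're curious what those static types actually buy you, here's a tiny, made-up example; the compiler flags the mismatch before the code ever runs:

// TypeScript catches type mismatches at compile time
function backupName(prefix: string, day: number): string {
  return `${prefix}-${day}`;
}

backupName('snowflake-backup', 7);      // fine
// backupName('snowflake-backup', 'x'); // compile error: 'x' is not assignable to type 'number'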
Installing AWS CDK
Next, let's install AWS CDK. This will allow us to define our cloud infrastructure using TypeScript.
Steps to Install AWS CDK:
- Open your terminal.
- Run the following command:
npm install -g aws-cdk
You can confirm the installation by running cdk --version. Now you have the AWS CDK installed and ready to go.
Setting Up AWS Access
Now that you have AWS CDK installed, you need to configure your AWS CLI to allow CDK to interact with your AWS account.
Steps to Configure AWS CLI:
- Install AWS CLI: If you haven't already, you can download and install the AWS CLI from https://aws.amazon.com/cli/.
- Open a terminal.
- Run the following command to configure your AWS credentials:
aws configure
You'll be prompted to enter your AWS Access Key ID, Secret Access Key, default region, and output format. Fill these details in carefully. Voila! Your AWS CLI is now configured to interact with AWS services. You can double-check by running aws sts get-caller-identity, which should print the account and identity your credentials belong to.
Setting Up Snowflake Account
Lastly, let's set up access to your Snowflake account. This allows you to load data and run queries directly from your development environment.
Steps to Set Up Snowflake:
- Sign up for Snowflake: If you don't have a Snowflake account, you can sign up for a trial at https://signup.snowflake.com/.
- Create a Role: Log in to your Snowflake account and create a role for programmatic access (in Snowsight under Admin > Users & Roles, or with a CREATE ROLE statement in a worksheet).
- Set Up Credentials: Decide how your code will authenticate. The examples in this post use a username and password; for production, key-pair authentication is the safer option.
- Configure a Connection: Use a Python library like Snowflake Connector to establish a connection. You can install it by running:
pip install snowflake-connector-python
Here’s a basic setup example:
import snowflake.connector
conn = snowflake.connector.connect(
user='YOUR_USER',
password='YOUR_PASSWORD',
account='YOUR_ACCOUNT'
)
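Since the rest of this post lives in TypeScript, here's a minimal equivalent sketch using the Node.js snowflake-sdk driver (install it with npm install snowflake-sdk; the credentials are placeholders):

import * as snowflake from 'snowflake-sdk';

// Minimal connection sketch using placeholder credentials
const connection = snowflake.createConnection({
  account: 'YOUR_ACCOUNT',
  username: 'YOUR_USER',
  password: 'YOUR_PASSWORD'
});

connection.connect((err, conn) => {
  if (err) {
    console.error('Unable to connect: ' + err.message);
  } else {
    console.log('Successfully connected, connection id: ' + conn.getId());
  }
});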
And there you have it! You've set up your development environment.
By now, we've covered a lot, but trust me, this is the toughest part. Everything from here on gets more exciting! And remember, if I can do this, you absolutely can too. I once tried to copy-paste commands and ended up pasting them in a chat with my manager. Let’s just say, I’m no stranger to these little mishaps.
Creating an S3 Bucket with AWS CDK
So you've got your environment set up and you're ready to define some cloud infrastructure. Let's create an S3 bucket using AWS CDK. It's simpler than you might think, and I’ll guide you through the whole process step by step.
First things first, let’s understand what we are going to do. An S3 bucket is a foundational AWS service used for storing data. Whether it’s application logs, backups, or static files for a website, S3 can handle it. Now, let’s create one using AWS CDK with TypeScript!
Initialize a CDK Project
Before we dive into code, we need to initialize a new CDK project. Open your terminal and navigate to the directory where you want your project to live. Then run:
cdk init app --language typescript
This command sets up a boilerplate CDK application with the necessary directory structure and configuration.
Define the S3 Bucket in TypeScript
Next, navigate to the lib directory and open the TypeScript file where we will define our S3 bucket (usually named <project-name>-stack.ts). Add the following code:
import * as cdk from '@aws-cdk/core';
import * as s3 from '@aws-cdk/aws-s3';
export class MyFirstBucketStack extends cdk.Stack {
constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
super(scope, id, props);
// Define the S3 bucket
const bucket = new s3.Bucket(this, 'MyFirstBucket', {
versioned: true,
removalPolicy: cdk.RemovalPolicy.DESTROY, // NOT recommended for production code
autoDeleteObjects: true // NOT recommended for production code
});
}
}
Explanation of the Code
- Import Modules: We import necessary modules from AWS CDK and AWS S3 libraries.
- Class Definition: MyFirstBucketStack extends cdk.Stack, which contains the resources for the stack.
- Constructor: Initialize the stack by calling super() and passing in the scope, id, and optional properties.
- Define Bucket: Using the s3.Bucket class, we define our bucket configuration inside the constructor.
  - versioned: true: Enables versioning to keep multiple variants of an object.
  - removalPolicy: cdk.RemovalPolicy.DESTROY: Automatically deletes the bucket when the stack is destroyed (not recommended for production).
  - autoDeleteObjects: true: Automatically deletes all objects stored in the bucket when the bucket is removed (also not recommended for production).
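One more thing worth checking before deploying: cdk init also generated an entry point under the bin directory that instantiates your stack class. If you renamed the class to MyFirstBucketStack as above, make sure that file matches; it ends up looking roughly like this (the exact file name mirrors your project name):

import * as cdk from '@aws-cdk/core';
import { MyFirstBucketStack } from '../lib/my-first-bucket-stack';

// Instantiate the CDK app and register our stack with it
const app = new cdk.App();
new MyFirstBucketStack(app, 'MyFirstBucketStack');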
Deploying the Stack
Now that our stack is defined, we can deploy it. If this is your first CDK deployment into this AWS account and region, run cdk bootstrap once to set up the resources CDK needs. Then, in your terminal, run:
cdk deploy
This command synthesizes and deploys the stack defined in our code. After a few moments, your S3 bucket should be created and visible in the AWS Management Console.
Worth Noting
Look, I always suggest being cautious with cdk.RemovalPolicy.DESTROY and autoDeleteObjects. You don't want to accidentally delete critical data, especially in production environments. But for development, it's incredibly convenient.
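For a bucket that will actually hold backups, a safer sketch keeps the data around even if the stack is torn down:

// Production-leaning variant: retain the bucket and its objects on stack deletion
const backupBucket = new s3.Bucket(this, 'BackupBucket', {
  versioned: true,
  removalPolicy: cdk.RemovalPolicy.RETAIN
});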
If you’ve followed these steps, you should have a fully functional S3 bucket defined through code. It’s really as simple as that! Remember, CDK is powerful yet straightforward. I once spent an hour trying to debug a typo. Don’t worry; it happens to the best of us.
Now, let’s move on to exploring Snowflake and see how these two tools can make data management less of a headache.
Connecting Snowflake to AWS S3
By now, you've got a handle on creating an S3 bucket using AWS CDK. Awesome! Let's take things a step further and connect Snowflake to our S3 bucket. This connection will allow us to load data from AWS S3 into Snowflake, making our data management workflow even more seamless.
Setting Up External Stages in Snowflake
Before we jump into the code, it's important to understand what an external stage is in Snowflake. In simple terms, an external stage points to a location in an external storage system (like AWS S3) from where Snowflake can load data.
Steps to Create an External Stage:
- Navigate to Snowflake: Log in to your Snowflake account.
- Use a Database: Make sure you have a database where you will create the stage. If not, you can create one using the following SQL command:
CREATE DATABASE my_database;
USE DATABASE my_database;
- Create Storage Integration: First, we need to create an integration object that allows Snowflake to access our S3 bucket.
CREATE STORAGE INTEGRATION my_s3_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<accountid>:role/<rolename>'
STORAGE_ALLOWED_LOCATIONS = ('s3://<your-bucket-name>/');
Replace <accountid>, <rolename>, and <your-bucket-name> with your AWS account-specific details. After creating the integration, run DESC INTEGRATION my_s3_integration and note the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values; you'll need them when you set up the IAM role's trust relationship below.
- Create External Stage: Now, we create an external stage pointing to our S3 bucket using the integration object.
CREATE STAGE my_s3_stage
URL = 's3://<your-bucket-name>/<optional-path>'
STORAGE_INTEGRATION = my_s3_integration;
Again, replace <your-bucket-name> and <optional-path> as needed.
Granting IAM Role Permissions
For Snowflake to access your S3 bucket, the IAM role you referenced in STORAGE_AWS_ROLE_ARN needs permissions on that bucket. Attach a policy along these lines:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::<your-bucket>/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::<your-bucket>"
    }
  ]
}
Replace <your-bucket> with your bucket's name. This policy lets the role Snowflake assumes read and write objects in your bucket and list its contents. Separately, the role's trust relationship (not shown here) must allow Snowflake to assume it: add the STORAGE_AWS_IAM_USER_ARN from the DESC INTEGRATION output as the trusted principal, with the STORAGE_AWS_EXTERNAL_ID as the external ID condition.
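If you'd rather manage that role with CDK as well, here's a rough sketch. The ARN and external ID are placeholders for the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values you get from DESC INTEGRATION, and the construct names are purely illustrative:

import * as cdk from '@aws-cdk/core';
import * as iam from '@aws-cdk/aws-iam';
import * as s3 from '@aws-cdk/aws-s3';

export class SnowflakeAccessStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Placeholders: copy these from the DESC INTEGRATION my_s3_integration output
    const snowflakeIamUserArn = 'arn:aws:iam::<snowflake-account-id>:user/<snowflake-user>';
    const snowflakeExternalId = '<storage-aws-external-id>';

    const bucket = s3.Bucket.fromBucketName(this, 'BackupBucket', '<your-bucket-name>');

    // Role that Snowflake assumes through the storage integration
    const snowflakeRole = new iam.Role(this, 'SnowflakeAccessRole', {
      assumedBy: new iam.ArnPrincipal(snowflakeIamUserArn),
      externalIds: [snowflakeExternalId]
    });

    // Read/write access to the backup bucket and its objects
    bucket.grantReadWrite(snowflakeRole);
  }
}

Because the storage integration references the role ARN and the role in turn trusts values produced by the integration, in practice you create the role first with a placeholder trust policy and update it afterwards; the Snowflake documentation walks through that dance.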
Loading Data into Snowflake
Now that your external stage is set up, you can use it to load data into Snowflake tables.
Example of Loading Data
For demonstration, let’s assume you have a CSV file in your S3 bucket. Here’s how to load it into a Snowflake table:
- Create a Table: Create a table in Snowflake to hold your data.
CREATE OR REPLACE TABLE my_table (
id INT,
name STRING,
age INT
);
- Load Data from Stage: Use the COPY INTO command to load data from the external stage into the table.
COPY INTO my_table
FROM @my_s3_stage/my_data.csv
FILE_FORMAT = (type = 'CSV', field_delimiter = ',', skip_header = 1);
And just like that, your data is loaded into Snowflake!
Worth Noting
Remember to periodically review the access permissions for your IAM roles and storage integrations to ensure security best practices. The connection between Snowflake and AWS S3 is incredibly powerful, opening up numerous possibilities for streamlined data operations.
If you've followed along, congratulations—you've just connected Snowflake to AWS S3! With this setup, your data pipeline is more flexible and scalable than ever.
Keep going, and the magic of cloud development and data management will continue to unfold. And trust me, if I can juggle all these configurations without pulling my hair out, so can you. (Although I may have a few extra grey hairs now, but that's a story for another time!)
Automating the Backup Process
So far, we've explored how to create S3 buckets, use Snowflake, and set up stages. Now, let's take it a step further by automating the backup process. Automating backups ensures that our data is regularly saved and safe, without us having to lift a finger. We'll achieve this by creating a scheduled AWS Lambda function to export data from Snowflake to S3 using AWS CDK. Trust me, this will save you countless hours in the long run, and quite possibly, a few grey hairs too.
Setting Up the Project
First, ensure you're in your project's root directory where we initialized our CDK project. We'll need to install the @aws-cdk/aws-lambda, @aws-cdk/aws-events, and @aws-cdk/aws-events-targets packages in our CDK project.
npm install @aws-cdk/aws-lambda @aws-cdk/aws-events-targets @aws-cdk/aws-events
Alright, let's get coding!
Writing the Lambda Function
Step 1: Create a Lambda Function Handler
Create a new directory named lambda in your project root, and inside it create a file named index.js. Because the Lambda code is packaged straight from this directory, install the Snowflake driver there as well (run npm install snowflake-sdk inside the lambda directory) so that its node_modules folder is included in the deployment asset.
const snowflake = require('snowflake-sdk');
// Snowflake connection parameters (read from the Lambda environment variables)
const connection = snowflake.createConnection({
account: process.env.SNOWFLAKE_ACCOUNT,
username: process.env.SNOWFLAKE_USER,
password: process.env.SNOWFLAKE_PASSWORD
});
exports.handler = async (event) => {
try {
await new Promise((resolve, reject) => {
connection.connect((err, conn) => {
if (err) {
reject(err);
} else {
resolve(conn);
}
});
});
const sql = 'COPY INTO @my_s3_stage FROM my_table FILE_FORMAT = (type = csv);';
await new Promise((resolve, reject) => {
connection.execute({
sqlText: sql,
complete: (err, stmt) => {
if (err) {
reject(err);
} else {
resolve(stmt);
}
}
});
});
return { statusCode: 200, body: 'Backup completed successfully' };
} catch (err) {
return { statusCode: 500, body: JSON.stringify(err) };
} finally {
connection.destroy();
}
};
This Node.js code connects to Snowflake and runs a COPY INTO command to export data to our S3 stage. One thing to watch: for the unqualified names in that statement to resolve, and for the COPY to have compute to run on, the connection needs a default warehouse, database, and schema. Either pass warehouse, database, and schema options to createConnection or fully qualify my_table and @my_s3_stage in the SQL.
Step 2: Define Lambda Function in CDK
Open your TypeScript stack file (usually named <project-name>-stack.ts). We're going to add code to define our Lambda function.
import * as cdk from '@aws-cdk/core';
import * as lambda from '@aws-cdk/aws-lambda';
import * as events from '@aws-cdk/aws-events';
import * as targets from '@aws-cdk/aws-events-targets';
import * as path from 'path';
export class BackupStack extends cdk.Stack {
constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
super(scope, id, props);
const backupLambda = new lambda.Function(this, 'BackupLambda', {
runtime: lambda.Runtime.NODEJS_12_X,
code: lambda.Code.fromAsset(path.join(__dirname, '/../lambda')),
handler: 'index.handler',
environment: {
SNOWFLAKE_ACCOUNT: process.env.SNOWFLAKE_ACCOUNT!,
SNOWFLAKE_USER: process.env.SNOWFLAKE_USER!,
SNOWFLAKE_PASSWORD: process.env.SNOWFLAKE_PASSWORD!
}
});
const rule = new events.Rule(this, 'ScheduleRule', {
schedule: events.Schedule.rate(cdk.Duration.days(1)),
});
rule.addTarget(new targets.LambdaFunction(backupLambda));
}
}
Explanation of the Code
- Import Necessary Modules: We import necessary libraries for our Lambda function, AWS EventBridge for scheduling, and CDK core for stack definitions.
- Define Lambda Function: We define a new Lambda function using the lambda.Function construct.
  - runtime: Specifies the runtime environment as Node.js 12.x.
  - code: Points to the directory where our Lambda function code is located.
  - handler: Sets the handler to index.handler, which matches our JavaScript file and exported function.
  - environment: Passes our Snowflake credentials as environment variables.
- Schedule the Lambda Function: We create a new EventBridge rule to trigger our Lambda function every day (cdk.Duration.days(1)).
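If you'd rather run the backup at a specific time of day instead of on a rolling 24-hour interval, a cron-style schedule works too (the time here is UTC and purely an example):

// Run the backup every day at 03:00 UTC
const nightlyRule = new events.Rule(this, 'NightlyScheduleRule', {
  schedule: events.Schedule.cron({ minute: '0', hour: '3' })
});
nightlyRule.addTarget(new targets.LambdaFunction(backupLambda));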
Deploying the Stack
Once you've defined your stack, deploy it. Because the stack reads SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, and SNOWFLAKE_PASSWORD from process.env at synth time, export those variables in your shell first, then run:
cdk deploy
Wait a few moments, and your stack will now include a scheduled Lambda function for automated backups.
Worth Noting
Remember to keep your Snowflake credentials secure. AWS Secrets Manager is a great way to manage such sensitive information.
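As a sketch of what that could look like on the CDK side, assuming you've already created a secret (the name snowflake/backup-credentials is just an example):

import * as secretsmanager from '@aws-cdk/aws-secretsmanager';

// Inside the BackupStack constructor:
const snowflakeSecret = secretsmanager.Secret.fromSecretNameV2(
  this, 'SnowflakeSecret', 'snowflake/backup-credentials'
);
snowflakeSecret.grantRead(backupLambda);
backupLambda.addEnvironment('SNOWFLAKE_SECRET_NAME', snowflakeSecret.secretName);

Inside the handler, you would then fetch the secret with the Secrets Manager GetSecretValue API at startup instead of reading the password from a plain environment variable.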
Congratulations! You've just set up an automated backup process. Look at you, streamlining operations like a pro! And if you think setting up backup automation is challenging, wait until you try to remember if you locked your front door this morning. At least with automation, you won't lose any sleep.
There you have it! An automated backup process that runs without you having to think about it. How awesome is that?
Testing and Validation
Now that we've set up the automated backup process, it's crucial to test and validate it to ensure everything is working smoothly. After all, what good is automation if you can't trust it? Let's break down how you can test and validate your backup process, catch common issues, and ensure data consistency. And hey, if you've come this far, you're already doing great. Let's get even better together!
Testing the Backup Process
Step 1: Triggering the Lambda Function Manually
Before waiting for the scheduled event, you can trigger the Lambda function manually to confirm it works as expected.
- Navigate to Lambda Console: Go to the AWS Management Console, navigate to the Lambda service, and find your BackupLambda function.
- Trigger the Function: Click on 'Test' and create a new test event. Name it anything you like and keep the default empty JSON event ({}). Run the test by clicking 'Test' again.
You should see logs indicating the Lambda function execution. Look for a statusCode: 200 response in the log output. If you see 500, it means an error occurred, and you should check the error message for more details.
Step 2: Check Snowflake Stage
Once the Lambda function runs successfully, you should have data in your Snowflake external stage. You can confirm this by running a query in Snowflake.
LIST @my_s3_stage;
This command lists all files in your specified stage. If you see the expected files, your backup process is working!
Validating Data Consistency
Step 1: Data Row Count Comparison
Before and after the backup process, you can compare row counts between the Snowflake table and the data in S3 to ensure consistency.
- Row Count in Snowflake Table:
SELECT COUNT(*) FROM my_table;
- Row Count in S3 Backup: Your data may be in CSV format. Download the file and count the rows, excluding the header (see the small script after this list).
- Compare Counts: Ensure the counts match.
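For the S3 side, a quick TypeScript/Node sketch can count the data rows once you've downloaded the file locally (my_data.csv is just the assumed file name):

import * as fs from 'fs';

// Count data rows in the exported CSV, excluding the header line
const lines = fs.readFileSync('my_data.csv', 'utf8').trim().split('\n');
console.log(`Rows in backup: ${lines.length - 1}`);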
Step 2: Data Hashing
For a more robust validation, you can use data hashing to compare the hashes of your original table and your backup.
- Hash in Snowflake:
SELECT HASH_AGG(MD5(CONCAT(id, name, age))) FROM my_table;
- Hash in Backup: Calculate the hash of your backup file using a script. For example, with Python:
import hashlib
hash_md5 = hashlib.md5()
with open('my_data.csv', 'rb') as f:
for chunk in iter(lambda: f.read(4096), b""):
hash_md5.update(chunk)
print(hash_md5.hexdigest())
Check for Matching Hashes
Ensure that the hash values match. Keep in mind that the two hashes only agree if both sides hash exactly the same bytes in the same order, so you may need to normalize things like column order, delimiters, and row ordering first; a simpler alternative is to load the backup file into a temporary table and compare HASH_AGG results inside Snowflake. If the values match, you can be confident in your data's integrity.
Troubleshooting Common Issues
Issue 1: IAM Role Permissions
One frequent problem is insufficient IAM role permissions, causing the Lambda function to fail.
Solution: Double-check your IAM roles and policies. Ensure the roles have the necessary S3 permissions and that their trust relationships allow the right parties to assume them.
Issue 2: Snowflake Connection Errors
Connection issues can arise if your Snowflake credentials or account details are incorrect.
Solution: Test your Snowflake connection credentials independently. Use the Snowflake connector directly from your local environment to ensure the details are correct.
Issue 3: Data Format Errors
Errors in loading data often result from mismatched data formats, especially if the export and import file settings don't align.
Solution: Ensure the data format specifications match between your Snowflake COPY INTO command and the actual data format in S3.
Worth Noting
Automation makes life easier, but it demands thorough testing. A successful test builds confidence in your backup process. Remember, consistent validation is the key to trust. If you encounter any issues, don't get disheartened. Debugging is just part of the journey.
Congratulations! You now know how to test and validate your backup process. With each step, you're ensuring your automated system is robust and reliable. If I can manage to get these tests running (and trust me, I've had my fair share of "why isn't this working" moments), you can too. Keep pushing forward, and soon, you'll have a seamless, worry-free backup system.
Keep an eye out for more tips and tricks, and happy coding!
Conclusion
We've come a long way in this journey to integrate AWS CDK, Snowflake, and automate our data backup processes. Let's quickly recap the steps we've taken and highlight the benefits of using these powerful tools together.
Steps We Took
To make this whole setup work, here's what we did:
- Set Up Your Environment:
  - Installed TypeScript, AWS CDK, and configured the AWS CLI.
  - Set up a Snowflake account and configured access.
- Created an S3 Bucket with AWS CDK:
  - Initialized a CDK project and defined an S3 bucket in TypeScript.
  - Deployed the stack to create the bucket.
- Connected Snowflake to AWS S3:
  - Set up an external stage in Snowflake to point to the S3 bucket.
  - Configured IAM roles with the necessary permissions.
  - Loaded data from S3 into Snowflake tables.
- Automated the Backup Process:
  - Wrote a Lambda function to export data from Snowflake to S3 using scheduled events.
  - Deployed and scheduled the Lambda function using AWS CDK.
- Tested and Validated:
  - Manually triggered the Lambda function and validated data consistency.
  - Troubleshot common issues related to IAM roles, connection errors, and data formats.
Benefits of Using TypeScript and AWS CDK
- Infrastructure as Code: Using AWS CDK with TypeScript allows you to define and manage your cloud infrastructure programmatically, which is much easier than manual configuration.
- Scalability: Both AWS and Snowflake are designed to scale. By automating backups and data management, you can handle growing datasets effortlessly.
- Error Handling: TypeScript’s static types help catch errors early, making your codebase more reliable.
- Automation: Automating the data backup process ensures regular backups without manual intervention, thereby increasing reliability and reducing human error.
Encouragement
I encourage you to try this setup in your own projects. The combination of AWS CDK and Snowflake offers a powerful, flexible, and scalable solution for data management and backups. Plus, diving into this realm of automation and cloud-based solutions will bolster your skills and open up new possibilities in your tech journey.
Additional Resources
Here are some resources to help you dive deeper:
- AWS CDK Documentation: AWS CDK Developer Guide
- Snowflake Documentation: Snowflake External Stages
- TypeScript Handbook: TypeScript Documentation
- AWS Lambda Documentation: Lambda Functions
Thanks for sticking around and taking this journey with me. Remember, if I can juggle all these configurations and setups, so can you. Every step you take is a step towards mastering cloud development and data management. Happy coding!
TypeScript
AWS CDK
Snowflake
S3
data backup