Amazon S3

Set up monitoring of your raw files stored in AWS S3, including S3 tables that use Apache Iceberg.

Reading S3 Files from Snowflake

Overview

Metaplane can monitor your S3 data by querying it through Snowflake's secure integration with AWS. This guide walks you through setting up the connection and querying your S3 data, using a storage integration to manage access credentials securely.

Prerequisites

Before beginning, ensure you have:

  • Administrator access to both Snowflake and AWS
  • The S3 bucket containing your data files
  • Appropriate permissions to create IAM roles in AWS
  • The necessary Snowflake privileges to create storage integrations and external tables

Connect Snowflake to AWS

Configure AWS IAM Role

First, create an IAM role that Snowflake can assume to access your S3 bucket. Full details can be found in Snowflake's documentation.

  1. Create a new IAM role in AWS with a trust relationship to Snowflake's account. You'll need to obtain Snowflake's AWS account ID and external ID after creating the storage integration in the next phase. For now, create a role with the following S3 permissions:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::your-bucket-name/*"
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::your-bucket-name"
        }
    ]
}
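
If you script this step, a minimal AWS CLI sketch might look like the following, assuming the policy above is saved as s3-access-policy.json and a placeholder trust policy (updated in a later step with Snowflake's values) is saved as trust-policy.json; the role and policy names are illustrative:

# Create the role with a placeholder trust policy; you'll update it
# after creating the Snowflake storage integration
aws iam create-role \
  --role-name snowflake-s3-role \
  --assume-role-policy-document file://trust-policy.json

# Attach the S3 permissions above as an inline policy
aws iam put-role-policy \
  --role-name snowflake-s3-role \
  --policy-name snowflake-s3-access \
  --policy-document file://s3-access-policy.json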

Create Snowflake Storage Integration

Create a storage integration object in Snowflake that securely stores the AWS credentials:

CREATE STORAGE INTEGRATION s3_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = S3
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::001234567890:role/your-role-name'
  STORAGE_ALLOWED_LOCATIONS = ('s3://your-bucket-name/');

After creating the integration, retrieve the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values that Snowflake generated:

DESC STORAGE INTEGRATION s3_integration;

Return to AWS and update your IAM role's trust relationship with these values.
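
For reference, the updated trust policy on the IAM role typically looks like the following, substituting the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values returned by the DESC command:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "STORAGE_AWS_IAM_USER_ARN"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "STORAGE_AWS_EXTERNAL_ID"
                }
            }
        }
    ]
}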

Create External Stage

With the integration in place, create an external stage that points to your S3 location:

CREATE STAGE my_s3_stage
  STORAGE_INTEGRATION = s3_integration
  URL = 's3://your-bucket-name/path/'
  FILE_FORMAT = (TYPE = CSV);  -- Adjust format as needed
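
To confirm the stage can reach your bucket before moving on, list its contents:

LIST @my_s3_stage;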

Create External Table

Now you can create an external table that reads files directly from S3. First, define a file format so Snowflake can parse the raw data, then create the table on top of the stage:

CREATE FILE FORMAT my_csv_format
  TYPE = CSV
  SKIP_HEADER = 1
  FIELD_DELIMITER = ','
  TRIM_SPACE = TRUE;

CREATE EXTERNAL TABLE my_s3_table
  LOCATION = @my_s3_stage
  PATTERN = '.*your_file[.]csv'
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
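
Because this example table defines no explicit columns, Snowflake exposes each row through the VALUE variant column, with CSV fields keyed positionally (c1, c2, ...). A quick sanity check might look like:

SELECT value:c1::string AS first_column
FROM my_s3_table
LIMIT 10;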

Grant Metaplane access to query External Tables

Metaplane's standard permissions cover newly created external tables, so no additional permissions need to be granted. The permissions controlling this are:

grant SELECT on all external tables in schema [your_schema] to [your_role];  
grant SELECT on future external tables in schema [your_schema] to [your_role];

When Metaplane next syncs with Snowflake, your external tables will appear in your entity tree. Verify everything worked by running a SELECT on the external table within Snowflake as the Metaplane role.
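
For example, assuming the Metaplane role in your account is named METAPLANE_ROLE (substitute your actual role name):

USE ROLE METAPLANE_ROLE;
SELECT COUNT(*) FROM my_s3_table;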


Reading S3 Files from BigQuery Using BigQuery Omni

Overview

Metaplane can monitor your S3 data by querying it through BigQuery Omni, without moving it into Google Cloud Storage. This guide walks you through setting up the connection and querying your S3 data.

Prerequisites

  • An AWS account with administrator access to create and modify IAM roles
  • An S3 bucket containing your data
  • BigQuery Omni enabled in your Google Cloud project
  • Multi-cloud billing enabled for your BigQuery organization

Connect BigQuery to AWS

Configure AWS IAM Role

The AWS IAM role requires specific permissions and trust relationships to work with BigQuery Omni:

  1. Required IAM Role Permissions:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket-name",
                "arn:aws:s3:::your-bucket-name/*"
            ]
        }
    ]
}
  2. Trust Relationship Configuration:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "accounts.google.com"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "accounts.google.com:sub": "BIGQUERY_GOOGLE_IDENTITY"
                }
            }
        }
    ]
}

Notes on IAM Configuration:

  • BigQuery Omni assumes the role through web identity federation with Google's accounts.google.com identity provider, not through an AWS-account principal
  • The BIGQUERY_GOOGLE_IDENTITY value is generated when you create the BigQuery connection, so you will fill in the trust policy after that step
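
If you script this in the AWS CLI, a rough sketch might look like the following, assuming the two policy documents above are saved as omni-s3-policy.json and omni-trust-policy.json; the role and policy names are illustrative:

# Create the S3 permissions policy
aws iam create-policy \
  --policy-name bigquery-omni-s3-access \
  --policy-document file://omni-s3-policy.json

# Create the role BigQuery Omni will assume
aws iam create-role \
  --role-name bigquery-omni-role \
  --assume-role-policy-document file://omni-trust-policy.json

# Attach the permissions policy to the role
aws iam attach-role-policy \
  --role-name bigquery-omni-role \
  --policy-arn arn:aws:iam::YOUR_AWS_ACCOUNT_ID:policy/bigquery-omni-s3-access

# After creating the BigQuery connection, re-apply the trust policy
# with the real BIGQUERY_GOOGLE_IDENTITY filled in
aws iam update-assume-role-policy \
  --role-name bigquery-omni-role \
  --policy-document file://omni-trust-policy.json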

Enable BigQuery Omni

  1. Visit the BigQuery Omni page in Google Cloud Console
  2. Select AWS as your cloud provider
  3. Click "Enable BigQuery Omni"
  4. Choose your preferred AWS region
  5. Note the Google Cloud project number provided during setup

Create and Configure Connection

  1. Create the AWS connection with the bq CLI, passing the ARN of the IAM role created earlier (BigQuery connections are created through the bq tool or the Cloud Console rather than SQL DDL):
bq mk --connection \
  --connection_type='AWS' \
  --iam_role_id='YOUR_AWS_ROLE_ARN' \
  --location='aws-region' \
  aws-connection
  2. After creation, BigQuery provides the connection's BigQuery Google identity (see the bq show example after this list)
  3. Finish configuring the AWS IAM role:

    • Add the BigQuery Google identity as the accounts.google.com:sub value in the role's trust policy
    • Verify the role has the S3 permissions shown earlier
  4. To point the connection at a different IAM role later, update it in place:

bq update --connection \
  --iam_role_id='YOUR_AWS_ROLE_ARN' \
  your-project.aws-region.aws-connection
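
To confirm the connection's configuration, including the BigQuery Google identity referenced in the trust policy, inspect it with bq show (the connection path assumes the names used above):

bq show --connection your-project.aws-region.aws-connection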

Create External Dataset

With the connection established, create a dataset in the same AWS region as your connection; it will hold the external tables that reference your S3 data:

CREATE SCHEMA `your-project.your-external-dataset`
OPTIONS(
  location = "aws-region"
);
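
To double-check that the dataset landed in the correct AWS region, you can inspect it with the bq CLI (names assume the examples above):

bq show --format=prettyjson your-project:your-external-dataset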

Create External Table

Now you can create an external table that reads files directly from S3, referencing the connection you created earlier:

CREATE EXTERNAL TABLE `your-project.your-external-dataset.your_table`
(
  column1 STRING,
  column2 INT64,
  column3 TIMESTAMP
)
WITH CONNECTION `your-project.aws-region.aws-connection`
OPTIONS(
  format = 'CSV',
  uris = ['s3://your-bucket/path/*.csv']
);

Query Your Data

Now you can query your S3 data. To verify that the Metaplane user has the necessary permissions, run something like the following as that user:

SELECT COUNT(*)
FROM `your-project.your-external-dataset.your_table`;