Amazon S3
Set up monitoring of your raw files stored in AWS S3, including S3 Tables using Apache Iceberg
Reading S3 Files from Snowflake
Overview
Metaplane can monitor your S3 data by querying it through Snowflake's secure integration with AWS. This guide walks you through setting up the connection and querying your S3 data, using a storage integration to manage access credentials securely.
Prerequisites
Before beginning, ensure you have:
- Administrator access to both Snowflake and AWS
- The S3 bucket containing your data files
- Appropriate permissions to create IAM roles in AWS
- The necessary Snowflake privileges to create storage integrations and external tables
Connect Snowflake to AWS
Configure AWS IAM Role
First, create an IAM role that Snowflake can assume to access your S3 bucket. Full details can be found in Snowflake's documentation.
- Create a new IAM role in AWS with a trust relationship to Snowflake's account. You'll need to obtain Snowflake's AWS account ID and external ID after creating the storage integration in the next phase. For now, create a role with the following S3 permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::your-bucket-name"
    }
  ]
}
Create Snowflake Storage Integration
Create a storage integration object in Snowflake that securely stores the AWS credentials:
CREATE STORAGE INTEGRATION s3_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::001234567890:role/your-role-name'
  STORAGE_ALLOWED_LOCATIONS = ('s3://your-bucket-name/');
After creating the integration, retrieve the AWS IAM user ARN (STORAGE_AWS_IAM_USER_ARN) and external ID (STORAGE_AWS_EXTERNAL_ID) that Snowflake generated:
DESC STORAGE INTEGRATION s3_integration;
Return to AWS and update your IAM role's trust relationship with these values.
Create External Stage
With the integration in place, create an external stage that points to your S3 location:
CREATE STAGE my_s3_stage
  STORAGE_INTEGRATION = s3_integration
  URL = 's3://your-bucket-name/path/'
  FILE_FORMAT = (TYPE = CSV); -- Adjust format as needed
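To confirm that Snowflake can reach your bucket through the new stage, you can list the files it sees (a quick sanity check, assuming the stage name above):
LIST @my_s3_stage;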
Create External Table
Now you can create an external table that reads files directly from S3. Define a file format so the raw data is queryable in Snowflake:
CREATE FILE FORMAT my_csv_format
  TYPE = CSV
  SKIP_HEADER = 1
  FIELD_DELIMITER = ','
  TRIM_SPACE = TRUE;
CREATE EXTERNAL TABLE my_s3_table
  LOCATION = @my_s3_stage/your_file.csv
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
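Files added to the stage after the table is created won't appear until the table metadata is refreshed (unless you configure auto-refresh). Because no columns are defined above, each row is exposed through Snowflake's VALUE variant column; a minimal sanity check, assuming the names above (the c1/c2 positions are illustrative and depend on your file layout), might look like:
ALTER EXTERNAL TABLE my_s3_table REFRESH;

SELECT
  value:c1::STRING AS first_column,
  value:c2::STRING AS second_column
FROM my_s3_table
LIMIT 10;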
Grant Metaplane access to query External Tables
Metaplane's standard permission set already covers newly created external tables, so no additional grants are needed. The relevant permissions are:
GRANT SELECT ON ALL EXTERNAL TABLES IN SCHEMA [your_schema] TO ROLE [your_role];
GRANT SELECT ON FUTURE EXTERNAL TABLES IN SCHEMA [your_schema] TO ROLE [your_role];
When Metaplane next syncs with Snowflake, you'll find your external tables in your entity tree. Verify the grants worked by running a SELECT on the external table within Snowflake as the Metaplane role.
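For example, a minimal check assuming METAPLANE_ROLE is the role your Metaplane connection uses and the table created earlier:
USE ROLE METAPLANE_ROLE;
SELECT COUNT(*) FROM my_s3_table;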
Reading S3 Files from BigQuery Using BigQuery Omni
Overview
Metaplane can monitor your S3 data by querying it in place through BigQuery Omni, without moving it to Google Cloud Storage. This guide walks you through setting up the connection and querying your S3 data.
Prerequisites
- An AWS account with administrator access to create and modify IAM roles
- An S3 bucket containing your data
- BigQuery Omni enabled in your Google Cloud project
- Multi-cloud billing enabled for your BigQuery organization
Connect BigQuery to AWS
Configure AWS IAM Role
The AWS IAM role requires specific permissions and trust relationships to work with BigQuery Omni:
- Required IAM Role Permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
- Trust Relationship Configuration:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::GOOGLE_CLOUD_PROJECT_NUMBER:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "EXTERNAL_ID_FROM_BIGQUERY"
        }
      }
    }
  ]
}
Notes on IAM Configuration:
- The GOOGLE_CLOUD_PROJECT_NUMBER will be provided by BigQuery Omni during setup
- The EXTERNAL_ID is generated when creating the BigQuery connection
Enable BigQuery Omni
- Visit the BigQuery Omni page in Google Cloud Console
- Select AWS as your cloud provider
- Click "Enable BigQuery Omni"
- Choose your preferred AWS region
- Note the Google Cloud project number provided during setup
Create and Configure Connection
- Create the AWS connection:
CREATE CONNECTION `your-project.your-region.aws-connection`
  OPTIONS(
    connection_type = 'AWS',
    cloud_region = 'aws-region'
  );
- After creation, BigQuery provides:
- AWS CloudFormation template parameters
- External ID for the trust relationship
- Required IAM permissions
- Deploy the AWS CloudFormation stack:
- Use template parameters from BigQuery
- Verify the created IAM role has the correct permissions
- Ensure the trust relationship is properly configured
- Update the BigQuery connection with the role ARN:
ALTER CONNECTION `your-project.your-region.aws-connection`
  SET OPTIONS(
    aws_role = 'YOUR_AWS_ROLE_ARN'
  );
Create External Dataset
With the connection established, create a dataset referencing your S3 data:
CREATE SCHEMA `your-project.your-external-dataset`
  OPTIONS(
    location = "aws-region",
    connection = "your-project.your-region.aws-connection"
  );
Create External Table
Now you can create an external table that reads files directly from S3, referencing the connection created earlier:
CREATE EXTERNAL TABLE `your-project.your-external-dataset.your_table`
(
  column1 STRING,
  column2 INT64,
  column3 TIMESTAMP
)
WITH CONNECTION `your-project.your-region.aws-connection`
OPTIONS(
  format = 'CSV',
  uris = ['s3://your-bucket/path/*.csv']
);
Query Your Data
Now you can query your S3 data. Run something like the query below as the Metaplane user to verify that it has permission:
SELECT COUNT(*)
FROM `your-project.your-external-dataset.your_table`;