Modern Data Stack on a Budget: Cost Optimization Strategies

Simor Consulting | 24 Jun, 2024 | 07 Mins read

Data stack costs scale with usage. Storage, compute, and commercial tools can consume budget quickly without proper management. Startups and mid-sized companies face this problem acutely.

This article covers practical cost optimization across storage, compute, and tools.

Understanding Modern Data Stack Costs

Before diving into optimization strategies, let’s break down where costs typically accumulate in a modern data stack:

  1. Data Storage: Cloud storage, data warehouses, data lakes
  2. Compute Resources: ETL/ELT processing, query execution, analytics processing
  3. Commercial Tools: BI platforms, data integration tools, governance solutions
  4. Personnel: Data engineers, analysts, and scientists
  5. Operational Overhead: Monitoring, maintenance, and troubleshooting

Let’s explore cost-saving approaches for each of these areas.
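As a first step, it helps to tally your own monthly spend per category to see which lever matters most. The sketch below uses hypothetical placeholder figures; substitute numbers from your actual invoices:

```python
# Hypothetical monthly spend per category -- replace with your own numbers
monthly_costs = {
    'storage': 1200,      # S3, warehouse storage, data lake
    'compute': 4500,      # ETL/ELT jobs, query execution
    'tools': 3800,        # BI, integration, governance licenses
    'personnel': 25000,   # Engineers, analysts, scientists
    'operations': 900,    # Monitoring, maintenance, troubleshooting
}

total = sum(monthly_costs.values())
for category, cost in sorted(monthly_costs.items(), key=lambda kv: -kv[1]):
    print(f"{category:>10}: ${cost:>7,} ({cost / total:6.1%})")
```

A breakdown like this usually shows that one or two categories dominate, which tells you where optimization effort pays off first.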

Data Storage Optimization

Data Tiering Strategies

Implement data tiering to store data according to access patterns:

# Example Terraform configuration for multi-tiered S3 storage
resource "aws_s3_bucket" "data_lake" {
  bucket = "company-data-lake"
}

resource "aws_s3_bucket_lifecycle_configuration" "data_tiering" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id = "hot-to-cold-transition"
    status = "Enabled"

    transition {
      days = 30
      storage_class = "STANDARD_IA"  # Infrequent Access tier
    }

    transition {
      days = 90
      storage_class = "GLACIER"  # Archive tier for rarely accessed data
    }

    expiration {
      days = 365  # Delete truly obsolete data
    }
  }
}

Data Compression and Partitioning

Use columnar formats and compression for analytics data:

# Example of writing partitioned Parquet files with PyArrow
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Create a PyArrow table with your data
data = {
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'timestamp': [1623456789, 1623456790, 1623456791, 1623456792],
    'value': [10.1, 20.2, 30.3, 40.4]
}
table = pa.Table.from_pydict(data)

# Derive a date column to partition on -- partitioning on the raw
# timestamp would create one tiny partition per distinct value
event_date = pc.strftime(
    table['timestamp'].cast(pa.timestamp('s')), format='%Y-%m-%d'
)
table = table.append_column('event_date', event_date)

# Write as a partitioned Parquet dataset (one directory per day)
pq.write_to_dataset(
    table,
    root_path='s3://your-bucket/events',
    partition_cols=['event_date'],
    compression='snappy'  # Balanced compression ratio and speed
)

Open-Source Data Warehousing

Consider open-source alternatives for smaller workloads:

# DuckDB is an embedded database -- no server process to run or pay for.
# Install with: pip install duckdb
import duckdb

# Open (or create) a warehouse file on local disk
con = duckdb.connect('warehouse.db')

# Query Parquet files directly -- no separate load step required
con.execute("""
    CREATE TABLE IF NOT EXISTS daily_events AS
    SELECT * FROM read_parquet('data/events/*.parquet')
""")

result = con.execute("""
    SELECT event_name, count(*) AS events
    FROM daily_events
    GROUP BY event_name
    ORDER BY events DESC
""").fetchall()
print(result)

Compute Resource Optimization

Right-Sizing Compute Resources

Dynamically scale compute resources based on actual usage patterns:

# Airflow DAG that uses Spark on-demand with automatic scaling
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator
from airflow.providers.amazon.aws.operators.emr import EmrTerminateJobFlowOperator
from airflow.providers.amazon.aws.sensors.emr import EmrJobFlowSensor

JOB_FLOW_OVERRIDES = {
    "Name": "Data Processing Cluster",
    "ReleaseLabel": "emr-6.4.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master node",
                "Market": "SPOT",  # Use spot instances for cost savings
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core nodes",
                "Market": "SPOT",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
                "AutoScalingPolicy": {
                    "Constraints": {
                        "MinCapacity": 2,
                        "MaxCapacity": 10
                    },
                    "Rules": [
                        # Auto-scaling rules here
                    ]
                }
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
    },
}

from datetime import datetime

with DAG(
    "data_processing_dag",
    start_date=datetime(2024, 1, 1),  # A DAG requires a start_date
    schedule_interval=None,  # Trigger on demand
    catchup=False,
) as dag:
    # Create EMR cluster
    create_emr_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id="aws_default",
    )

    # Add processing steps
    # ...

    # Wait for job completion
    wait_for_completion = EmrJobFlowSensor(
        task_id="wait_for_completion",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
        aws_conn_id="aws_default",
    )

    # Terminate cluster when complete (runs even if upstream tasks fail)
    terminate_emr_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
        aws_conn_id="aws_default",
        trigger_rule="all_done",  # Avoid orphaned clusters billing after a failure
    )

    create_emr_cluster >> wait_for_completion >> terminate_emr_cluster

Query Optimization

Optimize queries to reduce processing costs:

-- Instead of this expensive full scan
SELECT *
FROM large_events_table
WHERE event_date >= DATEADD(day, -90, CURRENT_DATE());

-- Use a query that prunes partitions and selects only needed columns
SELECT
    user_id,
    event_name,
    event_timestamp,
    event_properties
FROM large_events_table
WHERE event_date BETWEEN
        DATE_TRUNC('month', CURRENT_DATE() - INTERVAL '3 months')
        AND CURRENT_DATE()
  AND event_name IN ('purchase', 'signup')  -- Filter early
LIMIT 1000;  -- Only fetch what you need

Serverless and Pay-as-You-Go Options

Use serverless options to avoid paying for idle resources:

// Example of an AWS CDK (v2) stack for serverless data processing
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as events from "aws-cdk-lib/aws-events";
import * as targets from "aws-cdk-lib/aws-events-targets";
import * as s3 from "aws-cdk-lib/aws-s3";

export class ServerlessDataProcessingStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Input and output buckets
    const inputBucket = new s3.Bucket(this, "InputBucket");
    const outputBucket = new s3.Bucket(this, "OutputBucket");

    // Processing Lambda function
    const processingFunction = new lambda.Function(this, "ProcessingFunction", {
      runtime: lambda.Runtime.PYTHON_3_9,
      handler: "index.handler",
      code: lambda.Code.fromAsset("lambda"),
      memorySize: 1024, // Only pay for what you need
      timeout: cdk.Duration.minutes(5),
      environment: {
        OUTPUT_BUCKET: outputBucket.bucketName,
      },
    });

    // Schedule processing to run only when needed
    new events.Rule(this, "ScheduleRule", {
      schedule: events.Schedule.cron({ hour: "3", minute: "0" }), // Run at 3 AM daily
      targets: [new targets.LambdaFunction(processingFunction)],
    });

    // Grant permissions
    inputBucket.grantRead(processingFunction);
    outputBucket.grantWrite(processingFunction);
  }
}

Commercial Tool Costs

Open-Source Alternatives

Replace expensive commercial tools with open-source alternatives where feasible:

| Commercial Tool | Open-Source Alternative | Implementation Notes |
| --- | --- | --- |
| Fivetran/Stitch | Airbyte | docker-compose up -d to start a self-managed Airbyte instance |
| Tableau/Looker | Metabase/Superset | Host on modest cloud VMs or containers |
| Snowflake | ClickHouse/DuckDB | Suitable for TB-scale analytics at a fraction of the cost |
| DataDog | Prometheus + Grafana | Self-managed monitoring and alerting |

Implementation example for Apache Superset:

# docker-compose.yml for Apache Superset
version: "3"
services:
  redis:
    image: redis:latest
    restart: unless-stopped
    volumes:
      - redis:/data

  db:
    image: postgres:14
    restart: unless-stopped
    environment:
      POSTGRES_DB: superset
      POSTGRES_PASSWORD: superset
      POSTGRES_USER: superset
    volumes:
      - db_data:/var/lib/postgresql/data

  superset:
    image: apache/superset:latest
    restart: unless-stopped
    depends_on:
      - db
      - redis
    environment:
      SUPERSET_SECRET_KEY: your_secret_key_here
      DATABASE_DB: superset
      DATABASE_HOST: db
      DATABASE_PASSWORD: superset
      DATABASE_USER: superset
      DATABASE_PORT: 5432
      REDIS_HOST: redis
      REDIS_PORT: 6379
    volumes:
      - ./superset_config.py:/app/superset_config.py
    ports:
      - "8088:8088"
    command:
      [
        "/bin/sh",
        "-c",
        "superset db upgrade && superset init && gunicorn -w 10 --timeout 120 --bind 0.0.0.0:8088 'superset.app:create_app()'",
      ]

volumes:
  redis:
  db_data:

Hybrid Approach for BI

Use a hybrid approach combining lightweight tools for common analyses and more powerful tools for complex tasks:

# Example of generating routine reports with lightweight tools
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication

# Connect to your data source
engine = create_engine('postgresql://user:password@warehouse:5432/analytics')

# Run a query
df = pd.read_sql("""
    SELECT date_trunc('day', created_at) as day,
           count(*) as num_signups
    FROM users
    WHERE created_at >= current_date - interval '30 days'
    GROUP BY 1
    ORDER BY 1
""", engine)

# Create visualization
plt.figure(figsize=(10, 6))
plt.plot(df['day'], df['num_signups'])
plt.title('Daily Signups - Last 30 Days')
plt.xlabel('Date')
plt.ylabel('Number of Signups')
plt.tight_layout()
plt.savefig('daily_signups.png')

# Email the report
msg = MIMEMultipart()
msg['Subject'] = 'Daily Signups Report'
msg['From'] = 'reports@company.com'
msg['To'] = 'team@company.com'
msg.attach(MIMEText('Daily signups chart for the last 30 days attached.'))

# Attach the image
with open('daily_signups.png', 'rb') as f:
    attachment = MIMEApplication(f.read(), _subtype='png')
    attachment.add_header('Content-Disposition', 'attachment', filename='daily_signups.png')
    msg.attach(attachment)

# Send email using SMTP
server = smtplib.SMTP('smtp.company.com')
server.send_message(msg)
server.quit()

Consolidate Tools

Audit your tools and eliminate redundancies:

# Tools inventory and cost analysis script
import pandas as pd
import matplotlib.pyplot as plt

# Tool inventory data
tools_data = [
    {"category": "ETL", "name": "Fivetran", "monthly_cost": 2000, "users": 5, "features_used": ["Salesforce", "Postgres"]},
    {"category": "ETL", "name": "Stitch", "monthly_cost": 500, "users": 3, "features_used": ["MongoDB"]},
    {"category": "BI", "name": "Looker", "monthly_cost": 3000, "users": 20, "features_used": ["Dashboards", "Explores"]},
    {"category": "BI", "name": "Tableau", "monthly_cost": 1500, "users": 5, "features_used": ["Custom Reports"]},
    # ... more tools
]

# Convert to DataFrame
df = pd.DataFrame(tools_data)

# Analyze costs by category
category_costs = df.groupby('category')['monthly_cost'].sum().reset_index()
print("Costs by Category:")
print(category_costs)

# Identify redundancies
redundancy_analysis = df.groupby('category').agg({
    'name': lambda x: list(x),
    'monthly_cost': 'sum',
    'users': 'sum'
}).reset_index()

print("\nPotential Redundancies:")
for _, row in redundancy_analysis.iterrows():
    if len(row['name']) > 1:
        print(f"Category: {row['category']}, Tools: {row['name']}, Total Cost: ${row['monthly_cost']}")

# Visualization
plt.figure(figsize=(10, 6))
plt.bar(category_costs['category'], category_costs['monthly_cost'])
plt.title('Monthly Tool Costs by Category')
plt.xlabel('Category')
plt.ylabel('Monthly Cost ($)')
plt.savefig('tool_costs.png')

Operational Cost Optimization

Infrastructure as Code (IaC)

Manage your infrastructure with code to prevent resource sprawl and ensure consistency:

// Pulumi (TypeScript) example for creating right-sized database resources
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Create a parameter store entry to track environment parameters
const dbSizeParameter = new aws.ssm.Parameter("dbSizeParameter", {
    name: "/app/production/db-size",
    type: "String",
    value: "db.t3.medium", // Start conservative, scale as needed
});

// Database instance that reads size from parameter store
const dbInstance = new aws.rds.Instance("database", {
    engine: "postgres",
    instanceClass: dbSizeParameter.value,
    allocatedStorage: 20,
    storageType: "gp2",
    dbName: "analytics",
    username: "dbadmin",
    password: "PASSWORD_PLACEHOLDER", // Use a secrets manager in practice
    skipFinalSnapshot: true,
    backupRetentionPeriod: 7,
    // Auto-scaling configuration
    maxAllocatedStorage: 100,
    // Performance insights for optimization
    performanceInsightsEnabled: true,
    performanceInsightsRetentionPeriod: 7,
});

// Export the connection endpoint
export const endpoint = dbInstance.endpoint;

Scheduled Scaling

Implement scheduled scaling to reduce resources during off-hours:

# AWS Lambda function to resize RDS instances on a schedule
import boto3
import os
import json
from datetime import datetime

def lambda_handler(event, context):
    rds = boto3.client('rds')

    # Get parameters
    instance_id = os.environ['DB_INSTANCE_ID']
    business_hours_instance_class = os.environ['BUSINESS_HOURS_INSTANCE_CLASS']
    off_hours_instance_class = os.environ['OFF_HOURS_INSTANCE_CLASS']

    # Parse the ISO timestamp from the EventBridge scheduled event,
    # e.g. "2024-06-24T18:00:00Z"
    event_time = datetime.strptime(event['time'], '%Y-%m-%dT%H:%M:%SZ')
    current_hour = event_time.hour
    is_weekend = event_time.weekday() >= 5  # 5 = Saturday, 6 = Sunday

    business_hours = 8 <= current_hour < 18 and not is_weekend

    # Get current instance class
    response = rds.describe_db_instances(DBInstanceIdentifier=instance_id)
    current_instance_class = response['DBInstances'][0]['DBInstanceClass']

    # Determine target instance class
    target_instance_class = business_hours_instance_class if business_hours else off_hours_instance_class

    # If instance class needs to change
    if current_instance_class != target_instance_class:
        print(f"Resizing {instance_id} from {current_instance_class} to {target_instance_class}")

        try:
            rds.modify_db_instance(
                DBInstanceIdentifier=instance_id,
                DBInstanceClass=target_instance_class,
                ApplyImmediately=True
            )
            return {
                'statusCode': 200,
                'body': json.dumps(f'Resizing initiated for {instance_id}')
            }
        except Exception as e:
            print(f"Error resizing instance: {str(e)}")
            return {
                'statusCode': 500,
                'body': json.dumps(f'Error: {str(e)}')
            }
    else:
        print(f"No resize needed for {instance_id}, already at {current_instance_class}")
        return {
            'statusCode': 200,
            'body': json.dumps('No resize needed')
        }

Cost Monitoring and Alerting

Implement continuous cost monitoring:

# Python script to monitor and alert on cloud costs
import boto3
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import requests
import json

# Get AWS cost data
def get_aws_costs():
    client = boto3.client('ce')

    end_date = datetime.now().date()
    start_date = end_date - timedelta(days=30)

    response = client.get_cost_and_usage(
        TimePeriod={
            'Start': start_date.isoformat(),
            'End': end_date.isoformat()
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        GroupBy=[
            {
                'Type': 'DIMENSION',
                'Key': 'SERVICE'
            }
        ]
    )

    return response

# Process and analyze cost data
def analyze_costs(response):
    cost_data = []

    for result in response['ResultsByTime']:
        date = result['TimePeriod']['Start']

        for group in result['Groups']:
            service = group['Keys'][0]
            amount = float(group['Metrics']['UnblendedCost']['Amount'])

            cost_data.append({
                'Date': date,
                'Service': service,
                'Cost': amount
            })

    df = pd.DataFrame(cost_data)

    # Identify anomalies and trends
    service_costs = df.groupby('Service')['Cost'].sum().reset_index().sort_values('Cost', ascending=False)
    daily_costs = df.groupby('Date')['Cost'].sum().reset_index()

    # Calculate rolling average and detect anomalies
    daily_costs['Date'] = pd.to_datetime(daily_costs['Date'])
    daily_costs = daily_costs.sort_values('Date')
    daily_costs['Rolling_Avg'] = daily_costs['Cost'].rolling(window=7).mean()
    daily_costs['Anomaly'] = (daily_costs['Cost'] > daily_costs['Rolling_Avg'] * 1.3)

    anomalies = daily_costs[daily_costs['Anomaly']]

    return {
        'service_costs': service_costs,
        'daily_costs': daily_costs,
        'anomalies': anomalies
    }

# Send alerts if needed
def send_slack_alert(anomalies, webhook_url):
    if len(anomalies) > 0:
        latest_anomaly = anomalies.iloc[-1]

        message = {
            "blocks": [
                {
                    "type": "header",
                    "text": {
                        "type": "plain_text",
                        "text": "⚠️ AWS Cost Alert ⚠️"
                    }
                },
                {
                    "type": "section",
                    "text": {
                        "type": "mrkdwn",
                        "text": f"*Date:* {latest_anomaly['Date'].strftime('%Y-%m-%d')}\n*Cost:* ${latest_anomaly['Cost']:.2f}\n*Expected:* ${latest_anomaly['Rolling_Avg']:.2f}\n*Increase:* {((latest_anomaly['Cost'] / latest_anomaly['Rolling_Avg']) - 1) * 100:.1f}%"
                    }
                }
            ]
        }

        response = requests.post(webhook_url, json=message)
        return response.status_code

    return None

# Main execution
costs = get_aws_costs()
analysis = analyze_costs(costs)

# Generate cost report
plt.figure(figsize=(12, 6))
plt.plot(analysis['daily_costs']['Date'], analysis['daily_costs']['Cost'], label='Daily Cost')
plt.plot(analysis['daily_costs']['Date'], analysis['daily_costs']['Rolling_Avg'], label='7-day Average', linestyle='--')

# Highlight anomalies
if len(analysis['anomalies']) > 0:
    plt.scatter(analysis['anomalies']['Date'], analysis['anomalies']['Cost'], color='red', s=50, label='Anomalies')

    # Send alert
    send_slack_alert(analysis['anomalies'], 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK')

plt.title('AWS Daily Costs - Last 30 Days')
plt.xlabel('Date')
plt.ylabel('Cost ($)')
plt.legend()
plt.tight_layout()
plt.savefig('cost_trends.png')

# Top 5 services by cost
plt.figure(figsize=(10, 6))
top_services = analysis['service_costs'].head(5)
plt.bar(top_services['Service'], top_services['Cost'])
plt.title('Top 5 AWS Services by Cost')
plt.xlabel('Service')
plt.ylabel('Cost ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('top_services.png')

Cost-Efficient Data Stack Architecture

Based on the strategies above, here’s a reference architecture for a cost-optimized modern data stack:

# CDK example for a complete, cost-optimized data stack
from aws_cdk import (
    core as cdk,
    aws_autoscaling as autoscaling,
    aws_ec2 as ec2,
    aws_ecs as ecs,
    aws_ecs_patterns as ecs_patterns,
    aws_rds as rds,
    aws_s3 as s3,
    aws_lambda as lambda_,
    aws_events as events,
    aws_events_targets as targets,
    aws_iam as iam,
)

class CostOptimizedDataStackStack(cdk.Stack):
    def __init__(self, scope: cdk.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # VPC for all resources
        vpc = ec2.Vpc(self, "DataStackVPC", max_azs=2)

        # Storage Layer - S3 buckets with lifecycle policies
        raw_data_bucket = s3.Bucket(
            self, "RawDataBucket",
            lifecycle_rules=[
                s3.LifecycleRule(
                    transitions=[
                        s3.Transition(
                            storage_class=s3.StorageClass.INFREQUENT_ACCESS,
                            transition_after=cdk.Duration.days(30)
                        ),
                        s3.Transition(
                            storage_class=s3.StorageClass.GLACIER,
                            transition_after=cdk.Duration.days(90)
                        )
                    ],
                    expiration=cdk.Duration.days(365)
                )
            ]
        )

        processed_data_bucket = s3.Bucket(self, "ProcessedDataBucket")

        # Compute Layer - ECS Fargate for processing
        cluster = ecs.Cluster(self, "DataProcessingCluster", vpc=vpc)

        # ECS Task with Spot pricing for cost savings
        processing_task = ecs_patterns.ScheduledFargateTask(
            self, "DataProcessingTask",
            cluster=cluster,
            scheduled_fargate_task_definition_options=ecs_patterns.ScheduledFargateTaskDefinitionOptions(
                cpu=256,
                memory_limit_mib=512,
                image=ecs.ContainerImage.from_asset("./processing_container")
            ),
            schedule=events.Schedule.cron(hour="1", minute="0"),  # Run at 1 AM daily
            subnet_selection=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE)
        )

        # Add Spot capacity provider for cost savings
        spot_capacity_provider = ecs.AsgCapacityProvider(
            self, "SpotCapacityProvider",
            auto_scaling_group=autoscaling.AutoScalingGroup(
                self, "ASG",
                vpc=vpc,
                instance_type=ec2.InstanceType("t3.small"),
                machine_image=ecs.EcsOptimizedImage.amazon_linux2(),
                min_capacity=0,
                max_capacity=10,
                spot_price="0.0104"  # Set a maximum spot price
            )
        )
        cluster.add_asg_capacity_provider(spot_capacity_provider)

        # Serverless Lambda for event-driven processing
        event_processor = lambda_.Function(
            self, "EventProcessor",
            runtime=lambda_.Runtime.PYTHON_3_9,
            code=lambda_.Code.from_asset("lambda"),
            handler="processor.handler",
            memory_size=128,
            timeout=cdk.Duration.seconds(30),
            environment={
                "PROCESSED_BUCKET": processed_data_bucket.bucket_name
            }
        )

        # Data Warehouse - RDS Postgres with scheduled scaling
        db_credentials = rds.Credentials.from_generated_secret("dbadmin")

        data_warehouse = rds.DatabaseInstance(
            self, "DataWarehouse",
            engine=rds.DatabaseInstanceEngine.postgres(version=rds.PostgresEngineVersion.VER_13),
            vpc=vpc,
            instance_type=ec2.InstanceType.of(ec2.InstanceClass.BURSTABLE3, ec2.InstanceSize.MEDIUM),
            credentials=db_credentials,
            multi_az=False,  # Start with single AZ for cost savings
            allocated_storage=20,
            max_allocated_storage=100,  # Allow auto-scaling
            backup_retention=cdk.Duration.days(7),
            deletion_protection=False,
            storage_encrypted=True,
            storage_type=rds.StorageType.GP2,
        )

        # Lambda function for scheduled scaling
        db_scaler = lambda_.Function(
            self, "DBScaler",
            runtime=lambda_.Runtime.PYTHON_3_9,
            code=lambda_.Code.from_asset("lambda/db_scaler"),
            handler="index.handler",
            environment={
                "DB_INSTANCE_ID": data_warehouse.instance_identifier,
                "BUSINESS_HOURS_INSTANCE_CLASS": "db.t3.medium",
                "OFF_HOURS_INSTANCE_CLASS": "db.t3.small"
            }
        )

        # Grant permissions to scaler
        db_scaler.add_to_role_policy(iam.PolicyStatement(
            actions=["rds:ModifyDBInstance", "rds:DescribeDBInstances"],
            resources=["*"]
        ))

        # Schedule scaling down in evening
        scale_down_rule = events.Rule(
            self, "ScaleDownRule",
            schedule=events.Schedule.cron(hour="18", minute="0")  # 6 PM
        )
        scale_down_rule.add_target(targets.LambdaFunction(db_scaler))

        # Schedule scaling up in morning
        scale_up_rule = events.Rule(
            self, "ScaleUpRule",
            schedule=events.Schedule.cron(hour="8", minute="0")  # 8 AM
        )
        scale_up_rule.add_target(targets.LambdaFunction(db_scaler))

        # Dashboard container with Metabase (open-source BI)
        dashboard_task = ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "DashboardService",
            cluster=cluster,
            cpu=512,
            memory_limit_mib=1024,
            task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
                image=ecs.ContainerImage.from_asset("./metabase_container"),
                container_port=3000,
                environment={
                    "MB_DB_TYPE": "postgres",
                    "MB_DB_HOST": data_warehouse.db_instance_endpoint_address,
                    "MB_DB_PORT": data_warehouse.db_instance_endpoint_port,
                    "MB_DB_USER": "metabase",
                    "MB_DB_PASS": "PLACEHOLDER"  # Use secrets in practice
                }
            ),
            desired_count=1,
            public_load_balancer=True
        )

        # Grant appropriate permissions
        raw_data_bucket.grant_read(processing_task.task_definition.task_role)
        processed_data_bucket.grant_write(processing_task.task_definition.task_role)
        processed_data_bucket.grant_read(event_processor)
        data_warehouse.connections.allow_from(
            dashboard_task.service.connections,
            ec2.Port.tcp(5432),
            "Allow dashboard service to connect to database"
        )

Decision Rules

Use this checklist for data stack cost decisions:

  1. If storage costs are high, implement lifecycle policies to move old data to cheaper tiers
  2. If compute costs are high, profile workloads before scaling; bottlenecks vary
  3. If commercial tool costs are high, evaluate open-source alternatives for your workload size
  4. If you have idle resources, use scheduled scaling or serverless options
  5. If costs are unpredictable, implement cost monitoring with alerts before they spiral

Right-size first, optimize second. Premature optimization of non-bottlenecks wastes effort.
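The checklist above can be sketched as a simple triage function. The thresholds and category names here are illustrative assumptions, not prescriptions; tune them to your own spend profile:

```python
# Illustrative triage of the decision rules above; thresholds are assumptions
def cost_recommendations(spend: dict, idle_hours_per_day: float,
                         month_over_month_change: float) -> list:
    """Map observed cost signals to the optimization levers discussed above."""
    recs = []
    total = sum(spend.values()) or 1
    if spend.get('storage', 0) / total > 0.30:
        recs.append('Add lifecycle policies to tier or expire old data')
    if spend.get('compute', 0) / total > 0.40:
        recs.append('Profile workloads before scaling; right-size instances')
    if spend.get('tools', 0) / total > 0.25:
        recs.append('Evaluate open-source alternatives for your workload size')
    if idle_hours_per_day > 8:
        recs.append('Use scheduled scaling or serverless for idle periods')
    if abs(month_over_month_change) > 0.20:
        recs.append('Add cost monitoring with anomaly alerts')
    return recs

print(cost_recommendations(
    {'storage': 2000, 'compute': 5000, 'tools': 1000},
    idle_hours_per_day=12,
    month_over_month_change=0.35,
))
```

Encoding the rules this way also makes the triage repeatable, so it can run monthly against the same cost exports your monitoring script already collects.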

