Keep Your Warehouse from Drowning in Cold Data: A Practical Baselining Playbook
A three-layer playbook — lifecycle policies, usage-based cleanup, and team rituals — to stop warehouse storage from compounding silently.
Most warehouse cost conversations focus on compute — which credits are cheapest, which cluster size is right-sized, whether to commit to reserved capacity. But as we covered in The Real Cost of Snowflake vs Databricks vs BigQuery, compute isn't the part that quietly compounds. Storage is.
A team starts with 500 GB. Two years later they're sitting on 50 TB, and maybe 20% of it is actively queried. Nobody deletes anything because the fear of breaking a downstream consumer — or needing the data the day after you drop it — always outweighs the storage line on the bill. Until that line gets noticed.
The answer isn't a one-off cleanup sprint. It's data baselining: a repeatable program that makes sure your warehouse stops growing faster than your business does. This post is the playbook.
The three layers of baselining
There's a maturity model here. You don't jump straight to the most sophisticated approach — you build capability in layers, each giving you higher-quality signal than the one below it: Layer 1 is automated lifecycle policies, Layer 2 is usage-based cleanup driven by query metadata, and Layer 3 is a recurring team ritual that applies human judgment.
The layers compose — Layer 3 doesn't replace Layer 1; it sits on top of it. A mature team runs all three, each catching what the layers below can't.
A case study: three weeks, 19% weekly-cost reduction
The framework above isn't theoretical — we recently ran this playbook on a shared AWS account that had become a graveyard of experiments. Over roughly six years, the account had accumulated resources from multiple teams: hundreds of S3 buckets from decommissioned deployment pipelines, orphaned ECR repositories, abandoned Glue jobs, stale CloudFormation stacks. All generating ongoing cost with no business value, all continuously surfacing security and compliance risks. Many of the original creators had left their teams. Most resources had no tags and no owner.
Over three weeks, we systematically baselined every resource in the account. Here's what came out:
| Resource Type | Before | After | Removed | Reduction |
|---|---|---|---|---|
| S3 Buckets | 678 | 240 | 438 | 65% |
| ECR Repositories | 160 | 60 | 100 | 63% |
| Glue Jobs | 76 | 18 | 58 | 76% |
| CloudFormation Stacks | 175+ | 130 | 45+ | ~26% |
And the weekly cost impact by service:
| Service | Before | After | Change/week |
|---|---|---|---|
| S3 | $2,472 | $1,979 | -$494 (-20%) |
| SageMaker | $580 | $255 | -$326 (-56%) |
| MWAA (Managed Airflow) | $792 | $522 | -$269 (-34%) |
| EC2 / Compute | $205 | $84 | -$121 (-59%) |
| Lambda | $142 | $89 | -$54 (-38%) |
Weekly cost dropped from $6,322 to $5,128 — a 19% reduction, or roughly $62K/year annualized, achieved in 15 working days. A few patterns showed up that are worth flagging up front, because they generalize to every warehouse and data-platform cleanup:
- Start with the source of truth, not the cleanup. We built a complete inventory first, then classified, then deleted in waves. Early attempts to "just delete the obvious ones" would have missed dependencies — like an Athena query-results bucket that looked auto-generated but was actively used by a production BI dashboard.
- AI-assisted auditing made the difference. Manually investigating 678 S3 buckets, 160 ECR repos, and 76 Glue jobs would have taken weeks. Using AI assistants to generate per-resource reports, cross-reference CloudFormation stacks, read deployment-artifact contents, and classify resources reduced it to days. Humans still handled the ambiguous 10–20%.
- "Empty" doesn't mean "safe to delete." Empty Elastic Beanstalk buckets had active environments. Empty S3 buckets had live SNS notifications. The zero-false-positive requirement forced us to check notifications, bucket policies, logging targets, and cross-service dependencies — not just object counts.
With that context, the three layers below are the structure we used. The rest of this post is the generalized playbook.
Layer 1 — Lifecycle policies
This is the cheapest win. Every cloud warehouse and object store has mechanisms to auto-expire data; most teams just never configure them beyond the defaults.
Object storage. S3, GCS, and Azure Blob all support lifecycle rules: transition to colder tiers after N days, delete after M days. Your landing zone and staging buckets almost certainly don't need 1-year retention at hot-storage rates. 30 days of standard storage followed by Glacier or deletion is usually fine for raw ingest.
Warehouse-native. Each warehouse has its own retention primitives:
- Snowflake: `DATA_RETENTION_TIME_IN_DAYS` for Time Travel (default 1 day on Standard edition, but up to 90 on Enterprise — and Enterprise defaults can silently be high), plus Fail-safe (7 days, non-configurable, but only applies to permanent tables). Use `TRANSIENT` tables for anything that doesn't need Fail-safe; it cuts storage overhead roughly in half.
- Google BigQuery: table-level and partition-level expiration timestamps. Set them at create time on any table that has a natural shelf-life (staging, sandbox, scratch). Also: BigQuery's automatic long-term storage pricing is a free win — data not modified for 90 days drops to roughly half the active rate, no action required.
- Databricks: Delta tables have `delta.logRetentionDuration` (default 30 days) and `delta.deletedFileRetentionDuration` (default 7 days), cleaned up by `VACUUM`. The default is conservative; if you're not doing time-travel queries beyond a week, shortening these reclaims real storage.
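To make those settings concrete, here's a minimal sketch of the retention DDL in each dialect. Table names (`staging.events`, `scratch.tmp_sessions`) are hypothetical, and each statement runs only on its own platform:

```sql
-- Snowflake: shorten Time Travel on a staging table; use TRANSIENT for
-- scratch data so it never accrues Fail-safe storage.
ALTER TABLE staging.events SET DATA_RETENTION_TIME_IN_DAYS = 1;
CREATE TRANSIENT TABLE scratch.tmp_sessions LIKE staging.sessions;

-- BigQuery: expire partitions after 30 days on a partitioned staging table;
-- the table then cleans itself up with no further action.
ALTER TABLE staging.events SET OPTIONS (partition_expiration_days = 30);

-- Databricks: shorten the Delta retention windows, then VACUUM to reclaim files.
ALTER TABLE staging.events SET TBLPROPERTIES (
  'delta.logRetentionDuration'         = 'interval 7 days',
  'delta.deletedFileRetentionDuration' = 'interval 7 days'
);
VACUUM staging.events;
```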
The rule of thumb. Tag your tables by lifecycle intent at creation time — staging, raw, transient, permanent, compliance — and let the tag drive the policy. Retroactively applying retention to tables that have lived without it for two years is risky: by that point you often don't know which jobs, dashboards, or ad-hoc users depend on the history, so any aggressive TTL risks silently breaking a downstream consumer. Apply it at creation and the problem doesn't compound in the first place.
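On Snowflake, for example, object tags make the intent enforceable at create time. A minimal sketch, assuming a `governance` schema you own (the tag name and allowed values are conventions of ours, not built-ins):

```sql
-- Declare the lifecycle tag once, with a closed set of allowed values.
CREATE TAG IF NOT EXISTS governance.lifecycle
  ALLOWED_VALUES 'staging', 'raw', 'transient', 'permanent', 'compliance';

-- Every new table declares its lifecycle intent at creation.
CREATE TABLE staging.raw_orders (order_id STRING, payload VARIANT)
  WITH TAG (governance.lifecycle = 'staging');
```

A scheduled job can then read `SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES` and apply whatever retention policy each tag value implies.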
A trap to watch for — pipeline orphans. In our audit, deployment pipelines turned out to be one of the biggest sources of silent storage growth. Every CI/CD pipeline tends to create 4–6 AWS resources (deployment buckets, logging buckets, ECR repos across regions) with DeletionPolicy: Retain. When the pipeline itself gets decommissioned, all of those resources persist. Your warehouse's equivalent is the set of temp tables, intermediate layers, and staging datasets that your ETL tool creates and then never cleans up. Add a cleanup checklist to your pipeline decommissioning workflow, or these will quietly cost you forever.
Layer 1 will solve maybe 40% of your cold-data problem. It's a floor, not a ceiling.
Layer 2 — Usage-based cleanup
Lifecycle policies catch intended transience. They don't catch tables that were created for a specific purpose that has since ended, and that nobody ever removed. That's what Layer 2 is for.
The input signal is usage metadata. Every warehouse exposes it:
- Snowflake: `ACCESS_HISTORY` and `QUERY_HISTORY` views give you last-queried times and access counts per object.
- BigQuery: `INFORMATION_SCHEMA.JOBS` has the same at the project or organization level.
- Databricks: system tables (`system.access.table_lineage`, `system.query.history`) expose equivalent data.
A basic cold-scoring rule — the one most teams converge on after a few iterations — looks roughly like:
A table is a cold candidate if:
- it hasn't been queried in the last 180 days, AND
- no active DAG, dbt model, or BI dashboard depends on it, AND
- it doesn't carry a `pii`, `finance`, `compliance`, or `do_not_touch` tag.
The first two you can query. The third has to come from metadata you maintain — which means tagging discipline is the gating factor for all of Layer 2.
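Here's roughly what that query looks like on Snowflake. `ACCESS_HISTORY` is real; `metadata.dag_dependencies` and `metadata.protected_tables` are hypothetical tables you'd populate from your orchestrator exports and your tagging convention:

```sql
-- Cold candidates: no query in 180 days, no known dependency, no protected tag.
WITH last_access AS (
    SELECT f.value:objectName::STRING AS fqn,
           MAX(query_start_time)      AS last_queried_at
    FROM snowflake.account_usage.access_history,
         LATERAL FLATTEN(input => direct_objects_accessed) f
    GROUP BY 1
),
all_tables AS (
    SELECT table_catalog || '.' || table_schema || '.' || table_name AS fqn
    FROM snowflake.account_usage.tables
    WHERE deleted IS NULL
)
SELECT t.fqn, a.last_queried_at
FROM all_tables t
LEFT JOIN last_access a ON a.fqn = t.fqn
WHERE COALESCE(a.last_queried_at, TO_TIMESTAMP('1970-01-01'))
        < DATEADD(day, -180, CURRENT_TIMESTAMP())
  AND t.fqn NOT IN (SELECT fqn FROM metadata.dag_dependencies)   -- rule 2 (hypothetical)
  AND t.fqn NOT IN (SELECT fqn FROM metadata.protected_tables);  -- rule 3 (hypothetical)
```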
Once you have a list of cold candidates, you don't immediately drop them. The safe path is a graduated workflow: flag the table, notify its owner if one exists, rename it with a deprecation prefix, wait 30 days, then archive or drop.
The rename-and-wait pattern matters. A renamed table breaks queries loudly and recoverably. A dropped table breaks queries destructively. The 30-day window is your safety net for the consumer you didn't know existed — almost always a dashboard or a notebook, not a production pipeline.
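In practice the rename step can be as simple as a prefix convention that encodes when the table becomes droppable. A sketch, with hypothetical names:

```sql
-- Rename loudly: downstream queries now fail with "object does not exist",
-- which is annoying but recoverable; the prefix records the drop-eligible date.
ALTER TABLE analytics.attribution_2024
  RENAME TO analytics.zz_deprecated_2025_07_01__attribution_2024;

-- 30+ days later, if nobody has complained:
DROP TABLE analytics.zz_deprecated_2025_07_01__attribution_2024;
```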
Watch scheduled jobs, not just tables. In our audit, 6 Glue jobs were running on schedule, failing every execution, burning DPU-hours for months — because nobody was looking. The analogue in the warehouse world is a scheduled query or dbt job that keeps running long after its project died. Alert on consecutive failures, auto-disable after N, and include "jobs with no successful run in 90 days" in your Layer 2 sweeps. Compute costs from zombie schedules are harder to spot than storage because they don't accumulate visibly.
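On Snowflake, a sweep for zombie scheduled tasks is one query against `TASK_HISTORY` (the 90-day threshold mirrors the Layer 2 rule; dbt and Airflow keep equivalent run metadata of their own):

```sql
-- Scheduled tasks with no successful run in 90 days: disable candidates.
SELECT
    name,
    MAX(IFF(state = 'SUCCEEDED', completed_time, NULL)) AS last_success,
    COUNT_IF(state = 'FAILED')                          AS failed_runs
FROM snowflake.account_usage.task_history
GROUP BY name
HAVING COALESCE(MAX(IFF(state = 'SUCCEEDED', completed_time, NULL)),
                TO_TIMESTAMP('1970-01-01'))
         < DATEADD(day, -90, CURRENT_TIMESTAMP());
```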
The downside of Layer 2 is that automation can only answer technical questions. It can't tell you whether a table is still useful to the business, only whether anything has queried it. Which is why you also need Layer 3.
Layer 3 — Team baselining rituals
This is the layer most teams skip because it looks like meetings, not engineering. It's also the layer that catches the highest-value problems.
The ritual is simple: once a quarter — or monthly if your backlog is big enough to warrant it — the data team spends 60–90 minutes walking through a curated list of "storage consumers that deserve a human look." For each, someone owns a decision — keep, archive, drop, or "owner to clarify by date X" — and it gets logged.
The list to bring to the meeting:
- Top 20 largest tables, annotated with last-queried date, owner, and downstream references (a one-query pull; sketch after this list)
- Top 20 fastest-growing tables in the last 90 days (the "what new mess did we create" list)
- Orphaned intermediate tables — anything not referenced by a production dbt model or orchestrator DAG
- Buckets / datasets with no access in 180 days (Layer 2 output, pre-filtered)
- Tables without owners (if you have a tagging convention, this is a one-line query; if you don't, this is your action item for next quarter)
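The first item really is one query if your metadata is in shape. A Snowflake sketch, reusing the `last_access` CTE from Layer 2 (`metadata.table_owners` is hypothetical, fed by your tagging convention):

```sql
WITH last_access AS (
    SELECT f.value:objectName::STRING AS fqn,
           MAX(query_start_time)      AS last_queried_at
    FROM snowflake.account_usage.access_history,
         LATERAL FLATTEN(input => direct_objects_accessed) f
    GROUP BY 1
)
SELECT
    m.table_catalog || '.' || m.table_schema || '.' || m.table_name AS fqn,
    ROUND(m.active_bytes / POWER(1024, 3), 1)                       AS active_gb,
    a.last_queried_at,
    o.owner
FROM snowflake.account_usage.table_storage_metrics m
LEFT JOIN last_access a
       ON a.fqn = m.table_catalog || '.' || m.table_schema || '.' || m.table_name
LEFT JOIN metadata.table_owners o    -- hypothetical ownership registry
       ON o.fqn = m.table_catalog || '.' || m.table_schema || '.' || m.table_name
WHERE m.deleted = FALSE
ORDER BY m.active_bytes DESC
LIMIT 20;
```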
The output of the meeting isn't a cleanup plan. It's decisions with names attached. The data engineer on call for next sprint isn't the right person to decide whether the 2024 marketing attribution table is still needed — the marketing analytics lead is. Layer 3's job is to route the decision to the right human and record what they said.
A few things that make this ritual work in practice:
- Rotate the facilitator. If it's always the same person, it becomes their problem and then stops happening.
- Celebrate the deletions. It feels counterintuitive to celebrate reducing scope, but teams that don't create some positive association with deletion will always default to keep.
- Keep the log. A simple table of "table name · decision · owner · date" becomes invaluable six months later when someone asks "why did we drop this?"
- Budget for it. It's not free time — put it on the sprint.
- Treat it as a learning session. An under-appreciated side-effect: these meetings are where the team actually learns the platform. People explain what a dataset is for, which pipeline produces it, why a particular warehouse feature was chosen, what went wrong last time someone tried to drop it. New hires ramp up faster, tribal knowledge gets externalized, and best practices (and anti-patterns) spread naturally — all from a meeting whose nominal purpose was deleting tables.
Layer 3 is where you catch the tables that Layers 1 and 2 can't: the ones that are queried occasionally by a human doing ad-hoc analysis but have no business value, and the ones that are genuinely still needed but tagged wrong and would have been auto-archived.
The uncomfortable truth about ownership. Shared environments without ownership tracking become graveyards. In our audit, most resources had no tags and no owner — the people who created them had left their teams long ago. Reconstructing that context was an investigation in its own right: reading template names inside deployment artifacts, correlating resources with long-archived pipelines, contacting former teammates and neighbouring teams, digging through old wiki pages and Slack archives. None of it was hard in isolation, but doing it for hundreds of resources is what made it expensive. Tagging at creation time (owner, project, team, lifecycle, sensitivity) is nearly free. Reconstructing it retroactively is 10–100x more expensive. Layer 3's recurring ritual is also where you notice ownership decay before it becomes a graveyard — every untagged or ex-employee-owned table surfaced in the review is a chance to re-home it now, while someone still remembers what it was for.
A rollout plan
If you're starting from scratch, don't try to deploy all three layers at once. Sequence them.
- Week 1 — Layer 1 quick wins. Audit your warehouse retention defaults. Lower Time Travel to 1–7 days on staging. Set partition expiration on BigQuery sandboxes. Enable S3/GCS lifecycle rules on raw-ingest buckets. Most teams reclaim 15–30% of storage in this week alone. (A starter audit query follows this list.)
- Month 1 — Layer 2 foundations. Turn on `ACCESS_HISTORY` or the equivalent. Write your cold-candidate query and run it weekly in shadow mode — produce the list, don't act on it. Start tagging new tables with lifecycle + sensitivity metadata.
- Quarter 1 — first Layer 3 review. Even if Layer 2 automation isn't live yet, hold the first team baselining meeting using whatever data you can pull manually. Get the calendar invite recurring.
- Year 1 — full loop. Layer 2 promoted from shadow mode to automated rename-and-wait. Layer 3 running quarterly. Tagging discipline is now a PR-review item for new tables.
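The Week 1 retention audit can start as a single query. On Snowflake, for instance:

```sql
-- Tables keeping more than 7 days of Time Travel: Layer 1 quick-win candidates.
SELECT table_catalog, table_schema, table_name, retention_time
FROM snowflake.account_usage.tables
WHERE deleted IS NULL
  AND retention_time > 7
ORDER BY retention_time DESC;
```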
The goal isn't a smaller warehouse
The goal is a warehouse where you trust what's in it.
Trust means: you know every large table has a reason to exist, an owner who would notice if it broke, and a policy that decides what happens to it next. That's not a cost-optimization outcome — it's a governance outcome that happens to be much cheaper than not having one.
And the cheapest baselining is the baselining you don't have to do. Almost every painful finding in this post — orphaned pipelines, tables with no owner, policies applied years too late — exists because the right conventions weren't in place on day one. Tagging at creation time, lifecycle intent declared up front, pipeline decommissioning checklists, ownership as a hard requirement for any new dataset: none of these cost anything when you design them in from the start, and all of them are 10–100x more expensive to retrofit later. Baselining is how you recover from not having done this; designing the platform correctly from the beginning is how you keep baselining to a minimum forever.
Back to where this started: the cost conversation. The compute bill is visible; it shows up in weekly reviews. The storage bill is invisible until it isn't. Baselining is the mechanism that keeps your Snowflake, BigQuery, or Databricks bill from quietly compounding while you sleep. Layer 1 alone will pay for itself in a quarter. Adding Layer 3 will make your team better at every other part of owning the warehouse.
Part 2 of 2 in our warehouse-cost series:
- ← Part 1: The Real Cost of Snowflake vs Databricks vs BigQuery
- Browse the data warehouse category for every option we track
- See our best data warehouses ranking for editorial picks
Written by Egor Burlakov
Engineering and Science Leader with experience building scalable data infrastructure, data pipelines and science applications. Sharing insights about data tools, architecture patterns, and best practices.