BigQuery on a Budget: Essential Tips to Optimize and Save
I love BigQuery. It’s like the Swiss Army knife of data analytics—throw any data at it, and it just works. For our small analytics team, it’s been a game-changer, helping us iterate quickly, build new features, and respond to urgent requests with ease.
But let’s be real: BigQuery can be a money pit if you’re not careful.
The Cost Trap
We found this out the hard way. A big client had issues in specific global regions, and we needed data insights fast. BigQuery helped us analyze everything from node performance to stream ingestion locations in record time. But then came the kicker—our BigQuery costs doubled.
The culprit? Our rapid-response fixes were efficient in delivering results but not in managing costs. It was time to optimize.
Optimization Time
Let’s talk about how we slashed our costs by 90% using three simple methods: Incremental Tables, Static Partitions, and Source Table Partitions.
1. Incremental Tables
Incremental tables in dbt only process new or updated data, which makes them perfect for large, growing datasets where you don't need to rewrite history. Using dbt’s is_incremental() function, we set up these tables to update only what’s new on each run. This cut down the amount of data processed, saving us both time and money. A minimal version of the pattern is sketched below.
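Here’s a minimal sketch of that pattern, assuming a max(updated_at) watermark filter, which is a common approach rather than necessarily what we shipped (the id and updated_at columns mirror the fields used later in the post):
{{ config(materialized="incremental", unique_key="id") }}
select *
from {{ ref('studio_profiles') }}
{% if is_incremental() %}
-- on incremental runs, only read rows that changed since the last successful build
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}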
2. Static Partitions
Dynamic partitions are great, but we took it further by using static partitions. Instead of letting BigQuery figure out which partitions to update, we told it exactly which ones to replace. For a table partitioned by day, we only replaced the current day’s partition. This simple change significantly reduced the number of full-table scans and shaved off a big chunk of our costs.
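In dbt-on-BigQuery terms, this means the insert_overwrite incremental strategy with an explicit partitions list in the config, so dbt doesn’t have to detect the affected partitions at run time. A stripped-down version of the config might look like this (the created_at_ts field is borrowed from the full model shown later):
{% set partitions_to_replace = ['timestamp(current_date)'] %}
{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={"field": "created_at_ts", "data_type": "timestamp", "granularity": "day"},
        partitions=partitions_to_replace,
    )
}}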
3. Source Table Partitions
Finally, we realized our source table wasn’t partitioned, making every query a costly full-table scan. We fixed this by copying the data into a new table with a daily partition on the updated_at
timestamp. Now, our queries are more efficient, and our costs are lower.
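The fix itself is a one-off CREATE TABLE ... AS SELECT. The project and dataset names below are placeholders, and the partition spec mirrors the daily granularity used in the model:
-- build a partitioned copy of the previously unpartitioned source table
create table `my-project.analytics.studio_profiles_partitioned`
partition by timestamp_trunc(updated_at, day)
as
select *
from `my-project.analytics.studio_profiles`;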
Code Example
Here’s how we put it all together in our dbt models:
- Set up an incremental table to handle only new or updated data.
- Use a static partitions_to_replace list to specify exactly which partitions to replace and query.
- (Not shown) Make sure your source table is partitioned, using something like the updated_at field.
- Apply these partitions when both reading and writing data to maximize efficiency.
{% set partitions_to_replace = [
    'timestamp(current_date)'
] %}
{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={
            "field": "created_at_ts",
            "data_type": "timestamp",
            "granularity": "day",
        },
        partitions=partitions_to_replace,
        unique_key="id",
    )
}}
with
    -- this is the data that is read
    source as (
        select *
        from {{ ref('studio_profiles') }}
        {% if is_incremental() %}
        -- this filter will only be applied on an incremental run
        where timestamp_trunc(updated_at, day) in (
            {{ partitions_to_replace | join(',') }}
        )
        {% endif %}
    )

-- this is the data that is written
select *
from source
{% if is_incremental() %}
-- this filter will only be applied on an incremental run
where timestamp_trunc(created_at_ts, day) in (
    {{ partitions_to_replace | join(',') }}
)
{% endif %}
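A usage note: the very first run of this model (or any run with dbt run --full-refresh --select your_model_name) rebuilds the whole table, since is_incremental() is false in that case. Every subsequent run reads and overwrites only the partitions listed in partitions_to_replace, which is where the savings come from.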
Wrapping Up
By applying these three techniques—Incremental Tables, Static Partitions, and Source Table Partitions—we managed to cut our BigQuery costs by 90%. Yes, these optimizations add some complexity, but the savings are well worth it.
The key takeaways?
- Use incremental tables to process only new or updated data.
- Implement static partitions to control exactly what gets processed.
- Ensure your source tables are partitioned for efficient querying.
With these strategies in place, you can enjoy the power of BigQuery without breaking the bank.