DBT and FinOps data - a match made in heaven

First in a series on Cost Allocation, this post talks about tooling - DBT, data warehouses, and the benefits of technical ownership of your FinOps data story.

Photo by Jordane Mathieu / Unsplash
💡 Hello reader 👋, I just got back from FinOps X - the big yearly FinOps conference - and am feeling inspired to tackle a series that's been brewing in my imagination for quite some time on the topic of cost allocation.

Cost allocation is a common FinOps responsibility, and while the methodology is particular to each business, I think the general shape of the process is similar enough that a few posts about our approach might be useful to put out there.

Cost allocation is broadly about figuring out "what costs what?" in the vocabulary of your business, rather than the cloud vendor's vocabulary present in the monthly bill. It directly relates to a lot of FinOps visibility work and often builds upon previous allocation work, so let's dive in with a primer discussion on the tools that I use here to manage it all.

Prologue

A recent leg of my career path found me spinning up the data function at my current employer. We'd previously had no holistic treatment of the company's data - minimal BI, minimal data warehousing, and as a result no visibility into several business processes that were starting to present themselves as a strategic problem for the business. I'd spent the previous few years in customer-facing management roles, quietly cataloging issues that somebody was going to have to solve someday, once we got around to hiring a data person. At some point I started daydreaming about how I'd tackle it myself.

After multiple quarters of staring at the problem, I raised my hand and asked if we could do something about the situation - officially, deliberately. They said "sure, go for it," and so I spent the next few months researching the data-tech landscape and revisiting the Kimball et al. section of my bookshelf to formulate a plan.

We had some key technical pieces in place already. Google BigQuery was already being used as a semi-structured destination for some data feeds from engineering and the platform side of the business; I picture BigQuery as the bottom layer of the cake. Metabase was already installed as a BI dashboard, with a MySQL database as the backend for some data pipelines pulling from our ticketing system and a few other internal sources. Metabase - capable enough not to be the top priority - is where most users will query the warehouse and build reports, and it functions as the top layer of the cake. Without turning this into a series on how to build out a modern data platform, that left us with "the middle layer": getting all the raw data into BigQuery and then, most importantly, reshaping it to fit the business needs and vocabulary of my employer.

I didn't know how important a piece it would eventually become, but I kept seeing a tool called DBT mentioned in all the data job descriptions I was checking out as part of my research. I didn't understand it at first glance, so I'll tell you what it is right now...

DBT

DBT is a framework for building out a data warehouse, in much the same way that Ruby on Rails is a framework for building out a web application. It's a set of conventions, helper functions, and supporting software bits that an open source community has collected into a software toolkit - a framework.

In short, DBT provides tooling to help with writing SQL queries against the raw data in your warehouse, and with managing the resulting tables and views. These queries are rarely finished reports (though they can be). More often they pull together raw data from all over your warehouse into one location (a schema), cleaning it up along the way.

Put more simply - you have data all over the place, and most of it is a mess. DBT helps you manage it, refine it, and document it all in one place. It also allows you to manage a large data project more like a large software project - version control, CI/CD automation, some support for testing, etc.
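To make the testing point concrete, here's a minimal sketch of a dbt "singular test" - a SQL file placed in the project's tests/ directory that fails if it returns any rows. The file name is hypothetical, and it references the aws_focus model we'll build later in this post.

-- in tests/assert_no_negative_billed_cost.sql (hypothetical file name)
-- dbt treats any rows returned by a singular test as failures,
-- so this passes only if no line item has a negative BilledCost
SELECT *
FROM {{ ref('aws_focus') }}
WHERE BilledCost < 0

Run dbt test and this executes alongside any other tests in the project - exactly the kind of CI/CD-friendly workflow you'd expect from a software project.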

From ETL to ELT

If you've been anywhere near data in a business setting in the last 30 years, you've heard the acronym "ETL" - extract, transform, load, in that order. Well, the paradigm has shifted in the cloud era toward a model of "ELT" - extract, load, transform. Get the data from the source, store it in your cloud warehouse or S3 bucket exactly as it came from the source, and then transform it later.

This approach has the benefit of allowing you to change your mind about the logic in your transformations when you inevitably need to, without reimporting the data all over again. This method largely decouples the constantly evolving data analysis needs of the business from the upstream data engineering teams that are responsible for procuring the data.

Raw AWS CUR billing data lands in your data warehouse
→ DBT executes a SQL statement, creating a new SQL view or table
→ Your BI tool queries against this cleaned-up table
→ You discover an error in the original SQL transform logic, or you learn more about the underlying data and have a better idea
→ You update the logic and recreate the SQL view (note that you didn't have to reimport any data)
→ Your BI tool queries the refreshed view, and the cycle repeats

DBT is the T step in ELT and you do the transforming with plain old SQL. How about a short example?

-- in models/aws_focus.sql
SELECT 
  `bill/BillingPeriodStartDate` AS BillingPeriodStart,
  `bill/BillingPeriodEndDate` AS BillingPeriodEnd,
  `lineItem/UsageStartDate` AS ChargePeriodStart,
  `lineItem/UsageEndDate` AS ChargePeriodEnd,
  `lineItem/UnblendedCost` AS BilledCost,
  ...
FROM {{ source('billing_data', 'aws_cur_raw') }}
-- this last bit is DBT code that selects the aws_cur_raw table from
-- the billing_data database (or schema).  More on this later.

Suppose you've already done the work of ingesting your AWS bill into your data warehouse. If you finish the SQL statement above with all 43 columns in the current FOCUS spec, you can run dbt run -s aws_focus and DBT will execute the SQL against your data warehouse and create a SQL view - essentially a pre-written SQL query that functions just like a virtual table - that you can query with your BI tool. Congrats: you have just reformatted your historical AWS billing data into the FOCUS spec without any new data pipelines, and without breaking your ability to query that historical data.
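To make "query with your BI tool" concrete, here's the kind of sanity check you might run against the new view from Metabase's SQL editor or the BigQuery console. The finops dataset name is an assumption - where the view lands depends on how your dbt profile and project are configured.

-- hypothetical: total billed cost per billing period, straight off the new view
SELECT
  BillingPeriodStart,
  SUM(BilledCost) AS total_billed_cost
FROM finops.aws_focus
GROUP BY BillingPeriodStart
ORDER BY BillingPeriodStart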

I'll have a lot more cool examples in the next post.

Ok cool, but ... why?

As I have said elsewhere on this blog, it is my opinion that the data part of FinOps is a specific domain of business intelligence (BI). BI done correctly benefits from the additive effect of connecting disparate data sources together to derive insights that a single data source alone might not make obvious.

Data and BI become more valuable as you are able to connect together more of your company's data.

If you are able to manage your cloud billing data in the same place as, say, customer support requests or Stripe transactions, you are better able to build an automated, holistic picture of your company's operations. It's work, but it's possible to build data products specific to your business that no 3rd party FinOps tool could ever accommodate. Even the very best 3rd party tool will eventually reach the limits of its ability to integrate with your business's data, and when you hit that limit, you'll be exporting data to a spreadsheet all over again.
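As a sketch of what that can look like - with the caveat that every table and column name below is invented for illustration - suppose you also land Stripe charges in the warehouse. Putting monthly cloud cost next to monthly revenue becomes a single query:

-- hypothetical: cloud cost as a share of revenue, by month
WITH monthly_cost AS (
  SELECT DATE_TRUNC(DATE(ChargePeriodStart), MONTH) AS month,
         SUM(BilledCost) AS cloud_cost
  FROM finops.aws_focus
  GROUP BY month
),
monthly_revenue AS (
  SELECT DATE_TRUNC(DATE(created), MONTH) AS month,
         SUM(amount) / 100 AS revenue   -- Stripe amounts are in the smallest currency unit
  FROM raw.stripe_charges               -- assumed raw Stripe export
  GROUP BY month
)
SELECT c.month,
       r.revenue,
       c.cloud_cost,
       c.cloud_cost / r.revenue AS cost_to_revenue_ratio
FROM monthly_cost c
JOIN monthly_revenue r USING (month)
ORDER BY c.month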

Moreover, you likely already have some of these pieces lying around at your company - a BI dashboard connected to a data warehouse, for example. That's several sales cycles, procurement, onboarding, and all the rest of the baggage associated with a new tool that you can skip, refocusing that time on your specific set of questions.

Lastly, if you go the path of technical ownership of your FinOps visibility effort, you will gain unparalleled subject matter expertise and true mastery over the data that forms the foundation of the entire practice of FinOps.

In closing

That's it for today. The next post will get into more specific examples of what's possible when you bring an open source tool like DBT to your FinOps data - we're going to implement the FOCUS spec in SQL.

Photo by Junseong Lee / Unsplash