Using dbt to create analytics for a WooCommerce site
We use a WordPress + WooCommerce setup for our website, which forces us to work with a ridiculously inefficient database schema for storing orders and inventory.
The annoying meta value schema
Orders are blog posts stored in the wp_posts
table, with attributes such as payment method, shipping address, and billing address stored in the wp_postmeta
table. Order line items are stored in a woocommerce_order_items
table, with their attributes stored in a woocommerce_order_itemmeta
table. Retrieving each attribute requires a separate join against the relevant post/line item meta table.
Not only is this extremely tedious to write, but it is also not very performant.
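To make the pattern concrete, fetching even a single order attribute means a join against the meta table. A minimal sketch (the wp_ table prefix, the _billing_email meta key, and the shop_order post type are standard WooCommerce defaults):

```sql
-- Look up one attribute (the billing email) for each order.
-- Every additional attribute needs another join just like this one.
SELECT wp.ID          AS order_id,
       wpm.meta_value AS billing_email
FROM wp_posts wp
LEFT JOIN wp_postmeta wpm
       ON wpm.post_id = wp.ID
      AND wpm.meta_key = '_billing_email'
WHERE wp.post_type = 'shop_order';
```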
Writing nicer queries with dbt macros
A lot of the queries end up looking like
SELECT wp.ID AS order_id,
wpm_order_number.meta_value AS order_number,
wpm_payment_method.meta_value AS payment_method,
wpm_billing_email.meta_value AS billing_email
FROM wp_posts wp
LEFT JOIN wp_postmeta wpm_payment_method ON wpm_payment_method.post_id = wp.ID
AND wpm_payment_method.meta_key = '_payment_method'
LEFT JOIN wp_postmeta wpm_order_number ON wpm_order_number.post_id = wp.ID
AND wpm_order_number.meta_key = '_order_number'
LEFT JOIN wp_postmeta wpm_billing_email ON wpm_billing_email.post_id = wp.ID
AND wpm_billing_email.meta_key = '_billing_email'
dbt makes it possible to write jinja macros to make the SQL writing a bit more pleasant.
{% macro wp_meta_join(name, src) %}
LEFT JOIN {{source(var('source'), 'wp_postmeta')}} wpm_{{name}}
ON {{src}} = wpm_{{name}}.post_id
AND wpm_{{name}}.meta_key = '_{{name}}'
{% endmacro %}
SELECT wp.`ID` AS order_id,
wpm_order_number.meta_value AS order_number,
wpm_payment_method.meta_value AS payment_method,
wpm_billing_email.meta_value AS billing_email
FROM {{source(var('source'), 'wp_posts')}} wp
{{ wp_meta_join('payment_method', 'wp.`ID`') }}
{{ wp_meta_join('order_number', 'wp.`ID`') }}
{{ wp_meta_join('billing_email', 'wp.`ID`') }}
Furthermore, we can write helper macros to deal with the very lenient wp_postmeta
schema (since everything has to be stored as a string in meta_value
):
{% macro mv_decimal(col) %}
cast(coalesce(nullif({{ col }}.meta_value, ''), 0) as decimal(12, 2))
{% endmacro %}
which allows us to write
SELECT
{{mv_decimal('wpm_order_total')}} AS order_total,
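For reference, given a meta join aliased wpm_order_total, the macro above renders to something like:

```sql
cast(coalesce(nullif(wpm_order_total.meta_value, ''), 0) as decimal(12, 2)) AS order_total
```

The coalesce/nullif pair is what absorbs the empty strings that WooCommerce stores where a numeric value is missing.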
An early PostgreSQL datalake
An earlier version of the datalake had us syncing our MySQL database into a PostgreSQL database: first using Fivetran, and then, when that proved too expensive because wp_postmeta
drove absurdly high monthly active row counts, using pg_chameleon running on AWS ECS (more about that setup in another post). This allowed us to make use of the more advanced tooling for views and index management that PostgreSQL offers compared to MySQL 5.7.
A big problem, however, was that the sync kept breaking. pg_chameleon tries very hard to extract constraints from the MySQL schema and translate them into PostgreSQL constraints. But MySQL doesn't really enforce these constraints, while PostgreSQL does, so dirty rows would regularly fail to sync, which was hard to debug.
Ultimately, the datalake was abandoned and we did most of our analytics with slow and verbose Redash queries.
2 years later, I checked back in with dbt, and noticed that dbt-mysql
seemed robust enough to give it another spin.
Migrating a dbt setup from PostgreSQL to MySQL
Migrating the dbt models from PostgreSQL to MySQL was actually easier than feared. Because the models already made heavy use of macros, we only had to adjust some of the typecasting macros mentioned above and barely had to change the models themselves.
One big change was MySQL 5.7's lack of support for WITH
clauses (CTEs), which required the creation of intermediate models. Thanks to the dependency resolution that dbt does, this was just a matter of moving each WITH clause into its own model file. The added advantage is that I can now reuse the new models for queries down the road and run tests on them to ensure they are valid (in conjunction with the already written tests on the resulting model).
What used to be:
WITH line_items AS (
SELECT wwoi.order_item_id,
wwoi.order_id,
wwoi.order_item_name AS name,
{{ product_or_variation_id('wwoim_product_id.meta_value',
'wwoim_variation_id.meta_value') }} AS product_id,
...
FROM {{source('prod', 'wp_woocommerce_order_items')}} wwoi
{{ order_item_meta_join('product_id') }}
{{ order_item_meta_join('variation_id') }}
...
LEFT JOIN {{source('prod', 'wp_posts')}} wp
ON wp."ID" = wwoi.order_id
AND wp.post_status = 'wc-completed'
ORDER BY wwoi.order_id DESC, wwoi.order_item_id)
SELECT oli.*,
p.current_price,
...
FROM {{ref('line_items')}} oli
LEFT JOIN {{ ref('products')}} p ON oli.product_id = p.product_id
LEFT JOIN {{ ref('orders')}} o ON oli.order_id = o.order_id
is now:
SELECT oli.*,
p.current_price,
...
FROM {{ref('line_items')}} oli
LEFT JOIN {{ ref('products')}} p ON oli.product_id = p.product_id
LEFT JOIN {{ ref('orders')}} o ON oli.order_id = o.order_id
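The extracted CTE itself becomes a standalone model, reusing the body shown above. A sketch of what the resulting file might look like (the models/line_items.sql path is an assumption; the ID quoting switches to MySQL backticks):

```sql
-- models/line_items.sql (hypothetical path): the former CTE as its own model
SELECT wwoi.order_item_id,
       wwoi.order_id,
       wwoi.order_item_name AS name,
       {{ product_or_variation_id('wwoim_product_id.meta_value',
                                  'wwoim_variation_id.meta_value') }} AS product_id
FROM {{ source('prod', 'wp_woocommerce_order_items') }} wwoi
{{ order_item_meta_join('product_id') }}
{{ order_item_meta_join('variation_id') }}
LEFT JOIN {{ source('prod', 'wp_posts') }} wp
       ON wp.`ID` = wwoi.order_id
      AND wp.post_status = 'wc-completed'
```

Once it is its own model, dbt's {{ ref('line_items') }} takes care of building it before anything that depends on it.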
Performance issues
What cost us the most time, however, was getting the MySQL performance up to speed.
The core issue is that WordPress doesn't create a multicolumn index on wp_postmeta
that allows one to do an efficient “double-field join”:
{% macro wp_meta_join(name, src) %}
LEFT JOIN {{source(var('source'), 'wp_postmeta')}} wpm_{{name}}
ON {{src}} = wpm_{{name}}.post_id
AND wpm_{{name}}.meta_key = '_{{name}}'
{% endmacro %}
To speed this operation up, we had to create the following indexes:
CREATE INDEX `bk1` ON `ttc_prod`.wp_postmeta (`post_id`,`meta_key`);
CREATE INDEX `bk1` ON `ttc_prod`.wp_woocommerce_order_itemmeta (`order_item_id`,`meta_key`);
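To confirm MySQL actually picks up the new index, running EXPLAIN on one of the meta lookups should show bk1 in the key column. A sketch (the post_id value is a made-up example, and the exact plan depends on table statistics):

```sql
-- The `key` column of the plan should now list `bk1` instead of the
-- single-column post_id index that WordPress ships with.
EXPLAIN
SELECT wpm.meta_value
FROM `ttc_prod`.wp_postmeta wpm
WHERE wpm.post_id = 42
  AND wpm.meta_key = '_billing_email';
```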
There is a dedicated WordPress plugin for this as well, Index WP MySQL for Speed, which Oliver Jones pointed out to me on Mastodon. He also directed me to more information about performance and index creation for WordPress installations.
(Caveat: I haven't dug too deep into MySQL performance, and the following is possibly wrong and/or naive).
The views that MySQL creates are unable to reuse multi-column indexes, which meant that most of our models needed manual index creation, which is only possible on an actual table. So we had to materialize most of our views (not really an issue ultimately, since we only have small amounts of data). From our dbt_project.yml
file:
# we need to materialize these to have decent indexes
customer_order_counts_per_year:
  materialized: table
customer_order_counts_total:
  materialized: table
customer_orders:
  materialized: table
line_items:
  materialized: table
Furthermore, I had to add post-hook
entries to the individual models to create the necessary indexes, again reusing helper macros to avoid having to write too much manual SQL:
{% macro mysql_add_text_index(this, column, len)%}
create index {{ this.name }}__index_on_{{ column }} on {{ this }} ({{ column }}({{len}}))
{% endmacro %}
{% macro mysql_add_index(this, column)%}
create index {{ this.name }}__index_on_{{ column }} on {{ this }} ({{ column }})
{% endmacro %}
{% macro mysql_add_uindex(this, column)%}
alter table {{ this }} add unique index {{ this.name }}__index_on_{{ column }} ({{ column }})
{% endmacro %}
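As an illustration, for a model named customer_orders (the model name here is just an example), the first helper renders to roughly:

```sql
create index customer_orders__index_on_billing_email
    on `ttc_prod`.customer_orders (billing_email(12))
```

The length prefix (12) is what makes indexing a TEXT column possible in MySQL, which cannot index unbounded text columns without one.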
{{
config({
"post-hook": [
"{{ mysql_add_text_index(this, 'billing_email', 12)}}",
],
})
}}
With these changes, most of our models could be computed quite rapidly.
However, one oddity remained. For one very specific materialized table (not even a big one), the model that consumed it would hang. I didn't have time to investigate this more closely, although I'm sure I would learn much about MySQL. I ended up adding a DO SLEEP(5)
to the pre-hook of that model. This is really just like embedded programming, if you ask me.
{{
config({
"pre-hook": [
"DO SLEEP(5)",
],
"post-hook": [
"{{ mysql_add_text_index(this, 'billing_email', 12) }}",
"{{ mysql_add_index(this, 'fiscal_year') }}",
"{{ mysql_add_index(this, 'previous_fiscal_year') }}",
]
})
}}
Conclusion
With these changes, I could move our analytics pipeline to run directly against our production DB, with no performance impact. WordPress is already quite the rapacious beast in terms of performance; one more query run twice daily with 18 joins didn't really register.