@mnl@hachyderm.io

Using dbt to create analytics for a WooCommerce site

We use a WordPress + WooCommerce setup for our website, which forces us to deal with a ridiculously inefficient database schema for storing orders and inventory.

The annoying meta value schema

Orders are blog posts stored in the wp_posts table, with attributes such as payment method, shipping address, and billing address stored in the wp_postmeta table. Order line items are stored in the woocommerce_order_items table, with their attributes stored in the woocommerce_order_itemmeta table. Fetching each attribute requires a join against the relevant post or line-item table.
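
To make this concrete, an order's attributes end up as generic key/value rows, one row per attribute (values invented for illustration):

SELECT post_id, meta_key, meta_value
FROM wp_postmeta
WHERE post_id = 123;

-- post_id | meta_key        | meta_value
-- 123     | _payment_method | stripe
-- 123     | _billing_email  | jane@example.com
-- 123     | _order_total    | 42.50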

Not only is this extremely tedious to write, but it is also not very performant.

Writing nicer queries with dbt macros

A lot of the queries end up looking like this:

SELECT wp.ID                          AS order_id,
       wpm_order_number.meta_value    AS order_number,
       wpm_payment_method.meta_value  AS payment_method,
       wpm_billing_email.meta_value   AS billing_email
FROM wp_posts wp
LEFT JOIN wp_postmeta wpm_payment_method
       ON wpm_payment_method.post_id = wp.ID
      AND wpm_payment_method.meta_key = '_payment_method'
LEFT JOIN wp_postmeta wpm_order_number
       ON wpm_order_number.post_id = wp.ID
      AND wpm_order_number.meta_key = '_order_number'
LEFT JOIN wp_postmeta wpm_billing_email
       ON wpm_billing_email.post_id = wp.ID
      AND wpm_billing_email.meta_key = '_billing_email'

dbt makes it possible to write Jinja macros that make writing this SQL a bit more pleasant.

{% macro wp_meta_join(name, src) %}
LEFT JOIN {{ source(var('source'), 'wp_postmeta') }} wpm_{{ name }}
       ON {{ src }} = wpm_{{ name }}.post_id
      AND wpm_{{ name }}.meta_key = '_{{ name }}'
{% endmacro %}
SELECT wp.`ID`                        AS order_id,
       wpm_order_number.meta_value    AS order_number,
       wpm_payment_method.meta_value  AS payment_method,
       wpm_billing_email.meta_value   AS billing_email
FROM {{ source(var('source'), 'wp_posts') }} wp
{{ wp_meta_join('payment_method', 'wp.`ID`') }}
{{ wp_meta_join('order_number', 'wp.`ID`') }}
{{ wp_meta_join('billing_email', 'wp.`ID`') }}

Furthermore, we can write helper macros to deal with the very lenient wp_postmeta schema (since everything has to be stored as a string in meta_value):

{% macro mv_decimal(col) %}
cast(coalesce(nullif({{ col }}.meta_value, ''), 0) as decimal(12, 2))
{% endmacro %}

which allows us to write

SELECT
    {{ mv_decimal('wpm_order_total') }} AS order_total,
    ...
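
Compiled, that line expands to the full cast expression:

SELECT
    cast(coalesce(nullif(wpm_order_total.meta_value, ''), 0) as decimal(12, 2)) AS order_total,
    ...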

An early PostgreSQL datalake

An earlier version of the datalake had us syncing our MySQL database into a PostgreSQL database: first using Fivetran, and then, when that proved too expensive (wp_postmeta drives absurdly high monthly active rows), using pg_chameleon running on AWS ECS (more about that setup in another post). This allowed us to make use of the more advanced tooling for views and index management that PostgreSQL offers compared to MySQL 5.7.

A big problem, however, was that the sync kept breaking. pg_chameleon tries very hard to extract constraints from the MySQL schema and translate them into PostgreSQL constraints. But MySQL doesn't actually enforce many of these constraints, while PostgreSQL does, so dirty rows regularly failed to make it across, and debugging that was painful.
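
A classic instance of this mismatch (a sketch, assuming a non-strict sql_mode and a hypothetical table): MySQL 5.7 happily stores zero dates that PostgreSQL rejects outright.

-- MySQL 5.7 with a non-strict sql_mode accepts this "zero date":
INSERT INTO orders_test (created_at) VALUES ('0000-00-00 00:00:00');

-- PostgreSQL refuses the same value with something like:
-- ERROR:  date/time field value out of range: "0000-00-00 00:00:00"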

Ultimately, the datalake was abandoned and we did most of our analytics with slow and verbose Redash queries.

Two years later, I checked back in with dbt and noticed that dbt-mysql seemed robust enough to give it another spin.

Migrating a dbt setup from PostgreSQL to MySQL

Migrating the dbt models from PostgreSQL to MySQL was actually easier than feared. Thanks to the heavy use of macros in the models, we only had to adjust some of the typecasting macros mentioned above and barely had to touch the models themselves.
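
For illustration, here is what that adjustment looks like for mv_decimal; the PostgreSQL flavor is a reconstruction, not the original code:

{# PostgreSQL flavor (reconstructed): uses the :: cast syntax #}
{% macro mv_decimal(col) %}
coalesce(nullif({{ col }}.meta_value, ''), '0')::decimal(12, 2)
{% endmacro %}

{# MySQL flavor, as shown above: portable CAST syntax #}
{% macro mv_decimal(col) %}
cast(coalesce(nullif({{ col }}.meta_value, ''), 0) as decimal(12, 2))
{% endmacro %}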

One big change was MySQL 5.7's lack of WITH clauses (CTEs only arrived in MySQL 8.0), which required the creation of intermediate models. Thanks to the dependency resolution that dbt does, this was just a matter of moving each WITH clause into its own model file. The added advantage is that I can now reuse the new models for queries down the road and run tests on them to ensure they are valid, in conjunction with the already written tests on the resulting model (a sketch of such a test follows the example below).

What used to be:

WITH line_items AS (
    SELECT wwoi.order_item_id,
           wwoi.order_id,
           wwoi.order_item_name AS name,
           {{ product_or_variation_id('wwoim_product_id.meta_value',
                                      'wwoim_variation_id.meta_value') }} AS product_id,
           ...
    FROM {{ source('prod', 'wp_woocommerce_order_items') }} wwoi
    {{ order_item_meta_join('product_id') }}
    {{ order_item_meta_join('variation_id') }}
    ...
    LEFT JOIN {{ source('prod', 'wp_posts') }} wp
           ON wp."ID" = wwoi.order_id
          AND wp.post_status = 'wc-completed'
    ORDER BY wwoi.order_id DESC, wwoi.order_item_id)

SELECT oli.*,
       p.current_price,
       ...
FROM line_items oli
LEFT JOIN {{ ref('products') }} p ON oli.product_id = p.product_id
LEFT JOIN {{ ref('orders') }} o ON oli.order_id = o.order_id

is now:

SELECT oli.*,
       p.current_price,
       ...
FROM {{ ref('line_items') }} oli
LEFT JOIN {{ ref('products') }} p ON oli.product_id = p.product_id
LEFT JOIN {{ ref('orders') }} o ON oli.order_id = o.order_id
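
Such a test on the new intermediate model could look something like this in its schema.yml (a minimal sketch; the exact tests are assumptions):

version: 2

models:
  - name: line_items
    columns:
      - name: order_item_id
        tests:
          - unique
          - not_null
      - name: order_id
        tests:
          - not_null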

Performance issues

What cost us the most time, however, was getting the MySQL performance up to speed.

The core issue is that WordPress doesn't create a multicolumn index on wp_postmeta that would allow an efficient “double-field join” like the one our macro generates:

{% macro wp_meta_join(name, src) %}
LEFT JOIN {{ source(var('source'), 'wp_postmeta') }} wpm_{{ name }}
       ON {{ src }} = wpm_{{ name }}.post_id
      AND wpm_{{ name }}.meta_key = '_{{ name }}'
{% endmacro %}
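
For reference, stock WordPress only ships single-column indexes on wp_postmeta (abbreviated from the core schema): MySQL can find a post's meta rows via the post_id index, but then has to scan them to filter on meta_key.

CREATE TABLE wp_postmeta (
  meta_id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  post_id bigint(20) unsigned NOT NULL DEFAULT '0',
  meta_key varchar(255) DEFAULT NULL,
  meta_value longtext,
  PRIMARY KEY (meta_id),
  KEY post_id (post_id),
  KEY meta_key (meta_key(191))
);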

To speed this operation up, we had to create the following indexes:

CREATE INDEX `bk1` ON `ttc_prod`.wp_postmeta (`post_id`,`meta_key`);
CREATE INDEX `bk1` ON `ttc_prod`.wp_woocommerce_order_itemmeta (`order_item_id`,`meta_key`);
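
To check that the new index actually gets picked up, you can EXPLAIN one of the generated joins; the key column of the output should list bk1 for the wp_postmeta lookup (query abbreviated):

EXPLAIN
SELECT wp.ID, wpm_billing_email.meta_value
FROM wp_posts wp
LEFT JOIN wp_postmeta wpm_billing_email
       ON wpm_billing_email.post_id = wp.ID
      AND wpm_billing_email.meta_key = '_billing_email';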

There is a dedicated WordPress plugin for this as well, Index WP MySQL for Speed, which Oliver Jones pointed out to me on Mastodon. He also directed me to more information about performance and index creation for WordPress installations.

(Caveat: I haven't dug too deeply into MySQL performance, and the following is possibly wrong and/or naive.)

The views that MySQL creates are unable to make use of multi-column indexes, which meant that most of our models needed manual index creation, and that is only possible on an actual table. We therefore had to materialize most of our views (not really an issue; ultimately, we only have small amounts of data). From our dbt_project.yml file:

      # we need to materialize these to have decent indexes
      customer_order_counts_per_year:
          materialized: table
      customer_order_counts_total:
          materialized: table
      customer_orders:
          materialized: table
      line_items:
          materialized: table

Furthermore, I had to add post-hook entries to the individual models to create the necessary indexes, again using helper macros to avoid writing too much manual SQL:

{# prefix index on a text column: MySQL requires a key length for TEXT columns #}
{% macro mysql_add_text_index(this, column, len) %}
create index {{ this.name }}__index_on_{{ column }} on {{ this }} ({{ column }}({{ len }}))
{% endmacro %}

{# plain single-column index #}
{% macro mysql_add_index(this, column) %}
create index {{ this.name }}__index_on_{{ column }} on {{ this }} ({{ column }})
{% endmacro %}

{# unique index, added via ALTER TABLE #}
{% macro mysql_add_uindex(this, column) %}
alter table {{ this }} add unique index {{ this.name }}__index_on_{{ column }} ({{ column }})
{% endmacro %}
{{
    config({
        "post-hook": [
            "{{ mysql_add_text_index(this, 'billing_email', 12) }}",
        ],
    })
}}
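
For a model named, say, customer_orders (model and schema names hypothetical here), that post-hook compiles to roughly:

create index customer_orders__index_on_billing_email
    on `analytics`.`customer_orders` (billing_email(12))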

With these changes, most of our models could be computed quite rapidly.

However, one oddity remained. For one very specific materialized table (not even a big one), the next model using it would hang. I didn't have time to investigate this more closely, although I'm sure I would learn much about MySQL. I ended up adding a DO SLEEP(5) to the pre-hook of that model. This is really just like embedded programming, if you ask me.

{{
    config({
        "pre-hook": [
            "DO SLEEP(5)",
        ],
        "post-hook": [
            "{{ mysql_add_text_index(this, 'billing_email', 12) }}",
            "{{ mysql_add_index(this, 'fiscal_year') }}",
            "{{ mysql_add_index(this, 'previous_fiscal_year') }}",
        ]
    })
}}

Conclusion

With these changes, I could move our analytics pipeline to run directly against our production DB with no noticeable performance impact. WordPress is already quite the rapacious beast in terms of performance; one more 18-join query run twice daily didn't really register.