Extending dbt#

As noted in data build tool (dbt), you can install your own custom dbt package to apply your own transforms to the event data in Aspects.

This guide demonstrates how to create and use a custom dbt package in Aspects by building the aspects-dbt-example repo. See aspects-dbt-example#1 to follow along with each of these steps.

See Building dbt packages and Best practice guides for dbt’s official documentation.

Step 0. Install dbt core#

The easiest way to install dbt core is to use pip in a python3 virtual environment.

See aspects-dbt requirements.txt for the specific package versions used by Aspects.

# Create and activate a python3 virtual environment
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip

# Install the dbt package versions used by Aspects
pip install dbt-clickhouse==1.7.2 dbt-core==1.7.0

See Install dbt for more ways to install dbt.

Step 1. Initialize your dbt package#

Create a new dbt package by following the prompts given by the dbt init tool.

dbt init
# Enter a name for your project (letters, digits, underscore): aspects_dbt_example
# Which database would you like to use? [1] clickhouse

ls aspects_dbt_example
# analyses/  dbt_project.yml  macros/  models/  README.md  seeds/  snapshots/  tests/

See About dbt init for more options.

Step 2. Use the Aspects profile#

Update the generated dbt_project.yml to use the aspects profile:

# This setting configures which "profile" dbt uses for this project.
profile: 'aspects'

Also set the location for compiled SQL to match that used by aspects:

# directory which will store compiled SQL files
target-path: "target"

Step 4. Add your custom transforms#

Here is where you will need an understanding of dbt, Clickhouse, Aspects’ data schemas, and the specific transforms you want to create.

If you need any python dependencies beyond what is provided by aspects-dbt, add these to a requirements.txt file at the top level of your repository.

Note

You can use Aspects to debug your custom SQL:

  1. Login to Superset as an Open edX superuser.

  2. Using the menus at the top of the page, navigate to the “SQL -> SQL Lab” UI.

  3. Browse the schemas and run read-only SQL queries on your data.

For this tutorial, we added a new model which will be materialized by dbt into a view in Clickhouse. Our new model calculates the average number of attempts made by users on each problem by referencing the int_problem_results model created by the base aspects-dbt package (see dbt ref):

select
    problem_id, AVG(attempts) as average_attempts
from
    (
        select
            problem_id,
            max(attempts) as attempts
        from
            {{ ref('int_problem_results') }}
        group by
            actor_id,
            problem_id
    )
group by problem_id

Next, make sure your model is configured, for example in the db_project.yml. If you forget this step, dbt will warn you when running your package.

models:
 problem_responses:
   # Config indicated by + and applies to all files under models/problem_responses/
   +materialized: view

See Configure dbt models to learn more.

Step 5. Add tests#

Writing tests for your transforms is important: not only can tests validate and document your intended changes, they can be used to guard against data edge cases and regressions from future code changes.

dbt generic tests are defined as SQL files, where the goal of the SQL statement is to return zero records.

Because our new average_attempts model aggregates on actor_id and problem_id, it should only have 1 entry for each problem_id. So our test can be:

-- average_attempts should only have one record for each problem_id.
select
    count(*) as num_rows
from
    {{ ref('average_attempts') }}
group by
    problem_id
having num_rows > 1

See Writing data tests for more examples.

Step 6. Install and use your dbt package#

Once you’ve pushed all the changes to your custom dbt package repo, now we’re ready to use it.

Use tutor config save to update the following Tutor variables to use your custom package instead of the Aspects default.

  • DBT_REPOSITORY: A git repository URL to clone and use as the dbt project.

    Set this to the URL for your custom dbt package.

    Default: https://github.com/openedx/aspects-dbt

  • DBT_BRANCH: The branch to use when cloning the dbt project.

    Set this to the hash/branch/tag of the custom dbt package that you wish to use.

    Default: varies between versions of Aspects.

  • EXTRA_DBT_PACKAGES: Add any python packages that your dbt project requires here.

    Default: []

  • DBT_PROFILE_*: variables used in the Aspects dbt/profiles.yml file, including several Clickhouse connection settings.

  • DBT_SSH_KEY: The private SSH key to use when cloning the dbt project. Only necessary if you are using a private repository.

Once your package is configured in Tutor, you can run dbt commands directly on your deployment.

See dbt commands for a full list of available commands.

# Build and test your package
tutor dev do dbt -c "build"

# Deploy your customizations
tutor dev do dbt -c "run"

# Run tests on the data
tutor dev do dbt -c "test"

Step 7. Troubleshooting#

You may need to repeat steps 4-6 a few times to resolve any warnings or errors that dbt reports with your package.

Don’t forget to push your changes to your repo before running the tutor dbt command: it fetches a clean copy of your configured package repo + branch each time it runs.

See dbt debugging and The missing guide to debug() in dbt for more information on how to debug issues with your package.

References#