2 posts tagged with "data"

· 3 min read
Henri Blancke

Managing change is never easy, especially when it comes to data. As a data engineer, you're responsible for ensuring that your dbt projects are up to date and accurate. But with data, attributes and columns constantly changing, how do you keep up? And, more importantly, how do you inform your data users? In this guide, we'll explore the challenges of lifecycle management in dbt projects, go over what works well and what could be better, and share some practical tips and tricks for navigating them.

Let's start off with a quick overview of what lifecycle management is.

What is Data Lifecycle Management?

So, what's the deal with data lifecycle management?

It's the process of overseeing the entire lifespan of data, from creation to deletion. Think of it as a roadmap for your data's journey — it starts with data creation, moves through various stages like storage and processing, and eventually reaches its destination, whether that's analysis or retirement.

In this article we'll mostly focus on change management and data deprecation, the aspects of lifecycle management that affect your end-users and data consumers. Data consumers are usually happy to see new data being added to your data model or warehouse but less happy when it unexpectedly changes or disappears. This is why it's important to manage change with effective communication and transparency.

Consequences of Poor Data Lifecycle Management

Failing to communicate how your data is changing can have a real impact on your business.

Inaccurate, inconsistent, or outdated data can lead to all sorts of problems, from flawed analyses to misguided decision-making. It's like trying to drive with a map that's missing half the roads — it's going to be a bumpy ride and not good for your business.

All of the above can lead to distrust in your data and the data team. This is why it's important to manage change effectively. If you're looking for an all-in-one solution to manage data docs, including effectively communicating change to your data users, check out refter.

Strategies for Managing Change in dbt Projects

Use dbt model versions

dbt supports model versions, which let you introduce a breaking change in a new version of a model while downstream consumers keep using the old version until they migrate. However, it lacks robust features for notifying end-users of these changes.

To learn more about dbt model versions, see the model versions page in the dbt documentation.

dbt model deprecation

dbt has a built-in deprecation feature that allows you to mark models as deprecated. It's a great way to communicate to your data users (mostly developers) that a model is going to be removed or changed in the future.

It's mostly intended to communicate that a model version will go away. Deprecation warnings only show up for dbt client users (developers) and not for end-users or stakeholders. That makes them almost invisible to the people who most need to know about the changes. There's also no way to attach additional information to the deprecation, such as a reason.

To add a deprecation warning to a model, add a deprecation_date key to your model properties file. Here's an example:

# web_sessions.yml
version: 2

models:
  - name: web_sessions
    description: Number of web sessions per day
    deprecation_date: 2024-02-12
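
Since these warnings only appear for dbt client users, one option is to read the deprecation dates from dbt's manifest and share them through another channel. The Python sketch below is an illustration only: it assumes the manifest is at `target/manifest.json` and that model nodes carry a `deprecation_date` field stored as an ISO 8601 string, so check both against your dbt version. `deprecated_models` is a hypothetical helper name.

```python
import json
import sys
from datetime import date
from pathlib import Path


def deprecated_models(manifest: dict) -> list[tuple[str, date]]:
    """Return (model name, deprecation date) pairs, soonest date first."""
    found = []
    for node in manifest.get("nodes", {}).values():
        raw = node.get("deprecation_date")
        if node.get("resource_type") != "model" or not raw:
            continue
        # Assumption: the value is an ISO 8601 date or datetime string,
        # so the first ten characters are YYYY-MM-DD.
        found.append((node["name"], date.fromisoformat(str(raw)[:10])))
    return sorted(found, key=lambda pair: pair[1])


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python list_deprecations.py target/manifest.json
    manifest = json.loads(Path(sys.argv[1]).read_text())
    for name, when in deprecated_models(manifest):
        status = "past" if when <= date.today() else "upcoming"
        print(f"{name}: deprecation date {when} ({status})")
```

The printed list could be pasted into a changelog or a team channel, which is still a manual step compared with a documentation site that shows deprecations directly.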

refter model deprecation

refter builds on dbt model deprecation by letting you attach more information, such as a reason. This lets you communicate changes to data users, stakeholders and developers alike, and the deprecation shows up on your refter documentation site.

Model deprecation

version: 2
models:
  - name: customers
    config:
      refter: # refter model properties
        deprecated:
          enabled: true
          reason: "We're moving to a new data model"
          date: 2024-02-12

Deprecated models will show up in the refter documentation data asset catalog, and more information is provided on the model detail page.

Screenshot: model deprecation in the refter UI

Screenshot: model deprecation details in the refter UI

Column deprecation

refter also allows you to deprecate columns. This is a great way to tell your data users and stakeholders that a column is going to be removed in the future and to point them to its replacement.

version: 2
models:
  - name: user_events
    columns:
      - name: id
        data_type: int # refter uses dbt data types
        tests:
          - unique
          - not_null
      - name: source
        data_type: int
        refter: # refter column properties
          deprecated:
            enabled: true
            reason: "This column is deprecated, use `order_id` instead"
            date: "2021-01-01"

Column deprecation will show up in the refter data documentation site on the asset/model detail page.

Screenshot: column deprecation details in the refter UI

Conclusion

dbt offers you some options, but they mostly target developers. refter builds on dbt model deprecation by letting you attach more information and communicate changes to data users, stakeholders and developers alike, right in your refter documentation site. This is a great way to manage change in your dbt projects and keep everyone who depends on your data informed.

If you're interested in trying out refter, sign up for a 30-day free trial.

· 6 min read
Henri Blancke

In this guide, we'll walk you through everything you need to know about data dictionaries: why they matter, how to build them, how to maintain them, and how to auto-generate and kick-start them for dbt projects. Whether you're a data architect, data engineer or data analyst, chances are you've had questions about a dataset that you wish were documented, or run into fields with ambiguous names or values and had to go find someone to explain them to you. Let's start clearing up that ambiguity and confusion today!

Understanding Data Dictionaries

What is a Data Dictionary?

Think of data dictionaries as your data's best friend, the ultimate tour guide through the maze of information.

A data dictionary is a comprehensive catalog that describes the various data elements in your dataset. It tells you what each piece of data means, where it comes from, and how it relates to other data elements.

In simple terms, it's there to provide context to end users and potentially other systems that consume your data.

Why is Data Documentation Important?

Documenting everything about your data may seem like a huge time commitment, and it can be, but we advise starting small.

Start by documenting what's most confusing, most important to your company or business and grow things out from there. Without proper documentation, your data becomes a messy pile of jumbled information, causing confusion, errors, and headaches for everyone involved.

Data documentation and dictionaries help data consumers navigate the complexity of the data, self-serve and feel more confident in the results of their analysis and research.

Dedicating some time up front to document your data, or making it a required part of your development process, can save you a ton of time in the future. It'll help data consumers understand and use your data correctly, and you'll get fewer questions. Stay tuned for some tips on how to make the whole process take less time with some handy dbt tools.

Building Your Data Dictionary

Start by identifying key data elements and documenting crucial information such as definitions, data types, and relationships. It's a team effort, but it can be done incrementally and iteratively. Consider inviting feedback, collaborating with everyone on your team who holds key knowledge, and picking up on common or recurring questions that you should start answering in your dictionary. Slowly but surely your data dictionary will grow and become a valuable asset for your team.

So, how do you go about building and maintaining your data dictionary? Good news, some things can be automated 🎉! Not everything though, but some things...

Use Tooling

dbt-invoke is a handy little utility that auto-generates properties.yml files for your dbt models. It'll help you generate a lot of the boilerplate and information you need to kick-start your data dictionary with a few simple commands. dbt-invoke uses the dbt manifest and your database's information schema to generate or update the properties.yml files. One caveat is that you have to run your dbt models before dbt-invoke can identify all information about your schemas and tables.

Every time you run it, dbt-invoke retains the information already present in your properties files, including data types, descriptions, tests, etc. It also adds entries for columns that were introduced during development and removes entries for columns that no longer exist. It's a great way to not only kick-start but also maintain your data dictionary.
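
The "keep what's documented, add what's new, drop what's gone" behaviour can be pictured with a small Python sketch. This is not dbt-invoke's actual implementation; the function `sync_columns` and its inputs are hypothetical and only illustrate the idea.

```python
def sync_columns(documented: list[dict], actual: list[str]) -> list[dict]:
    """Merge documented column entries with the columns that exist right now.

    Hypothetical helper for illustration; not part of dbt-invoke.
    """
    by_name = {column["name"]: column for column in documented}
    # Follow the current column order: reuse existing entries, create an
    # empty stub for new columns, and drop entries for removed columns.
    return [by_name.get(name, {"name": name, "description": ""}) for name in actual]


documented = [
    {"name": "id", "description": "Primary key", "tests": ["unique", "not_null"]},
    {"name": "legacy_flag", "description": "No longer populated"},
]
synced = sync_columns(documented, ["id", "created_at"])
```

In this example `id` keeps its description and tests, `created_at` gets an empty stub to fill in, and `legacy_flag` disappears because the column no longer exists.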

Info: dbt-invoke is a community project and is not officially supported by dbt, but it's a great tool and we recommend you give it a try.

To get started with dbt-invoke, install it with pip:

pip install dbt-invoke

Then navigate to your dbt project and use the dbt-invoke client much like the dbt client; the two have very similar interfaces:

dbt-invoke properties.update --select model.to.select

dbt-invoke creates one properties file per model, which helps keep things modular and manageable. After auto-generating the files, start adding quality descriptions, tests, column descriptions, etc. This gives you a solid starting point for your data dictionary, from which you can iterate and improve.

One downside is that dbt-invoke doesn't support inferring data types from your database, so you'll have to add those manually. But it's a great way to get started and it'll save you a lot of time.

Tip: Start by documenting the models and columns that are most important or confusing to your data consumers. This will help you get the most value out of your data dictionary as quickly as possible.


Note: If you want to take it a step further, use refter to document entity relations right inside your dbt project using your newly created properties files. This'll further enrich your documentation and drastically improve usability of your data for your end-users. refter can also automate hosting and managing your documentation for the ultimate self-serve data experience.

Screenshot: refter entity relations


Collaboration is Key

Building a data dictionary is no solo mission. It's a team effort, a collaborative endeavor that requires input from everyone involved in the data ecosystem. So, don't be afraid to reach out to your fellow data enthusiasts for help and feedback.

Having your documentation version controlled is another great way to collaborate. Not only does it help monitor change over time but also invites others (even cross-team) to easily contribute to your documentation. Together, you'll create a data dictionary that's not only comprehensive but also tailored to the unique needs of your organization.

Leveraging Data Dictionaries for Data Understanding

The Power of Data Exploration

Ah, data exploration — the thrill of uncovering hidden insights buried within your dataset. With a trusty data dictionary by your side, data exploration becomes a breeze. Say goodbye to aimless wandering and hello to targeted discovery. Simply consult your data dictionary/documentation to gain a deeper understanding of your data's meaning and structure. It's like having a treasure map that leads you straight to the gold!

Making Informed Decisions

In today's fast-paced world, making informed decisions is more important than ever. Luckily, data dictionaries are here to help. By providing clear definitions and context for your data, data dictionaries empower you to make informed decisions with confidence. Whether you're crunching numbers, analyzing trends, or forecasting the future, a well-documented data dictionary is your secret weapon for success.

Best Practices for Using Data Dictionaries

Keep It Up-to-Date

A data dictionary is not a set-it-and-forget-it kind of deal. It's a living, breathing document that evolves with your data. So, make sure to keep it up-to-date with the latest information. As your data changes and grows, so too should your data dictionary.

Regularly review and revise your documentation to ensure accuracy and relevance. And don't forget dbt-invoke can help you there too! Run dbt-invoke after every update or change to keep your properties.yml files up-to-date. You could take it a step further and automate things with a CI/CD pipeline and set up some checks to make sure your documentation is always up-to-date. Check out our CI/CD guide for more information on how to automatically deploy your docs to refter.
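
One possible shape for such a check is a small script that fails the pipeline when models or columns lack a description. The Python sketch below is an illustration under assumptions: it reads dbt's manifest (passed as a path, typically `target/manifest.json`) and relies on model nodes exposing `description` and `columns` fields. `undocumented` is a hypothetical helper name.

```python
import json
import sys
from pathlib import Path


def undocumented(manifest: dict) -> list[str]:
    """List models and model columns whose description is empty."""
    missing = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        if not (node.get("description") or "").strip():
            missing.append(node["name"])
        for column in node.get("columns", {}).values():
            if not (column.get("description") or "").strip():
                missing.append(f"{node['name']}.{column['name']}")
    return sorted(missing)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_docs.py target/manifest.json
    gaps = undocumented(json.loads(Path(sys.argv[1]).read_text()))
    for item in gaps:
        print(f"missing description: {item}")
    sys.exit(1 if gaps else 0)
```

A non-zero exit code makes most CI systems mark the step as failed, so undocumented models get caught before they are merged. Teams that are just starting out may prefer to print the gaps without failing the build.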

Use Version Control

Version control is your best friend when it comes to data documentation. By keeping your data dictionary under version control, you can easily track changes, collaborate with others, and maintain both a clear history and a single source of truth for your documentation. Plus, if something goes awry, you can always roll back to a previous version. It's like a safety net for your data documentation.

Share and Collaborate

Don't be a data hoarder! Share your data docs far and wide with anyone who might benefit from it. Encourage collaboration and feedback from your fellow data enthusiasts. After all, the more eyes on your data dictionary, the better it becomes. So, spread the data love and watch as your data ecosystem flourishes.

Tailor or Curate for Your Audience

Not all data consumers are created equal. Some teams may be rooted in SQL and very hungry for data, others may need more guidance. Tailor your data dictionary to suit the unique needs of your audience.

  • Provide clear, concise explanations for beginners, and dive deep into the nitty-gritty details for the experts.
  • Provide visual aids: for example, create ERDs and visualize them as a graph, or show lineage and dependencies.

By curating your data dictionary for your audience and managing who can see what, you'll ensure that everyone can make the most of your data.

The Future is Bright

As we gaze into the crystal ball of data documentation, the refter team sees a future filled with innovation and possibility. As we build out our data catalog/documentation tool, we are thinking about automated documentation tools and AI-driven metadata management, and these are just a taste of what's to come. The future of data dictionaries is brighter than ever, and we can't wait to see where it takes us.

Conclusion

Congratulations, data enthusiasts! You've unlocked the power of your data with the ultimate guide to building and using data dictionaries. Armed with this newfound knowledge, you're ready to take on the data world like never before. So, go forth and document your data with confidence. Your team is going to love you for it.

And remember, if you ever need a helping hand or a friendly nudge in the right direction, don't hesitate to reach out. We're all in this together, on a quest for data greatness. So, until next time, happy data documenting!