Try using dbtplyr

293 Views

December 06, 24

#dbt #dplyr #SQL #データモデリング #データエンジニアリング

スライド概要

#Japan.R 2024 のLT資料の英語翻訳バージョン
https://japanr.connpass.com/event/334154/

小坪琢人

@airspace_nobo

スライド一覧

Data Scientist

またはPlayer版

埋め込む »CMSなどでJSが使えない場合

ダウンロード

関連スライド

dbtplyr を使ってみる

小坪琢人 10.9K

猫でも分かるUnreal Engineの学び方 - 超初心者向け編 - 2023 v1.0

ue4 ue5 ue-beginner

エピックゲームズジャパン 1.5M

Unreal Engine5 Lumenの仕組みと肝心なところ

ue5 ue-rendering ue-lumen

エピックゲームズジャパン 1.3M

Meta XR SDK(V66-74)でQuestアプリを開発

spatial anchor unity quest pro shaperecognizeractivatestate oculus integration transformfeaturestateprovider building blocks transformrecognizeractivestate ovrsemanticclassification jointdeltaprovider ovrscenemanager jointvelocityactivestate オクルージョン sequenceactivestate scene manager ambisonic depth api metaxraudiosource playerlocomotor meta xr sdk quest3 ovrplayercontroller マルチモーダル meta haptics studio direct touch ui meta xr haptics sdk ovrspatialanchor ovrtrackedkeyboard hapticclipplayer fingerfeaturestateprovider hapticclip ワイドモーションモード wmm mruk mr utility kit voice sdk jointrotationactivestate meta horizon os ui set asw application spacewarp ovr metrics tool unityscene manager colocation discovery コロケーション mx ink passthrough camera api hand tracking microgestures webcamtexturemanager passthroughcamerautils cameraviewermanager hand pose selector recorder

あうぜん 1M

猫でも分かる UE5.0, 5.1 におけるアニメーションの新機能について【CEDEC+KYUSHU 2022】

ue5 cedec+kyushu ue-animation ue-optimize ue-bp ue-physics ue-sequencer

エピックゲームズジャパン 0.9M

各ページのテキスト

Try using dbtplyr 2024/12/07 Japan.R 2024 (#JapanR) Takuto Kotsubo (@airspace_nobo) It has been automatically translated. Sorry if there are any inaccurate expressions.

https://x.com/airspace_nobo

Self Introduction Have been working as a data scientist for ﬁve years now ❏ Recently working in Analytics Engineering team to improve data modeling ❏ Participating in today's event as a staff member of the venue Hobbies ❏ Watching comedy, Futsal

https://speakerdeck.com/kazaneya/20240116

Memories Past presentations at Tokyo. ❏ Use R from python @Tokyo. ❏ How to predict the wind direction of tomorrow @Tokyo. This is the second time since Tokyo.R #71 that I have given a presentation in front of Hadley wickham.

To the main topic. ❏ Introducing a package that incorporates dplyr notation into a modern data pipeline tool called dbt, which I have recently started using in my work. I think there are many people who have never seen dbt before, so I hope you can understand what it can do!

What we want to tell you today ❏ dbt SQL can be written in a dplyr-like way ❏ Combined with Dbt , preprocessing written in R may be replaced with SQL? ❏ dplyr is still amazing.There are endless possibilities.

Presentation Flow ❏ Introduction to data modeling ❏ Introduction to dbt ❏ What can dbt solve? ❏ What can dbt solve when writing SQL? ❏ What can be solved with macro ❏ What can be solved with dbtplyr ❏ Summary

What is Data Modeling? Data modeling is the process of designing and building logical structures to organize data and make it easier to analyze ❏ Improve the speed and reduce the effort of analysis and querying. Centralize management of intermediate processes Example of a not-so-good situation ❏ All users are individually performing pre-processing->totalization->visualization, etc. on information similar to raw data.

What is Data Modeling? Improved condition ❏ Pre-processing common to all users is centralized ❏ No need to follow secret sauce-like information when querying Use here raw data Preprocessi ng/Transfor mation Clean Tables

What is dbt? Code that combines SQL and Jinja, a template language used in Python ❏ Easy to implement data modeling and full documentation and visualization of the process ❏ There is a function called "macro", which allows for processing that would be difﬁcult to perform using SQL alone.

10.

What is dbt? ❏ ❏ When you create the query shown on the left as stg_orders.sql and run dbt, the appropriate DB and table are referenced from the conﬁguration. From the query, you can visualize the dependency of the process "stg_orders was created by referencing raw_orders". select id as order_id, user_id as customer_id, order_date, status from {{ source('main', 'raw_orders') }} select id as order_id, user_id as customer_id, order_date, status from "tutorial"."main"."raw_orders"

11.

What is dbt? Preprocessing and joins of other tables in the same PROJECT can make a nice relationship chart. ❏ ❏ Convert raw_payments to stg_payments in the same way Create a table named orders by joining stg_orders and stg_payments

12.

What can dbt solve? With dbt, you can easily implement data modeling based on SQL. ❏ On the other hand, dbt has a feature called "macro" for processing that is difﬁcult to do using SQL alone. ❏ dbtplyr is a package that allows you to use dplyr-like notation in SQL.

13.

[beta]

Concerns when writing SQL
❏ So it is necessary to write repeatedly when processing multiple
columns in a regular manner, since there is no function.
❏ Repetition also increases the possibility of typos (e.g., mistakenly
writing "num" as "nu").
select
avg(num_a) as avg_num_a,
avg(num_b) as avg_num_b,
avg(num_c) as avg_nu_c,
from
data

data |>
summarise(across(starts_with('num'),
~mean(.x, na.rm=TRUE),
.names = "avg_{.col}"))

14.

What can be solved with macro ❏ When you want to aggregate the data horizontally for each payment_method, you need to write it repeatedly in SQL. with order_payments as ( select order_id, sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount, sum(case when payment_method = 'coupon' then amount else 0 end) as coupon_amount, sum(case when payment_method = 'bank_transfer' then amount else 0 end) as bank_transfer_amount, sum(case when payment_method = 'gift_card' then amount else 0 end) as gift_card_amount, sum(amount) as total_amount

15.

What can be solved with macro ❏ dbt's macro can be used to generalize for loop and processing {% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %} with order_payments as ( select order_id, {% for payment_method in payment_methods -%} sum(case when payment_method = '{{ payment_method }}' then amount else 0 end) as {{ payment_method }}_amount, {% endfor -%} sum(amount) as total_amount

16.

What can be solved with macro ❏ macro as a function and can be used in other queries (e.g. macro/util.sql) {% macro get_payment_methods() %} ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] {% endmacro %} {% macro calculate_method_amounts(payment_methods, amount_column='amount', payment_method_column='payment_method') %} {% for payment_method in payment_methods -%} sum(case when {{ payment_method_column }} = '{{ payment_method }}' then {{ amount_column }} else 0 end) as {{ payment_method }}_amount, {% endfor -%} {% endmacro %}

17.

What dbtplyr can solve ❏ dbtplyr is a package designed to implement dplyr-like notation in dbt macro. ❏ A package developed by Emily Riederer , whose X proﬁle also conveys her love of R "Three R's in my last name, but it's not enough #rstats for me!"

https://x.com/emilyriederer

18.

What dbtplyr can solve ❏ Use across to summarize recurring totals {% set cols = dbtplyr.get_column_names( ref('customers') ) %} {% set cols_num = dbtplyr.starts_with('num_', cols) %} select customer_id, {{ dbtplyr.across(cols_num, "sum({{var}}) as {{var}}_tot") }} from {{ ref('customers') }} group by 1 select customer_id, sum(num_total_orders) as num_total_orders_tot, sum(num_total_amount) as num_total_amount_tot from "tutorial"."main"."customers" group by 1

19.

What dbtplyr can solve ❏ Use across to summarize repetitive conversions {% set cols = dbtplyr.get_column_names( ref('customers') ) %} {% set cols_date = dbtplyr.starts_with('date_', cols) %} select customer_id, {{ dbtplyr.across(cols_date, "cast({{var}} as timestamp) as {{var}}")}} from {{ ref('customers') }} select customer_id, cast(date_first_order as date) as date_first_order, cast(date_most_recent_order as date) as date_most_recent_order from "tutorial"."main"."customers"

20.

[beta]

What dbtplyr can solve
❏ Can be used in combination with for loop for where condition
{% set cols_date =
dbtplyr.starts_with('date_', cols) %}
...
where
{% for col in cols_date -%}
{{col}} >= cast('2018-02-01' as date)
or
{% endfor %}
FALSE

...
where
date_first_order >=
cast('2018-02-01' as date) or
date_most_recent_order >=
cast('2018-02-01' as date) or
FALSE

21.

SUMMARY ❏ The use of Data Engineering (dbt) × Data Science (dplyr) was introduced. ❏ The process used in SQL is simple and convenient, so dplyr may become popular in the dbt world. ❏ I will write a bit more about it in dbt Advent Calendar 2024 including the dbt installation process!

https://qiita.com/advent-calendar/2024/dbt

22.

bibliography dbtplyr: Bringing Column-Name Contracts from R to dbt posit::conf(2023) What, exactly, is dbt? Apply a function (or functions) across multiple columns

23.

Appendix: Source Code ❏ https://github.com/takuto0831/japanr_dbtplyr

https://github.com/takuto0831/japanr_dbtplyr

24.

appendix: targets package ❏ Rが生産性を高める〜データ分析ワークフロー効率化の実践 and others have discussed the data pipeline tool called targets. ❏ Data pipeline + visualization with ggplot and other outputs can be incorporated into the pipeline.

https://gihyo.jp/book/2022/978-4-297-12524-0