1. Overview
dplyr package
is the most powerful package for data handling in R, and it has also the ability of working with databases(See Vignette).
But the functionalities of dealing with databases in
dplyr
is begin developped even now.Now, I'm trying to make
dplyr
with databases more comfortable by some functions.For that purpose, I've created
dplyrr
package.
dplyrr
has below functions:
load_tbls()
: Easy to load table objects for all tables in a database.cut()
inmutate()
: Easy to create a case statement by using the grammar like thebase::cut()
.count_if()
andn_if()
insummarise()
: Shortcut to count rows that a condition is satisfied.filter()
: Improvedfilter()
fortbl_sql
which adds parentheses appropriately.moving_mean()
inmutate()
: Compute moving average for PostgreSQL.moving_max()
inmutate()
: Compute moving max for PostgreSQL.moving_min()
inmutate()
: Compute moving min for PostgreSQL.moving_sum()
inmutate()
: Compute moving sum for PostgreSQL.first_value()
inmutate()
: Compute first value for PostgreSQL.
2. How to install
The source code fordplyrr
package is available on GitHub at
You can install the pakage from there.
install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("hoxo-m/dplyrr")
3. Common functions for all databases
For illustration, we use a database file: "my_db.sqlite3".If you want to trace the codes below, you should create the databese file at first.
library(dplyrr)
library(nycflights13)
db <- src_sqlite("my_db.sqlite3", create = TRUE)
copy_nycflights13(db)
3-1. load_tbls()
Usually, when we use a database withdplyr
, we first create database object, and we can see the tables in the databese by show()
.
library(dplyrr)
# Create database object
db <- src_sqlite("my_db.sqlite3")
show(db)
## src: sqlite 3.8.6 [my_db.sqlite3] ## tbls: airlines, airports, flights, planes, sqlite_stat1, weatherNext, we create table objects for pulling data in some tables in the database.
airlines_tbl <- tbl(db, "airlines")
airports_tbl <- tbl(db, "airports")
flights_tbl <- tbl(db, "flights")
planes_tbl <- tbl(db, "planes")
weather_tbl <- tbl(db, "weather")
Typing this code is really a bore!
If you want to create table objects for all tables in the database, you can use
load_tbls()
.
load_tbls(db)
## Loading: airlines_tbl ## Loading: airports_tbl ## Loading: flights_tbl ## Loading: planes_tbl ## Loading: sqlite_stat1_tbl ## Loading: weather_tblCheck the created table objects.
ls(pattern = "_tbl$")
## [1] "airlines_tbl" "airports_tbl" "flights_tbl" ## [4] "planes_tbl" "sqlite_stat1_tbl" "weather_tbl"
glimpse(airlines_tbl)
## Observations: 16 ## Variables: ## $ carrier (chr) "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", ... ## $ name (chr) "Endeavor Air Inc.", "American Airlines Inc.", "Alaska...
3-2. cut() in mutate()
If you want to write case statement with likebase::cut()
, you can use cut()
function in mutate()
.
For example, there is
air_time
column in the database.
db <- src_sqlite("my_db.sqlite3")
flights_tbl <- tbl(db, "flights")
q <- flights_tbl %>% select(air_time)
air_time <- q %>% collect
head(air_time, 3)
## Source: local data frame [3 x 1] ## ## air_time ## 1 227 ## 2 227 ## 3 160If you want to group the
air_time
by break points c(0, 80, 120, 190, 900)
, you think you must write the next code.
q <- flights_tbl %>%
select(air_time) %>%
mutate(air_time_cut = if(air_time > 0 && air_time <= 80) "(0,80]"
else if(air_time > 80 && air_time <= 120) "(80,120]"
else if(air_time > 120 && air_time <= 190) "(120,190]"
else if(air_time > 190 && air_time <= 900) "(190,900]")
air_time_with_cut <- q %>% collect
head(air_time_with_cut, 3)
## Source: local data frame [3 x 2] ## ## air_time air_time_cut ## 1 227 (190,900] ## 2 227 (190,900] ## 3 160 (120,190]When the break points increase, you are going to be tired to write more lines.
By using
cut()
function in mutate()
, it can become easy.
q <- flights_tbl %>%
select(air_time) %>%
mutate(air_time_cut = cut(air_time, breaks=c(0, 80, 120, 190, 900)))
air_time_with_cut <- q %>% collect
head(air_time_with_cut, 3)
## Source: local data frame [3 x 2] ## ## air_time air_time_cut ## 1 227 (190,900] ## 2 227 (190,900] ## 3 160 (120,190]The
cut()
in mutate()
has more arguments such as labels coming from base::cut()
.
cut(variable, breaks, labels, include.lowest, right, dig.lab)
For integer break points, specially you can indicate
labels="-"
.
q <- flights_tbl %>%
select(air_time) %>%
mutate(air_time_cut = cut(air_time, breaks=c(0, 80, 120, 190, 900), labels="-"))
air_time_with_cut <- q %>% collect
head(air_time_with_cut, 3)
## Source: local data frame [3 x 2] ## ## air_time air_time_cut ## 1 227 191-900 ## 2 227 191-900 ## 3 160 121-190
3-3. count_if() and n_if() in summarise()
When we want to count rows that condition is satisfied, we might write like this.q <- flights_tbl %>%
select(air_time) %>%
summarise(odd_airtime_rows = sum(if(air_time %% 2 == 1) 1L else 0L),
even_airtime_rows = sum(if(air_time %% 2 == 0) 1L else 0L),
total_rows=n())
q %>% collect
## Source: local data frame [1 x 3] ## ## odd_airtime_rows even_airtime_rows total_rows ## 1 164150 163196 336776The
count_if()
and n_if()
functions are a shortcut for it merely.
count_if(condition)
n_if(condition)
q <- flights_tbl %>%
select(air_time) %>%
summarise(odd_airtime_rows = count_if(air_time %% 2 == 1),
even_airtime_rows = n_if(air_time %% 2 == 0),
total_rows=n())
q %>% collect
## Source: local data frame [1 x 3] ## ## odd_airtime_rows even_airtime_rows total_rows ## 1 164150 163196 336776Both functions do exactly the same thing.
3-4. Improved filter()
If you usedplyr
with databases in pure mind, you can encounter the unintended action like below.
library(dplyr)
db <- src_sqlite("my_db.sqlite3")
flights_tbl <- tbl(db, "flights")
q <- flights_tbl %>%
select(month, air_time) %>%
filter(month == 1) %>%
filter(air_time > 200 || air_time < 100)
q$query
## <Query> SELECT "month" AS "month", "air_time" AS "air_time" ## FROM "flights" ## WHERE "month" = 1.0 AND "air_time" > 200.0 OR "air_time" < 100.0 ## <SQLiteConnection>Did you expect the WHERE clause to be that?
If you use
dplyrr
, it becomes natural by adding parentheses.
library(dplyrr)
db <- src_sqlite("my_db.sqlite3")
flights_tbl <- tbl(db, "flights")
q <- flights_tbl %>%
select(month, air_time) %>%
filter(month == 1) %>%
filter(air_time > 200 || air_time < 100)
q$query
## <Query> SELECT "month" AS "month", "air_time" AS "air_time" ## FROM "flights" ## WHERE ("month" = 1.0) AND ("air_time" > 200.0 OR "air_time" < 100.0) ## <SQLiteConnection>
4. Functions for PostgreSQL
4-1. moving_**() in mutate()
dplyrr
has four moving_**()
functions that you can use in mutate()
.
moving_mean(variable, preceding, following)
moving_max(variable, preceding, following)
moving_min(variable, preceding, following)
moving_sum(variable, preceding, following)
preceding
and following
, you can omit following
.
For illustration, we use the test database that is PostgreSQL.
srcs <- temp_srcs("postgres")
df <- data.frame(x = 1:5)
tbls <- dplyr:::temp_load(srcs, list(df = df))
temp_tbl <- tbls$postgres$df
head(temp_tbl)
## x ## 1 1 ## 2 2 ## 3 3 ## 4 4 ## 5 5Compute moving average with 1 preceding and 1 following.
q <- temp_tbl %>%
mutate(y = moving_mean(x, 1))
q %>% collect
## Source: local data frame [5 x 2] ## ## x y ## 1 1 1.5 ## 2 2 2.0 ## 3 3 3.0 ## 4 4 4.0 ## 5 5 4.5Comfirm query.
q$query
## <Query> SELECT "x", "y" ## FROM (SELECT "x", avg("x") OVER (ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS "y" ## FROM "tlsqbjsuou") AS "_W1" ## <PostgreSQLConnection:(10316,0)>Compute moving mean with 1 preceding and 2 following.
q <- temp_tbl %>%
mutate(y = moving_mean(x, 1, 2))
q %>% collect
## Source: local data frame [5 x 2] ## ## x y ## 1 1 2.0 ## 2 2 2.5 ## 3 3 3.5 ## 4 4 4.0 ## 5 5 4.5Comfirm query.
q$query
## <Query> SELECT "x", "y" ## FROM (SELECT "x", avg("x") OVER (ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) AS "y" ## FROM "tlsqbjsuou") AS "_W2" ## PostgreSQLConnection:(10316,0)>Similary, you can use the other
moving_**()
functions.
4-2. first_value() in mutate()
dplyrr
has first_value()
function that you can use in mutate()
.
first_value(value, order_by)
value
and order_by
, you can omit order_by
.
For illustration, we use the test database that is PostgreSQL.
srcs <- temp_srcs("postgres")
df <- data.frame(class = c("A", "A", "B", "B", "C", "C"), x = 1:6, y = 6:1)
tbls <- dplyr:::temp_load(srcs, list(df=df))
temp_tbl <- tbls$postgres$df
head(temp_tbl)
## class x y ## 1 A 1 6 ## 2 A 2 5 ## 3 B 3 4 ## 4 B 4 3 ## 5 C 5 2 ## 6 C 6 1Get the first values of x partitioned by class and ordered by x.
q <- temp_tbl %>%
group_by(class) %>%
mutate(z = first_value(x))
q %>% collect
## Source: local data frame [6 x 4] ## Groups: class ## ## class x y z ## 1 A 1 6 1 ## 2 A 2 5 1 ## 3 B 3 4 3 ## 4 B 4 3 3 ## 5 C 5 2 5 ## 6 C 6 1 5See query.
q$query
## <Query> SELECT "class", "x", "y", "z" ## FROM (SELECT "class", "x", "y", first_value("x") OVER (PARTITION BY "class" ORDER BY "x") AS "z" ## FROM "slrhxfdvrt") AS "_W3" ## <PostgreSQLConnection:(10316,0)>Get the first values of x partitioned by class and ordered by y.
q <- temp_tbl %>%
group_by(class) %>%
mutate(z = first_value(x, y))
q %>% collect
## Source: local data frame [6 x 4] ## Groups: class ## ## class x y z ## 1 A 2 5 2 ## 2 A 1 6 2 ## 3 B 4 3 4 ## 4 B 3 4 4 ## 5 C 6 1 6 ## 6 C 5 2 6See query.
q$query
## <Query> SELECT "class", "x", "y", "z" ## FROM (SELECT "class", "x", "y", first_value("x") OVER (PARTITION BY "class" ORDER BY "y") AS "z" ## FROM "slrhxfdvrt") AS "_W4" ## <PostgreSQLConnection:(10316,0)>Get the first values of x partitioned by class and ordered by descent of y.
q <- temp_tbl %>%
group_by(class) %>%
mutate(z = first_value(x, desc(y)))
q %>% collect
## Source: local data frame [6 x 4] ## Groups: class ## ## class x y z ## 1 A 1 6 1 ## 2 A 2 5 1 ## 3 B 3 4 3 ## 4 B 4 3 3 ## 5 C 5 2 5 ## 6 C 6 1 5See query.
q$query
## <Query> SELECT "class", "x", "y", "z" ## FROM (SELECT "class", "x", "y", first_value("x") OVER (PARTITION BY "class" ORDER BY "y" DESC) AS "z" ## FROM "slrhxfdvrt") AS "_W5" ## <PostgreSQLConnection:(10316,0)>
5. Miscellaneous
update_dplyrr()
update_dplyrr()
is a shortcut/syntax sugar of
devtools::install_github("hoxo-m/dplyrr")
unload_dplyrr()
unload_dplyrr()
is a shortcut/syntax sugar ofdetach("package:dplyrr", unload = TRUE)
detach("package:dplyr", unload = TRUE)
hi, thanks for this! you might have a great answer: i'm curious why you've done all this work independently as an extension instead of sending Hadley pull requests on github and simply improving dplyr itself?
返信削除Yes, I sent PR. See https://github.com/hadley/dplyr/pull/1225
削除