2016/12/24

New R Package magicfor - Magic Functions to Obtain Results from for Loops in R

What is inconvenience of for loops in R? It is that results you get will be gone away. So we have created a package to store the results automatically. To do it, you only need to cast one line spell magic_for(). In this text, we tell you about how to use the magic.

1. Overview

for() is one of the most popular functions in R. As you know, it is used to create loops.

For example, let’s calculate squared values for 1 to 3.

for (i in 1:3) {
  squared <- i ^ 2
  print(squared)
}
#> [1] 1
#> [1] 4
#> [1] 9

It is very easy.

However, it becomes too much hassle to change such codes to store printed results. You must prepare some containers with correct length for storing results and change print() to assignment statements.

result <- vector("numeric", 3) # prepare a container
for (i in 1:3) {
  squared <- i ^ 2
  result[i] <- squared         # change to assignment
}
result
#> [1] 1 4 9

Moreover, you may want to store results as a data.frame with iteration numbers.

result <- data.frame(matrix(nrow = 3, ncol = 2))
colnames(result) <- c("i", "squared")
for (i in 1:3) {
  squared <- i ^ 2
  result[i, 1] <- i
  result[i, 2] <- squared
}
result
#>   i squared
#> 1 1       1
#> 2 2       4
#> 3 3       9

What a bother!

In such or more troublesome situations like that you have to store many variables, codes will grow more complex.

The magicfor package makes to resolve the problem being kept readability.

You just add two lines before for loops. First, load the library. Second, call magic_for(). Notice that the main for loop is kept intact.

library(magicfor)  # Load library
magic_for(print)   # Call magic_for()

for (i in 1:3) {
  squared <- i ^ 2
  print(squared)
}
#> The loop is magicalized with print().
#> [1] 1
#> [1] 4
#> [1] 9

magic_for() takes a function name, and reconstructs for() to remember values passed to the specified function in for loops. We call it magicalization. Once you call magic_for(), as you just exectute for() as usual, the result will be stored in memory automatically.

Here, let’s use magic_result_as_vector() to access the stored values.

magic_result_as_vector()  # Get the result
#> [1] 1 4 9

This is one of the functions to obtain results from magicalized for loops, and means to take out the results as a vector.

Even if the number of observed variables increases, you can do it the same way.

magic_for(silent = TRUE)

for (i in 1:3) {
  squared <- i ^ 2
  cubed <- i ^ 3
  put(squared, cubed)
}

magic_result_as_dataframe()
#>   i squared cubed
#> 1 1       1     1
#> 2 2       4     8
#> 3 3       9    27

put() is the default function to store values in magicalized for loops. It allows to take any number of variables and can display them.

2. Installation

You can install the magicfor package from CRAN.

install.packages("magicfor")

The source code for magicfor package is available on GitHub at

3. Details

The magicfor package provides the functions as follows:

  • magic_for(): Magicalize for.
  • magic_free(): Cancel magicalization.
  • Get results:
    • magic_result(): as a list.
    • magic_result_as_vetor(): as a vector.
    • magic_result_as_dataframe(): as a data.frame.
  • put(): Display values.

In the following, we assume that the library is loaded to use the functions.

library(magicfor)

3.1 Basics

The main function magic_for() magicalize for loops. Magicalize means to change the behavior of for() to store values outputted via target functions.

magic_for()

for (i in 1:3) {
  squared <- i ^ 2
  put(squared)
}
#> The loop is magicalized with put().
#> squared: 1
#> squared: 4
#> squared: 9

The default target function is put(). It displays input values, for example:

x <- 1
put(x)
#> x: 1

You can take out stored values using magic_result_**() when for loops have finished.

magic_result_as_vector()
#> [1] 1 4 9

3.2 magic_for()

magic_for() has several options.

Specify the first argument func, you can change target functions.

magic_for(cat)

for (i in 1:3) {
  squared <- i ^ 2
  cat(squared, " ")
}
#> The loop is magicalized with cat().
#> 1  4  9

If progress = TRUE, show progress bar.

magic_for(progress = TRUE)

for (i in 1:3) {
  squared <- i ^ 2
  put(squared)
}
#> |=================================================================| 100%

If you set test a number, the iteration is limited to that number of times.

magic_for(test = 2)

for (i in 1:100) {
  squared <- i ^ 2
  put(squared)
}
#> The loop is magicalized with put().
#> squared: 1
#> squared: 4

If silent = TRUE, target function will be not executed but only the values will be stored.

If temp = TRUE, the effect of magicalization will be lost after once execution of for loop.

magic_for(temp = TRUE)
is_magicalized()
#> [1] TRUE

for (i in 1:3) {
  squared <- i ^ 2
  put(squared)
}
#> The loop is temporary magicalized with put().
#> squared: 1
#> squared: 4
#> squared: 9

is_magicalized()
#> [1] FALSE

3.3 magic_free()

You can use magic_free() to cancel magicalization.

magic_for()
is_magicalized()
#> [1] TRUE

magic_free()
is_magicalized()
#> [1] FALSE

The function also clear the stored values.

magic_for(silent = TRUE)

for (i in 1:3) {
  squared <- i ^ 2
  put(squared)
}

magic_result_as_vector()
#> [1] 1 4 9

magic_free()
magic_result_as_vector()
#> NULL

3.4 magic_result_**()

You can use magic_result_**() to obtain results from magicalized for loops.

magic_for(silent = TRUE)

for (i in 1:3) {
  squared <- i ^ 2
  put(squared)
}

magic_result() returns results as a list.

magic_result()
#> $squared
#> $squared[[1]]
#> [1] 1
#> 
#> $squared[[2]]
#> [1] 4
#> 
#> $squared[[3]]
#> [1] 9

magic_result_as_vector() returns results as a vector.

magic_result_as_vector()
#> [1] 1 4 9

magic_result_as_dataframe() returns results as a data.frame.

magic_result_as_dataframe()
#>   i squared
#> 1 1       1
#> 2 2       4
#> 3 3       9

3.5 put()

put() displays input values with high flexibility.

x <- 2
y <- 3
put(x)
#> x: 2
put(x, y)
#> x: 2, y: 3
put(x, x ^ 2, x ^ 3)
#> x: 2, x^2: 4, x^3: 8
put(x, squared = x ^ 2, cubed = x ^ 3)
#> x: 2, squared: 4, cubed: 8

It is very useful for magicfor.

magic_for()

for (i in 1:3) {
  put(x = i, squared = i ^ 2, cubed = i ^ 3)
}
#> The loop is magicalized with put().
#> x: 1, squared: 1, cubed: 1
#> x: 2, squared: 4, cubed: 8
#> x: 3, squared: 9, cubed: 27

magic_result_as_dataframe(F)
#>   x squared cubed
#> 1 1       1     1
#> 2 2       4     8
#> 3 3       9    27

4. Miscellaneous

Whenever you put just variables in magicalized for loops, their values will be stored regardless of target functions.

magic_for()

for (i in 1:3) {
  squared <- i ^ 2
  squared
}
#> The loop is magicalized with put().

magic_result_as_vector()
#> [1] 1 4 9

When you write trarget functions inside of if statements without else, NA will be inserted to represent missing.

magic_for()

for (i in 1:3) {
  squared <- i ^ 2
  if(i == 3) put(squared)
}
#> The loop is magicalized with put().
#> squared: 9

magic_result_as_vector()
#> [1] NA NA  9

Target functions work only top level lines or inside of if statements in magicalized for loops. For example, it does not work inside nested for loops.

magic_for()

for (i in 1:2) {
  for (j in 1:2) {
    put(i, j, i * j)
  }
}
#> The loop is magicalized with put().
#> i: 1, j: 1, i*j: 1
#> i: 1, j: 2, i*j: 2
#> i: 2, j: 1, i*j: 2
#> i: 2, j: 2, i*j: 4

magic_result_as_vector()
#> list()

5. Bug Reports

  • https://github.com/hoxo-m/magicfor/issues

2016/08/18

githubinstall 0.1.0: New Feature for A Helpful Way to Install R Packages Hosted on GitHub

We have updated our githubinstall package. It is now on CRAN.

Basics

Using the package, you can install R packages hosted on GitHub without usernames.

library(githubinstall)
githubinstall("AnomalyDetection")
# It is same as devtools::install_github("twitter/AnomalyDetection")

We introduced the package in the previous entry.

You can install or update the package as follows.

install.packages("githubinstall")

New Feature

We have added a new feature to the new version of the package.

Now, you can install packages with specifying Git references (branch, tag, commit and pull request).

Developers are divided in policy to manage R packages on GitHub. If a package is going to be developed in "develop" branch, you may want to install the package from the branch.

gh_install_packages() has ref argument to specify Git references. For instance, you can install awaptools from the "develop" branch as follows:

githubinstall("awaptools", ref = "develop")

You may sometimes encounter failing to install packages because its repository HEAD is broken. In such case, you can specify a tag or commit to ref. In almost cases, tags are added on an unbroken commit. For instance, you can install densratio from the “v0.0.3” tag as follows:

githubinstall("densratio", ref = "v0.0.3")

Even if you cannot find such tags, you can install packages from any commit that is not broken. For instance, you can install densratio from the “e8233e6” commit as follows:

githubinstall("densratio", ref = "e8233e6")

Finally, you may find a patch for fixing bugs as a pull request. In such case, you can specify pull requests to ref using github_pull(). For instance, you can install dplyr from the pull request #2058 as follows:

githubinstall("dplyr", ref = github_pull("2058"))

Bug Fixes

We have fixed some bugs reported on Issues. It has detailed on NEWS.

If you find some bugs or need new features, we would appreciate reporting it.

2016/06/15

githubinstall - New R Package for Easy to Install R Packages on GitHub

1. Overview

A growing number of R packages are created by various people in the world. A part of the cause of it is the devtools package that makes it easy to develop R packages [1]. The devtools package not only facilitates the process to develop R packages but also provides an another way to distribute R packages.

When developers publish R packages, the CRAN [2] is commonly used. You can install the packages that are available on CRAN using install.package(). For example, you can install dplyr package as follows:

install.packages("dplyr")

The devtools package provides install_github() that enables installing packages from GitHub.

library(devtools)
install_github("hadley/dplyr")

Therefore, developers can distribute R packages that is developing on GitHub. Moreover, there are some developers that they have no intention to submit to CRAN. For instance, Twitter, Inc. provides AnomalyDetection package on GitHub but it will not be available on CRAN [3]. You can install such packages easily using devtools.

library(devtools)
install_github("twitter/AnomalyDetection")

There is a difference between install.packages() and install_github() in the required argument. install.packages() takes package names, while install_github() needs repository names. It means that when you want to install a package on GitHub you must remember its repository name correctly.

The trouble is that the usernames of GitHub are often hard to remember. Developers consider the package names so that users can understand the functionalities intuitively. However, they often decide username incautiously. For instance, ggfortify is a great package on GitHub, but who created it? What is the username? The answer is sinhrks [4]. It seems to be difficult to remember it.

The githubinstall package provides a way to install packages on GitHub by only the package names just like install.packages().

library(githubinstall)
githubinstall("AnomalyDetection")
Suggetion:
 - twitter/AnomalyDetection
Do you install the package? 

1: Yes (Install)
2: No (Cancel)

githubinstall() suggests the GitHub repository from package names, and asks whether you want to execute the installation.

Furthermore, you may succeed in installing packages from a faint memory because our package automatically correct its spelling by fuzzy string search.

githubinstall("AnomaryDetection")
githubinstall("AnomalyDetect")
githubinstall("anomaly-detection")

2. Installation

You can install the githubinstall package from CRAN.

install.packages("githubinstall")

The source code for githubinstall package is available on GitHub at

3. Details

The githubinstall package provides several useful functions.

  • githubinstall() or gh_install_packages()
  • gh_suggest()
  • gh_suggest_username()
  • gh_list_packages()
  • gh_search_packages()
  • gh_show_source()
  • gh_update_package_list()

The functions have common prefix gh. githubinstall() is an alias of gh_install_packages().

To use these functions, first, you should load the package as follows.

library(githubinstall)

3.1. Install Packages from GitHub

githubinstall() enables to install packages on GitHub by only package names.

githubinstall("AnomalyDetection")
Suggestion:
 - twitter/AnomalyDetection
Do you install the package? 

1: Yes (Install)
2: No (Cancel)

Selection: 

The function suggests GitHub repositories. If you type ‘1’ and ‘enter’, then installation of the package will begin. The suggestion is made of looking for the list of R packages on GitHub. The list is provided by Gepuro Task Views.

If multiple candidates are found, you can select one of them.

githubinstall("cats")
Select one repository or, hit 0 to cancel. 

1: amurali2/cats      cats
2: danielwilhelm/cats No description or website provided.
3: hilaryparker/cats  An R package for cat-related functions #rcatladies
4: lolibear/cats      No description or website provided.
5: rafalszota/cats    No description or website provided.
6: tahir275/cats      ff

Selection: 

githubinstall() is an alias of gh_install_packages().

gh_install_packages("AnomalyDetection")

3.2. Suggest Repositories

githubinstall() prompts you to install the suggested packages. But you may just want to know what will be suggestions.

gh_suggest() returns the suggested repository names as a vector.

gh_suggest("AnomalyDetection")
## [1] "twitter/AnomalyDetection"
gh_suggest("cats")
## [1] "amurali2/cats"       "danielwilhelm/cats"  "davidluizrusso/cats"
## [4] "hilaryparker/cats"   "lolibear/cats"       "rafalszota/cats"    
## [7] "tahir275/cats"

In addition, gh_suggest_username() is useful if you want to know usernames from a faint memory.

gh_suggest_username("hadly")
## [1] "hadley"
gh_suggest_username("yuhui")
## [1] "yihui"

3.3. List the Packages

gh_list_packages() returns the list of R package repositories on GitHub as data.frame.

For example, if you want to get the repositories that have been created by hadley, run the following.

hadleyverse <- gh_list_packages(username = "hadley")
head(hadleyverse)
##   username package_name                                              title
## 1   hadley   assertthat                     User friendly assertions for R
## 2   hadley    babynames An R package contain all baby names data from the 
## 3   hadley    bigrquery          An interface to Google's bigquery from R.
## 4   hadley     bookdown                                              Watch
## 5   hadley   clusterfly An R package for visualising high-dimensional clus
## 6   hadley      decumar                           An alternative to sweave

By using the result, you can install all packages created by hadley.

repos <- with(hadleyverse, paste(username, package_name, sep="/"))
githubinstall(repos) # I have not tried it

3.4. Search Packages by a Keyword

gh_search_packages() returns the list of R package repositories on GitHub that the titles contains a given keyword.

For example, if you want to search packages that are relevant to lasso, run the following.

gh_search_packages("lasso")
##           username     package_name                                  title
## 1  ChingChuan-Chen             milr  multiple-instance logistic regressi..
## 2       YaohuiZeng         biglasso  Big Lasso: Extending Lasso Model Fi..
## 3      huayingfang          CCLasso  CCLasso: Correlation Inference for ..
## 4         mlampros FeatureSelection  Feature Selection in R using glmnet..
## 5             pnnl        glmnetLRC  Lasso and Elastic-Net Logistic Regr..
## 6       statsmaths         genlasso  Path algorithm for generalized lass..
## 7       vincent-dk         logitsgl  Fit Logistic Regression with Multi-..
## 8       vincent-dk             lsgl  Linear Multiple Output Using Sparse..
## 9       vincent-dk             msgl  High Dimensional Multiclass Classif..
## 10      vstanislas             GGEE  R Package for the Group Lasso Gene-..
## 11          zdk123       BatchStARS  R package for Stability Approach to..
## 12          zdk123           pulsar  R package for Stability Approach to..

3.5. Show the Source Code of Functions on GitHub

gh_show_source() looks for the source code of a given function on GitHub, and tries to open the place on Web browser.

gh_show_source("mutate", "dplyr")

If you have loaded the package that the function belongs to, you can input the function directly.

library(dplyr)
gh_show_source(mutate)

This function may do not work well with Safari.

3.6. Update the List of R Packages

The githubinstall package uses Gepuro Task Views for getting the list of R packages on GitHub. Gepuro Task Views is crawling the GitHub and updates information every day. The package downloads the list of R packages from Gepuro Task Views each time it was loaded. Thus, you can always use the newest list of packages on a new R session.

However, you may use an R session for a long time. In such case, gh_update_package_list() is useful.

gh_update_package_list() updates the downloaded list of the R packages explicitly.

gh_update_package_list()

2016/04/01

densratio: New R Package for Density Ratio Estimation

1. Overview

Density ratio estimation is described as follows: for given two data samples $x$ and $y$ from unknown distributions $p(x)$ and $q(y)$ respectively, estimate $$ w(x) = \frac{p(x)}{q(x)} $$ where $x$ and $y$ are $d$-dimensional real numbers.

The estimated density ratio function $w(x)$ can be used in many applications such as the inlier-based outlier detection [1] and covariate shift adaptation [2]. Other useful applications about density ratio estimation were summarized by Sugiyama et al. (2012) [3].

The package densratio provides a function densratio() that returns a result has the function to estimate density ratio compute_density_ratio().

For example,

set.seed(3)
x <- rnorm(200, mean = 1, sd = 1/8)
y <- rnorm(200, mean = 1, sd = 1/2)

library(densratio)
result <- densratio(x, y)
result
## 
## Call:
## densratio(x = x, y = y, method = "uLSIF")
## 
## Kernel Information:
##   Kernel type:  Gaussian RBF 
##   Number of kernels:  100 
##   Bandwidth(sigma):  0.1 
##   Centers:  num [1:100, 1] 1.007 0.752 0.917 0.824 0.7 ...
## 
## Kernel Weights(alpha):
##   num [1:100] 0.4044 0.0479 0.1736 0.125 0.0597 ...
## 
## The Function to Estimate Density Ratio:
##   compute_density_ratio()

In this case, the true density ratio $w(x)$ is known, so we can compare $w(x)$ with the estimated density ratio $\hat{w}(x)$.

true_density_ratio <- function(x) dnorm(x, 1, 1/8) / dnorm(x, 1, 1/2)
estimated_density_ratio <- result$compute_density_ratio

plot(true_density_ratio, xlim=c(-1, 3), lwd=2, col="red", xlab = "x", ylab = "Density Ratio")
plot(estimated_density_ratio, xlim=c(-1, 3), lwd=2, col="green", add=TRUE)
legend("topright", legend=c(expression(w(x)), expression(hat(w)(x))), col=2:3, lty=1, lwd=2, pch=NA)

2. How to Install

You can install the densratio package from CRAN.

install.packages("densratio")

You can also install the package from GitHub.

install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("hoxo-m/densratio")

The source code for densratio package is available on GitHub at

3. Details

3.1. Basics

The package provides densratio() that the result has the function to estimate density ratio.

For data samples x and y,

library(densratio)

x <- rnorm(200, mean = 1, sd = 1/8)
y <- rnorm(200, mean = 1, sd = 1/2)

result <- densratio(x, y)

In this case, result$compute_density_ratio() can compute estimated density ratio.

w_hat <- result$compute_density_ratio(y)
plot(y, w_hat)

3.2. Methods

densratio() has method parameter that you can pass "uLSIF" or "KLIEP".

  • uLSIF (unconstrained Least-Squares Importance Fitting) is the default method. This algorithm estimates density ratio by minimizing the squared loss. You can find more information in Hido et al. (2011) [1].

  • KLIEP (Kullback-Leibler Importance Estimation Procedure) is the anothor method. This algorithm estimates density ratio by minimizing Kullback-Leibler divergence. You can find more information in Sugiyama et al. (2007) [2].

The both methods assume that the denity ratio is represented by linear model: $$ w(x) = \alpha_1 K(x, c_1) + \alpha_2 K(x, c_2) + ... + \alpha_b K(x, c_b) $$ where $$ K(x, c) = \exp\left(\frac{-\|x - c\|^2}{2 \sigma ^ 2}\right) $$ is the Gaussian RBF.

densratio() performs the two main jobs:

  • First, deciding kernel parameter $\sigma$ by cross validation,
  • Second, optimizing kernel weights $\alpha$.

As the result, you can obtain compute_density_ratio().

3.3. Result and Paremeter Settings

densratio() outputs the result like as follows:

## 
## Call:
## densratio(x = x, y = y, method = "uLSIF")
## 
## Kernel Information:
##   Kernel type:  Gaussian RBF 
##   Number of kernels:  100 
##   Bandwidth(sigma):  0.1 
##   Centers:  num [1:100, 1] 1.007 0.752 0.917 0.824 0.7 ...
## 
## Kernel Weights(alpha):
##   num [1:100] 0.4044 0.0479 0.1736 0.125 0.0597 ...
## 
## Regularization Parameter(lambda):  
## 
## The Function to Estimate Density Ratio:
##   compute_density_ratio()
  • Kernel type is fixed by Gaussian RBF.
  • The number of kernels is the number of kernels in the linear model. You can change by setting kernel_num parameter. In default, kernel_num = 100.
  • Bandwidth(sigma) is the Gaussian kernel bandwidth. In default, sigma = "auto", the algorithms automatically select the optimal value by cross validation. If you set sigma a number, that will be used. If you set a numeric vector, the algorithms select the optimal value in them by cross validation.
  • Centers are centers of Gaussian kernels in the linear model. These are selected at random from the data sample x underlying a numerator distribution p_nu(x). You can find the whole values in result$kernel_info$centers.
  • Kernel weights are alpha parameters in the linear model. It is optimaized by the algorithms. You can find the whole values in result$alpha.
  • The funtion to estimate density ratio is named compute_density_ratio().

4. Multi Dimensional Data Samples

In the above, the input data samples x and y were one dimensional. densratio() allows to input multidimensional data samples as matrix.

For example,

library(densratio)
library(mvtnorm)

set.seed(71)
x <- rmvnorm(300, mean = c(1, 1), sigma = diag(1/8, 2))
y <- rmvnorm(300, mean = c(1, 1), sigma = diag(1/2, 2))

result <- densratio(x, y)
result
## 
## Call:
## densratio(x = x, y = y, method = "uLSIF")
## 
## Kernel Information:
##   Kernel type:  Gaussian RBF 
##   Number of kernels:  100 
##   Bandwidth(sigma):  0.316 
##   Centers:  num [1:100, 1:2] 1.178 0.863 1.453 0.961 0.831 ...
## 
## Kernel Weights(alpha):
##   num [1:100] 0.145 0.128 0.138 0.187 0.303 ...
## 
## Regularization Parameter(lambda):  0.1 
## 
## The Function to Estimate Density Ratio:
##   compute_density_ratio()

Also in this case, we can compare the true density ratio with the estimated density ratio.

true_density_ratio <- function(x) {
  dmvnorm(x, mean = c(1, 1), sigma = diag(1/8, 2)) /
    dmvnorm(x, mean = c(1, 1), sigma = diag(1/2, 2))
}
estimated_density_ratio <- result$compute_density_ratio

N <- 20
range <- seq(0, 2, length.out = N)
input <- expand.grid(range, range)
z_true <- matrix(true_density_ratio(input), nrow = N)
z_hat <- matrix(estimated_density_ratio(input), nrow = N)

par(mfrow = c(1, 2))
contour(range, range, z_true, main = "True Density Ratio")
contour(range, range, z_hat, main = "Estimated Density Ratio")

The dimensions of x and y must be same.

5. References

[1] Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems 2011.

[2] Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P. & Kawanabe, M. Direct importance estimation with model selection and its application to covariate shift adaptation. NIPS 2007.

[3] Sugiyama, M., Suzuki, T. & Kanamori, T. Density Ratio Estimation in Machine Learning. Cambridge University Press 2012.

2015/08/04

New package "dplyrr" - Utilities for comfortable use of dplyr with databases

1. Overview

dplyr package is the most powerful package for data handling in R, and it has also the ability of working with databases(See Vignette).
But the functionalities of dealing with databases in dplyr is begin developped even now.
Now, I'm trying to make dplyr with databases more comfortable by some functions.
For that purpose, I've created dplyrr package.
dplyrr has below functions:
  • load_tbls() : Easy to load table objects for all tables in a database.
  • cut() in mutate() : Easy to create a case statement by using the grammar like the base::cut().
  • count_if() and n_if() in summarise() : Shortcut to count rows that a condition is satisfied.
  • filter() : Improved filter() for tbl_sql which adds parentheses appropriately.
  • moving_mean() in mutate() : Compute moving average for PostgreSQL.
  • moving_max() in mutate() : Compute moving max for PostgreSQL.
  • moving_min() in mutate() : Compute moving min for PostgreSQL.
  • moving_sum() in mutate() : Compute moving sum for PostgreSQL.
  • first_value() in mutate() : Compute first value for PostgreSQL.

2. How to install

The source code for dplyrr package is available on GitHub at
You can install the pakage from there.
install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("hoxo-m/dplyrr")

3. Common functions for all databases

For illustration, we use a database file: "my_db.sqlite3".
If you want to trace the codes below, you should create the databese file at first.
library(dplyrr)
library(nycflights13)

db <- src_sqlite("my_db.sqlite3", create = TRUE)
copy_nycflights13(db)

3-1. load_tbls()

Usually, when we use a database with dplyr, we first create database object, and we can see the tables in the databese by show().
library(dplyrr)
# Create database object
db <- src_sqlite("my_db.sqlite3")
show(db)
## src:  sqlite 3.8.6 [my_db.sqlite3]
## tbls: airlines, airports, flights, planes, sqlite_stat1, weather
Next, we create table objects for pulling data in some tables in the database.
airlines_tbl <- tbl(db, "airlines")
airports_tbl <- tbl(db, "airports")
flights_tbl <- tbl(db, "flights")
planes_tbl <- tbl(db, "planes")
weather_tbl <- tbl(db, "weather")
Typing this code is really a bore!
If you want to create table objects for all tables in the database, you can use load_tbls().
load_tbls(db)
## Loading: airlines_tbl
## Loading: airports_tbl
## Loading: flights_tbl
## Loading: planes_tbl
## Loading: sqlite_stat1_tbl
## Loading: weather_tbl
Check the created table objects.
ls(pattern = "_tbl$")
## [1] "airlines_tbl"     "airports_tbl"     "flights_tbl"     
## [4] "planes_tbl"       "sqlite_stat1_tbl" "weather_tbl"
glimpse(airlines_tbl)
## Observations: 16
## Variables:
## $ carrier (chr) "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", ...
## $ name    (chr) "Endeavor Air Inc.", "American Airlines Inc.", "Alaska...

3-2. cut() in mutate()

If you want to write case statement with like base::cut(), you can use cut() function in mutate().
For example, there is air_time column in the database.
db <- src_sqlite("my_db.sqlite3")
flights_tbl <- tbl(db, "flights")
q <- flights_tbl %>% select(air_time)
air_time <- q %>% collect
head(air_time, 3)
## Source: local data frame [3 x 1]
## 
##   air_time
## 1      227
## 2      227
## 3      160
If you want to group the air_time by break points c(0, 80, 120, 190, 900), you think you must write the next code.
q <- flights_tbl %>% 
  select(air_time) %>%
  mutate(air_time_cut = if(air_time > 0 && air_time <= 80) "(0,80]"
         else if(air_time > 80 && air_time <= 120) "(80,120]"
         else if(air_time > 120 && air_time <= 190) "(120,190]"
         else if(air_time > 190 && air_time <= 900) "(190,900]")
air_time_with_cut <- q %>% collect
head(air_time_with_cut, 3)
## Source: local data frame [3 x 2]
## 
##   air_time air_time_cut
## 1      227    (190,900]
## 2      227    (190,900]
## 3      160    (120,190]
When the break points increase, you are going to be tired to write more lines.
By using cut() function in mutate(), it can become easy.
q <- flights_tbl %>% 
  select(air_time) %>%
  mutate(air_time_cut = cut(air_time, breaks=c(0, 80, 120, 190, 900)))
air_time_with_cut <- q %>% collect
head(air_time_with_cut, 3)
## Source: local data frame [3 x 2]
## 
##   air_time air_time_cut
## 1      227    (190,900]
## 2      227    (190,900]
## 3      160    (120,190]
The cut() in mutate() has more arguments such as labels coming from base::cut().
  • cut(variable, breaks, labels, include.lowest, right, dig.lab)

For integer break points, specially you can indicate labels="-".
q <- flights_tbl %>% 
  select(air_time) %>%
  mutate(air_time_cut = cut(air_time, breaks=c(0, 80, 120, 190, 900), labels="-"))
air_time_with_cut <- q %>% collect
head(air_time_with_cut, 3)
## Source: local data frame [3 x 2]
## 
##   air_time air_time_cut
## 1      227      191-900
## 2      227      191-900
## 3      160      121-190

3-3. count_if() and n_if() in summarise()

When we want to count rows that condition is satisfied, we might write like this.
q <- flights_tbl %>% 
  select(air_time) %>%
  summarise(odd_airtime_rows = sum(if(air_time %% 2 == 1) 1L else 0L), 
            even_airtime_rows = sum(if(air_time %% 2 == 0) 1L else 0L), 
            total_rows=n())
q %>% collect
## Source: local data frame [1 x 3]
## 
##   odd_airtime_rows even_airtime_rows total_rows
## 1           164150            163196     336776
The count_if() and n_if() functions are a shortcut for it merely.
  • count_if(condition)
  • n_if(condition)

q <- flights_tbl %>% 
  select(air_time) %>%
  summarise(odd_airtime_rows = count_if(air_time %% 2 == 1), 
            even_airtime_rows = n_if(air_time %% 2 == 0), 
            total_rows=n())
q %>% collect
## Source: local data frame [1 x 3]
## 
##   odd_airtime_rows even_airtime_rows total_rows
## 1           164150            163196     336776
Both functions do exactly the same thing.

3-4. Improved filter()

If you use dplyr with databases in pure mind, you can encounter the unintended action like below.
library(dplyr)

db <- src_sqlite("my_db.sqlite3")
flights_tbl <- tbl(db, "flights")
q <- flights_tbl %>%
  select(month, air_time) %>%
  filter(month == 1) %>%
  filter(air_time > 200 || air_time < 100)
q$query
## <Query> SELECT "month" AS "month", "air_time" AS "air_time"
## FROM "flights"
## WHERE "month" = 1.0 AND "air_time" > 200.0 OR "air_time" < 100.0
## <SQLiteConnection>
Did you expect the WHERE clause to be that?
If you use dplyrr, it becomes natural by adding parentheses.
library(dplyrr)

db <- src_sqlite("my_db.sqlite3")
flights_tbl <- tbl(db, "flights")
q <- flights_tbl %>%
  select(month, air_time) %>%
  filter(month == 1) %>%
  filter(air_time > 200 || air_time < 100)
q$query
## <Query> SELECT "month" AS "month", "air_time" AS "air_time"
## FROM "flights"
## WHERE ("month" = 1.0) AND ("air_time" > 200.0 OR "air_time" < 100.0)
## <SQLiteConnection>

4. Functions for PostgreSQL

4-1. moving_**() in mutate()

dplyrr has four moving_**() functions that you can use in mutate().
  • moving_mean(variable, preceding, following)
  • moving_max(variable, preceding, following)
  • moving_min(variable, preceding, following)
  • moving_sum(variable, preceding, following)
When you want to set the same preceding and following, you can omit following.
For illustration, we use the test database that is PostgreSQL.
srcs <- temp_srcs("postgres")
df <- data.frame(x = 1:5)
tbls <- dplyr:::temp_load(srcs, list(df = df))
temp_tbl <- tbls$postgres$df
head(temp_tbl)
##   x
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
Compute moving average with 1 preceding and 1 following.
q <- temp_tbl %>%
  mutate(y = moving_mean(x, 1))
q %>% collect
## Source: local data frame [5 x 2]
## 
##   x   y
## 1 1 1.5
## 2 2 2.0
## 3 3 3.0
## 4 4 4.0
## 5 5 4.5
Comfirm query.
q$query
## <Query> SELECT "x", "y"
## FROM (SELECT "x", avg("x") OVER (ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS "y"
## FROM "tlsqbjsuou") AS "_W1"
## <PostgreSQLConnection:(10316,0)>
Compute moving mean with 1 preceding and 2 following.
q <- temp_tbl %>%
  mutate(y = moving_mean(x, 1, 2))
q %>% collect
## Source: local data frame [5 x 2]
## 
##   x   y
## 1 1 2.0
## 2 2 2.5
## 3 3 3.5
## 4 4 4.0
## 5 5 4.5
Comfirm query.
q$query
## <Query> SELECT "x", "y"
## FROM (SELECT "x", avg("x") OVER (ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) AS "y"
## FROM "tlsqbjsuou") AS "_W2"
## PostgreSQLConnection:(10316,0)>
Similary, you can use the other moving_**() functions.

4-2. first_value() in mutate()

dplyrr has first_value() function that you can use in mutate().
  • first_value(value, order_by)
When you want to set the same value and order_by, you can omit order_by.
For illustration, we use the test database that is PostgreSQL.
srcs <- temp_srcs("postgres")
df <- data.frame(class = c("A", "A", "B", "B", "C", "C"), x = 1:6, y = 6:1)
tbls <- dplyr:::temp_load(srcs, list(df=df))
temp_tbl <- tbls$postgres$df
head(temp_tbl)
##   class x y
## 1     A 1 6
## 2     A 2 5
## 3     B 3 4
## 4     B 4 3
## 5     C 5 2
## 6     C 6 1
Get the first values of x partitioned by class and ordered by x.
q <- temp_tbl %>%
  group_by(class) %>%
  mutate(z = first_value(x))
q %>% collect
## Source: local data frame [6 x 4]
## Groups: class
## 
##   class x y z
## 1     A 1 6 1
## 2     A 2 5 1
## 3     B 3 4 3
## 4     B 4 3 3
## 5     C 5 2 5
## 6     C 6 1 5
See query.
q$query
## <Query> SELECT "class", "x", "y", "z"
## FROM (SELECT "class", "x", "y", first_value("x") OVER (PARTITION BY "class" ORDER BY "x") AS "z"
## FROM "slrhxfdvrt") AS "_W3"
## <PostgreSQLConnection:(10316,0)>
Get the first values of x partitioned by class and ordered by y.
q <- temp_tbl %>%
  group_by(class) %>%
  mutate(z = first_value(x, y))
q %>% collect
## Source: local data frame [6 x 4]
## Groups: class
## 
##   class x y z
## 1     A 2 5 2
## 2     A 1 6 2
## 3     B 4 3 4
## 4     B 3 4 4
## 5     C 6 1 6
## 6     C 5 2 6
See query.
q$query
## <Query> SELECT "class", "x", "y", "z"
## FROM (SELECT "class", "x", "y", first_value("x") OVER (PARTITION BY "class" ORDER BY "y") AS "z"
## FROM "slrhxfdvrt") AS "_W4"
## <PostgreSQLConnection:(10316,0)>
Get the first values of x partitioned by class and ordered by descent of y.
q <- temp_tbl %>%
  group_by(class) %>%
  mutate(z = first_value(x, desc(y)))
q %>% collect
## Source: local data frame [6 x 4]
## Groups: class
## 
##   class x y z
## 1     A 1 6 1
## 2     A 2 5 1
## 3     B 3 4 3
## 4     B 4 3 3
## 5     C 5 2 5
## 6     C 6 1 5
See query.
q$query
## <Query> SELECT "class", "x", "y", "z"
## FROM (SELECT "class", "x", "y", first_value("x") OVER (PARTITION BY "class" ORDER BY "y" DESC) AS "z"
## FROM "slrhxfdvrt") AS "_W5"
## <PostgreSQLConnection:(10316,0)>

5. Miscellaneous

update_dplyrr()

update_dplyrr() is a shortcut/syntax sugar of
devtools::install_github("hoxo-m/dplyrr")

unload_dplyrr()

unload_dplyrr() is a shortcut/syntax sugar of
detach("package:dplyrr", unload = TRUE)
detach("package:dplyr", unload = TRUE)

6. Bug reports


2015/07/28

The complete catalog of argument variations of select() in dplyr

When I read the dplyr vignette, I found a convenient way to select sequential columns such as select(data, year:day).
Because I had inputted only column names to select() function, I was deeply affected by the convenient way.

On closer inspection, I found that the select() function accepts many types of input.
Here, I will enumerate the variety of acceptable inputs for select() function.

By the way, these column selection methods also can use in the summarise_each(), mutate_each() and some functions in tidyr package(e.g. gather()).

1. Whole codes.

At first, whole codes were shown for perspicuity.
In the sections below, the details of each command were shown.

# Data preparation ------------------------------------------------------------------
library(dplyr)
library(nycflights13)
set.seed(123)
data <- sample_n(flights, 3)

glimpse(data)

# Basic method of use select() ----------------------------------------------------------------------
select(data, year)
select(data, year, month, day)
select(data, year:day)

select(data, -year, -month, -day)
select(data, -(year:day))

select(data, 1, 2, 3)
select(data, 1:3)

select(data, -1, -2, -3)
select(data, -(1:3))

select(data, year:day, -month)
select(data, -(year:day), month)
select(data, 1:3, -2)
select(data, -(1:3), 2)

# Utility functions of select() --------------------------------------------------------------
select(data, starts_with("arr"))
select(data, ends_with("time"))
select(data, contains("_"))
select(data, matches("^(dep|arr)_"))

data2 <- data
colnames(data2) <- sprintf("x%d", 1:16)
select(data2, num_range("x", 8:11))
select(data2, num_range("x", c(9, 11)))

data3 <- data
colnames(data3) <- sprintf("x%02d", 1:16)
select(data3, num_range("x", 8:11, width=2))

col_vector <- c("year", "month", "day")
select(data, col_vector)
select(data, one_of(col_vector))

select(data, everything())

select(data, -starts_with("arr"))

# Standard evaluation --------------------------------------------------------------------
select_(data, "year", "month", "day")

col_vector <- c("year", "month", "day")
select_(data, .dots = col_vector)

select_(data, 'year:day')
select_(data, 'year:day', '-month')
select_(data, '-(year:day)')
select_(data, 'starts_with("arr")')
select_(data, '-ends_with("time")')

select_(data, .dots = c('starts_with("arr")', '-ends_with("time")'))

2. Data preparation.

To follow the dplyr vignette, flights data set in nycflights13 package were used as an example.

library(dplyr)
library(nycflights13)

set.seed(123)
data <- sample_n(flights, 3)

glimpse(data)
Variables:
$ year      (int) 2013, 2013, 2013
$ month     (int) 12, 7, 3
$ day       (int) 15, 17, 2
$ dep_time  (int) 2124, 651, 1636
$ dep_delay (dbl) -4, -9, 1
$ arr_time  (int) 2322, 936, 1800
$ arr_delay (dbl) 1, -28, 0
$ carrier   (chr) "UA", "DL", "WN"
$ tailnum   (chr) "N801UA", "N194DN", "N475WN"
$ flight    (int) 289, 763, 1501
$ origin    (chr) "EWR", "JFK", "LGA"
$ dest      (chr) "DTW", "LAX", "MKE"
$ air_time  (dbl) 88, 306, 103
$ distance  (dbl) 488, 2475, 738
$ hour      (dbl) 21, 6, 16
$ minute    (dbl) 24, 51, 36

This data set includes the 16 columns shown above.

3. Basic method of use select().

At first, the ways of using select() were shown.

select(data, year)
  year
1 2013
2 2013
3 2013

This process shows the way to take the year column out of data. To pick multiple columns, you can write the following.

select(data, year, month, day)
  year month day
1 2013    12  15
2 2013     7  17
3 2013     3   2

If columns were sequential in the dataset, you could write the following to pick sequential columns.

select(data, year:day)
  year month day
1 2013    12  15
2 2013     7  17
3 2013     3   2

If you want to remove a specific column, add - in the head of the column name as follows.

select(data, -year, -month, -day)
  dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1     2124        -4     2322         1      UA  N801UA    289    EWR  DTW       88      488   21     24
2      651        -9      936       -28      DL  N194DN    763    JFK  LAX      306     2475    6     51
3     1636         1     1800         0      WN  N475WN   1501    LGA  MKE      103      738   16     36

To remove sequential columns, put sequential columns in brackets () connected with a colon.

select(data, -(year:day))
  dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1     2124        -4     2322         1      UA  N801UA    289    EWR  DTW       88      488   21     24
2      651        -9      936       -28      DL  N194DN    763    JFK  LAX      306     2475    6     51
3     1636         1     1800         0      WN  N475WN   1501    LGA  MKE      103      738   16     36

It is also possible to pick columns by choosing the column number.

select(data, 1, 2, 3)
select(data, 1:3)
  year month day
1 2013    12  15
2 2013     7  17
3 2013     3   2

Next topics are slightly advanced.

It is possible to pick sequential columns temporarily and remove some of these.

select(data, year:day, -month)
  year day
1 2013  15
2 2013  17
3 2013   2

It is also possible to remove sequential columns and keep a part of these.

select(data, -(year:day), month)
  dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute month
1     2124        -4     2322         1      UA  N801UA    289    EWR  DTW       88      488   21     24    12
2      651        -9      936       -28      DL  N194DN    763    JFK  LAX      306     2475    6     51     7
3     1636         1     1800         0      WN  N475WN   1501    LGA  MKE      103      738   16     36     3

Even using a column number can give the same result with the column name. (The results are omitted.)

select(data, 1:3, -2)
select(data, -(1:3), 2)

4. Utility functions of select().

Utility functions existing in select(), summarise_each() and mutate_each() in dplyr as well as some functions in the tidyr package.

Seven functions existed in the utility functions of select().

  • starts_with(match, ignore.case = TRUE)
  • ends_with(match, ignore.case = TRUE)
  • contains(match, ignore.case = TRUE)
  • matches(match, ignore.case = TRUE)
  • num_range(prefix, range, width = NULL)
  • one_of(...)
  • everything()

We now check the respective commands and how to use them.

First, starts_with() picks columns whose name starts with the specified string.

select(data, starts_with("arr"))
  arr_time arr_delay
1     2322         1
2      936       -28
3     1800         0

The argument ignore.case specifies whether the lowercase is classified as a capital letter(default is TRUE).

The ends_with() picks columns whose name ends with the specified string .

select(data, ends_with("time"))
  dep_time arr_time air_time
1     2124     2322       88
2      651      936      306
3     1636     1800      103

The contains() picks columns whose name contains the specified string.

select(data, contains("_"))
  dep_time dep_delay arr_time arr_delay air_time
1     2124        -4     2322         1       88
2      651        -9      936       -28      306
3     1636         1     1800         0      103

The matches() picks columns based on a regular expression matching string.

select(data, contains("_"))
  dep_time dep_delay arr_time arr_delay air_time
1     2124        -4     2322         1       88
2      651        -9      936       -28      306
3     1636         1     1800         0      103

When the numbers were included in column names, num_range() might be useful.
In this example, we change the column names to be x1-x16 and execute num_range() command for the data set.

data2 <- data
colnames(data2) <- sprintf("x%d", 1:16)
select(data2, num_range("x", 8:11))
  x8     x9  x10 x11
1 UA N801UA  289 EWR
2 DL N194DN  763 JFK
3 WN N475WN 1501 LGA

By specifying as num_range("x", 8:11), columns x8 to x11 can be identified.
Numbers in the column name are not necessarily sequential.

select(data2, num_range("x", c(9, 11)))
      x9 x11
1 N801UA EWR
2 N194DN JFK
3 N475WN LGA

When column names were padded, the column name was shown as x01.
Here, the argument width in num_range() might be useful.
We now try this process for a data that changes the column names as x01-x16.

data3 <- data
colnames(data3) <- sprintf("x%02d", 1:16)
select(data3, num_range("x", 8:11, width=2))
  x08    x09  x10 x11
1  UA N801UA  289 EWR
2  DL N194DN  763 JFK
3  WN N475WN 1501 LGA

By specifying as width=2, the zero filled columns can be picked out.

When a column is named as a vector or character string, one_of() might be useful.
An error occurs in the following case with select().

col_vector <- c("year", "month", "day")
select(data, col_vector)
Error: All select() inputs must resolve to integer column positions.
The following do not:
*  col_vector

However, an intended process occurs in the case of one_of().

select(data, one_of(col_vector))
  year month day
1 2013    12  15
2 2013     7  17
3 2013     3   2

The everything() selects all columns. (Result was omitted.)

select(data, everything())

If add - is in the head of a utility function name, we can pick out all except for the area specified in the utility function.

select(data, -starts_with("arr"))
  year month day dep_time dep_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013    12  15     2124        -4      UA  N801UA    289    EWR  DTW       88      488   21     24
2 2013     7  17      651        -9      DL  N194DN    763    JFK  LAX      306     2475    6     51
3 2013     3   2     1636         1      WN  N475WN   1501    LGA  MKE      103      738   16     36

5. Standard evaluation.

Thus far, we explained the normal select() function; however, the normal select() function cannot handle character strings as arguments.
This might become a problem when column names are given as a string vector for example.
To solve this problem, the select_() function was equipped in dplyr. (Caution: An underscore was added in the function name.)
The use of the select_() function is the same as the select() except specifying columns by string; however, attention is needed when specifying a column name by a vector.

select_(data, "year", "month", "day")
  year month day
1 2013    12  15
2 2013     7  17
3 2013     3   2

When specifying column names by a vector, the vector should be given the .dot argument.

col_vector <- c("year", "month", "day")
select_(data, .dots = col_vector)
  year month day
1 2013    12  15
2 2013     7  17
3 2013     3   2

All arguments that can use the select() function are also possible candidates for the select_() function.

select_(data, 'year:day')
select_(data, 'year:day', '-month')
select_(data, '-(year:day)')
select_(data, 'starts_with("arr")')
select_(data, '-ends_with("time")')

Furthermore, also in this case, the argument vector should be given the .dot argument in the select_() function.

select_(data, .dots = c('starts_with("arr")', '-ends_with("time")'))

6. References.

Introduction to dplyr
https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Help page for select() function
> help("select", package = "dplyr")

Original entry in Japanese by hoxo_m
https://qiita.com/hoxo_m/items/f2f1793c6f086d381340

Translated by siero

2015/07/12

New package "SparkRext" - SparkR extension for closer to dplyr

Apache Spark is one of the hottest products in data science.
Spark 1.4.0 has formally adopted SparkR package which enables to handle Spark DataFrames on R.

SparkR is very useful and powerful.
One of the reasons is that SparkR DataFrames present an API similar to dplyr.
We launced our new package "SparkRext" to redefine the functions of SparkR to enable NSE(Non stanard Evaluation) inputs.
As a result, the functions will be able to be used in the same way as dplyr.

If you want to know about SparkRext package in detail, please check our blog post here.