fastplyr aims to provide a tidyverse frontend using a collapse backend. This means that, from a user’s point of view, the functions behave like their tidyverse equivalents and thus require little to no change to existing code to convert.
fastplyr is designed to handle operations that involve larger numbers of groups and generally larger data.
You can install the development version of fastplyr from GitHub with:
# install.packages("pak")
pak::pak("NicChr/fastplyr")
Load packages
library(tidyverse)
#> Warning: package 'dplyr' was built under R version 4.4.1
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fastplyr)
#>
#> Attaching package: 'fastplyr'
#>
#> The following object is masked from 'package:dplyr':
#>
#> desc
#>
#> The following objects are masked from 'package:tidyr':
#>
#> crossing, nesting
library(nycflights13)
library(bench)
While the syntax and user interface of fastplyr align very closely with dplyr most of the time, there are a few key differences.
|  | dplyr | fastplyr |
|---|---|---|
| `.by` | Groups are always sorted by order of first appearance when using `.by` | Groups are always sorted by default, even when using `.by`. One can use the other ordering through `f_group_by(.order = FALSE)` |
| Many groups | Generally slow for data with many groups. | Designed to be fast for data with many groups. |
| Handling of dots (`...`) | dplyr almost always executes `...` expressions in a way that latter expressions depend on previous ones | Some functions like `f_summarise` and `f_expand` execute the expressions in `...` independently. |
| Duplicate rows | No dedicated function for this; solutions using `group_by() \|> filter(n() > 1)` are generally slow for larger data. | Dedicated function `f_duplicates` can do this very fast and with fine control. |
| Unique group IDs | Achieved through `mutate(cur_group_id())` | Dedicated fast function `add_group_id()` |
| Row slicing | `slice()` supports data-masked expressions supplied to `...` | Data-masked expressions not supported in `f_slice_` functions; use `f_filter()` for this behaviour (see the sketch below this table) |
| Memory usage | High memory usage | Lower usage compared to dplyr |
| Joins | Accepts different types of joins, e.g. rolling and equality joins. | Accepts only equality joins of the form `x == y` |
| rowwise | `rowwise_df` accepted and everything subsetted implicitly using `[[` | `rowwise_df` not accepted; must use `f_rowwise_df`, which creates a `grouped_df` with a row ID column. Implicit `[[` subsetting does not occur. |
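To make the row-slicing difference concrete, here is a minimal sketch (these particular calls are illustrative and not from the original examples):

# dplyr: slice() accepts data-masked expressions in `...`
flights |>
  slice(which(dep_delay > 60))

# fastplyr: f_slice() takes plain row indices only; use f_filter()
# for data-masked row selection
flights |>
  f_filter(dep_delay > 60)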
All tidyverse alternative functions are prefixed with ‘f_’. For example, `dplyr::distinct` becomes `fastplyr::f_distinct`.
flights |>
  f_distinct(origin, dest)
#> # A tibble: 224 × 2
#> origin dest
#> <chr> <chr>
#> 1 EWR IAH
#> 2 LGA IAH
#> 3 JFK MIA
#> 4 JFK BQN
#> 5 LGA ATL
#> # ℹ 219 more rows
`f_distinct` has an additional `.sort` argument, which is much faster than sorting afterwards.
mark(
  fastplyr_distinct_sort = flights |>
    f_distinct(origin, dest, tailnum, .sort = TRUE),
  dplyr_distinct_sort = flights |>
    distinct(origin, dest, tailnum) |>
    arrange_all()
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_distinct_sort 10.5ms 12.6ms 75.0 2.96MB 2.08
#> 2 dplyr_distinct_sort 22.2ms 23.5ms 38.6 11.38MB 7.71
`f_group_by` operates very similarly, with an additional feature that allows you to specify whether group data should be ordered or not. This ultimately controls whether the groups end up sorted in expressions like `count` and `summarise`, or in this case, `f_count` and `f_summarise`.
# Like dplyr
flights |>
  f_group_by(month) |>
  f_count()
#> # A tibble: 12 × 2
#> # Groups: month [12]
#> month n
#> <int> <int>
#> 1 1 27004
#> 2 2 24951
#> 3 3 28834
#> 4 4 28330
#> 5 5 28796
#> # ℹ 7 more rows
# Group data is sorted by order of first appearance
flights |>
  f_group_by(month, .order = FALSE) |>
  f_count()
#> # A tibble: 12 × 2
#> # Groups: month [12]
#> month n
#> <int> <int>
#> 1 1 27004
#> 2 10 28889
#> 3 11 27268
#> 4 12 28135
#> 5 2 24951
#> # ℹ 7 more rows
Just a reminder that all fastplyr functions are interchangeable with dplyr ones, both ways.
### With dplyr::count
flights |>
  f_group_by(month) |>
  count()
#> # A tibble: 12 × 2
#> # Groups: month [12]
#> month n
#> <int> <int>
#> 1 1 27004
#> 2 2 24951
#> 3 3 28834
#> 4 4 28330
#> 5 5 28796
#> # ℹ 7 more rows
### With dplyr::group_by
flights |>
  group_by(month) |>
  f_count()
#> # A tibble: 12 × 2
#> # Groups: month [12]
#> month n
#> <int> <int>
#> 1 1 27004
#> 2 2 24951
#> 3 3 28834
#> 4 4 28330
#> 5 5 28796
#> # ℹ 7 more rows
`f_summarise` behaves like dplyr’s `summarise` except for two things: it evaluates the expressions in `...` independently of one another, and it includes optimisations for common statistical functions, which are very fast for many groups (see the sketch after the example below).
grouped_flights <- flights |>
  group_by(across(where(is.character)))

grouped_flights |>
  f_summarise(
    n = n(), mean_dep_delay = mean(dep_delay)
  )
#> # A tibble: 52,807 × 6
#> carrier tailnum origin dest n mean_dep_delay
#> <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 9E N146PQ JFK ATL 8 9.62
#> 2 9E N153PQ JFK ATL 5 -0.4
#> 3 9E N161PQ JFK ATL 3 -2
#> 4 9E N162PQ EWR DTW 1 160
#> 5 9E N162PQ JFK ATL 1 -6
#> # ℹ 52,802 more rows
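To see the first of these differences in action, here is a minimal sketch (these particular expressions are illustrative, not from the original examples). In dplyr, a later expression in `...` can refer to one defined before it; `f_summarise` evaluates each expression independently, so such cross-references should be avoided:

# dplyr: the second expression can refer to `n`, defined just before it
flights |>
  summarise(n = n(), n_doubled = n * 2)

# fastplyr: each expression is evaluated independently, so refer to the
# underlying columns directly rather than to earlier results
flights |>
  f_summarise(n = n(), mean_dep_delay = mean(dep_delay))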
And a benchmark
mark(
  fastplyr_summarise = grouped_flights |>
    f_summarise(
      n = n(), mean_dep_delay = mean(dep_delay)
    ),
  dplyr_summarise = grouped_flights |>
    summarise(
      n = n(), mean_dep_delay = mean(dep_delay, na.rm = TRUE),
      .groups = "drop"
    )
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_summarise 3.29ms 4.33ms 214. 2.09MB 4.00
#> 2 dplyr_summarise 647.77ms 647.77ms 1.54 9.57MB 10.8
Joins work much the same way as in dplyr.
left <- flights |>
  f_select(origin, dest, time_hour)

hours <- sample(unique(left$time_hour), 5000)
right <- as.data.frame(unclass(as.POSIXlt(hours)))
right$time_hour <- hours
# Left join
left |>
  f_left_join(right)
#> # A tibble: 336,776 × 14
#> origin dest time_hour sec min hour mday mon year wday
#> * <chr> <chr> <dttm> <dbl> <int> <int> <int> <int> <int> <int>
#> 1 EWR IAH 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 2 LGA IAH 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 3 JFK MIA 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 4 JFK BQN 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 5 LGA ATL 2013-01-01 06:00:00 0 0 6 1 0 113 2
#> # ℹ 336,771 more rows
#> # ℹ 4 more variables: yday <int>, isdst <int>, zone <chr>, gmtoff <int>
# Inner join
left |>
  f_inner_join(right)
#> # A tibble: 244,029 × 14
#> origin dest time_hour sec min hour mday mon year wday
#> <chr> <chr> <dttm> <dbl> <int> <int> <int> <int> <int> <int>
#> 1 EWR IAH 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 2 LGA IAH 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 3 JFK MIA 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 4 JFK BQN 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 5 LGA ATL 2013-01-01 06:00:00 0 0 6 1 0 113 2
#> # ℹ 244,024 more rows
#> # ℹ 4 more variables: yday <int>, isdst <int>, zone <chr>, gmtoff <int>
# Anti join
left |>
  f_anti_join(right)
#> # A tibble: 92,747 × 3
#> origin dest time_hour
#> <chr> <chr> <dttm>
#> 1 LGA ATL 2013-01-01 14:00:00
#> 2 LGA ATL 2013-01-01 14:00:00
#> 3 EWR ORD 2013-01-01 14:00:00
#> 4 EWR SEA 2013-01-01 14:00:00
#> 5 EWR ORD 2013-01-01 14:00:00
#> # ℹ 92,742 more rows
# Semi join
left |>
  f_semi_join(right)
#> # A tibble: 244,029 × 3
#> origin dest time_hour
#> <chr> <chr> <dttm>
#> 1 EWR IAH 2013-01-01 05:00:00
#> 2 LGA IAH 2013-01-01 05:00:00
#> 3 JFK MIA 2013-01-01 05:00:00
#> 4 JFK BQN 2013-01-01 05:00:00
#> 5 LGA ATL 2013-01-01 06:00:00
#> # ℹ 244,024 more rows
# Full join
left |>
  f_full_join(right)
#> # A tibble: 336,776 × 14
#> origin dest time_hour sec min hour mday mon year wday
#> * <chr> <chr> <dttm> <dbl> <int> <int> <int> <int> <int> <int>
#> 1 EWR IAH 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 2 LGA IAH 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 3 JFK MIA 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 4 JFK BQN 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 5 LGA ATL 2013-01-01 06:00:00 0 0 6 1 0 113 2
#> # ℹ 336,771 more rows
#> # ℹ 4 more variables: yday <int>, isdst <int>, zone <chr>, gmtoff <int>
And a benchmark comparing fastplyr and dplyr joins
mark(
  fastplyr_left_join = f_left_join(left, right, by = "time_hour"),
  dplyr_left_join = left_join(left, right, by = "time_hour")
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_left_join 10.1ms 13.4ms 75.5 19.3MB 75.5
#> 2 dplyr_left_join 36.5ms 36.6ms 27.2 45MB 72.5
mark(
  fastplyr_inner_join = f_inner_join(left, right, by = "time_hour"),
  dplyr_inner_join = inner_join(left, right, by = "time_hour")
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_inner_join 5.58ms 11.4ms 93.9 22.2MB 60.4
#> 2 dplyr_inner_join 25.06ms 30.1ms 32.8 37.9MB 43.7
mark(
  fastplyr_anti_join = f_anti_join(left, right, by = "time_hour"),
  dplyr_anti_join = anti_join(left, right, by = "time_hour")
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_anti_join 2.33ms 3.45ms 299. 3.76MB 24.3
#> 2 dplyr_anti_join 14.53ms 19.39ms 53.3 21.8MB 16.0
mark(
  fastplyr_semi_join = f_semi_join(left, right, by = "time_hour"),
  dplyr_semi_join = semi_join(left, right, by = "time_hour")
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_semi_join 3.67ms 5.8ms 177. 7.8MB 22.2
#> 2 dplyr_semi_join 14.75ms 18.5ms 55.4 21.9MB 29.3
mark(
  fastplyr_full_join = f_full_join(left, right, by = "time_hour"),
  dplyr_full_join = full_join(left, right, by = "time_hour")
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_full_join 9.64ms 14.1ms 73.5 20.6MB 27.6
#> 2 dplyr_full_join 28.18ms 32.4ms 31.4 44.6MB 78.5
`f_slice` and other `f_slice_` functions are very fast for many groups.
grouped_flights |>
  f_slice(1)
#> # A tibble: 52,807 × 19
#> # Groups: carrier, tailnum, origin, dest [52,807]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 7 614 615 -1 812 855
#> 2 2013 1 8 612 615 -3 901 855
#> 3 2013 1 9 615 615 0 NA 855
#> 4 2013 1 25 1530 1250 160 1714 1449
#> 5 2013 2 24 609 615 -6 835 855
#> # ℹ 52,802 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
grouped_flights |>
  f_slice_head(3)
#> # A tibble: 125,770 × 19
#> # Groups: carrier, tailnum, origin, dest [52,807]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 7 614 615 -1 812 855
#> 2 2013 1 13 612 615 -3 853 855
#> 3 2013 2 3 617 615 2 902 855
#> 4 2013 1 8 612 615 -3 901 855
#> 5 2013 1 22 614 615 -1 857 855
#> # ℹ 125,765 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
A quick benchmark to prove the point
mark(
  fastplyr_slice = grouped_flights |>
    f_slice_head(n = 3),
  dplyr_slice = grouped_flights |>
    slice_head(n = 3)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_slice 24.18ms 29.85ms 26.7 21.4MB 11.5
#> 2 dplyr_slice 3.32s 3.32s 0.301 26.6MB 10.2
In dplyr, to work with group IDs you must use the `mutate()` + `cur_group_id()` paradigm. In fastplyr you can just use `add_group_id()`, which is blazing fast.
# Unique ID for each group
grouped_flights |>
  add_group_id() |>
  f_select(group_id)
#> Adding missing grouping variables: 'carrier', 'tailnum', 'origin', 'dest'
#> # A tibble: 336,776 × 5
#> # Groups: carrier, tailnum, origin, dest [52,807]
#> carrier tailnum origin dest group_id
#> <chr> <chr> <chr> <chr> <int>
#> 1 UA N14228 EWR IAH 35951
#> 2 UA N24211 LGA IAH 36937
#> 3 AA N619AA JFK MIA 8489
#> 4 B6 N804JB JFK BQN 15462
#> 5 DL N668DN LGA ATL 20325
#> # ℹ 336,771 more rows
Another benchmark
mark(
  fastplyr_group_id = grouped_flights |>
    add_group_id() |>
    f_select(all_of(group_vars(grouped_flights)), group_id),
  dplyr_group_id = grouped_flights |>
    mutate(group_id = cur_group_id()) |>
    select(all_of(group_vars(grouped_flights)), group_id)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_group_id 2.24ms 2.77ms 342. 1.46MB 5.97
#> 2 dplyr_group_id 284.75ms 290.54ms 3.44 3.24MB 10.3
Based closely on `tidyr::expand`, `f_expand()` can cross join multiple vectors and data frames.
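As a small usage illustration before the benchmark (this call simply mirrors the benchmark below), crossing each origin-tailnum group with the months 1 to 12:

flights |>
  f_group_by(origin, tailnum) |>
  f_expand(month = 1:12)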
mark(
  fastplyr_expand = flights |>
    f_group_by(origin, tailnum) |>
    f_expand(month = 1:12),
  tidyr_expand = flights |>
    group_by(origin, tailnum) |>
    expand(month = 1:12),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_expand 20.19ms 21.81ms 42.3 8.87MB 7.69
#> 2 tidyr_expand 3.83s 3.83s 0.261 81.02MB 3.65
# Using `.cols` in `f_expand()` is very fast!
mark(
  fastplyr_expand = flights |>
    f_group_by(origin, dest) |>
    f_expand(.cols = c("year", "month", "day")),
  tidyr_expand = flights |>
    group_by(origin, dest) |>
    expand(year, month, day),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_expand 15.9ms 19.3ms 49.9 15.6MB 9.97
#> 2 tidyr_expand 218.2ms 222.8ms 4.39 66.7MB 4.39
Finding duplicate rows is a very common dataset operation and there is a dedicated function, `f_duplicates()`, to do exactly this.
flights |>
  f_duplicates(time_hour)
#> # A tibble: 329,840 × 1
#> time_hour
#> <dttm>
#> 1 2013-01-01 05:00:00
#> 2 2013-01-01 05:00:00
#> 3 2013-01-01 05:00:00
#> 4 2013-01-01 05:00:00
#> 5 2013-01-01 06:00:00
#> # ℹ 329,835 more rows
Benchmark against a common dplyr strategy for finding duplicates
mark(
  fastplyr_duplicates = flights |>
    f_duplicates(time_hour, .both_ways = TRUE, .add_count = TRUE, .keep_all = TRUE),
  dplyr_duplicates = flights |>
    add_count(time_hour) |>
    filter(n > 1)
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_duplicates 18.7ms 25.4ms 39.6 45.1MB 30.8
#> 2 dplyr_duplicates 65.8ms 70.3ms 14.4 59.5MB 23.9
In the worst-case scenarios, `f_filter()` is about the same speed as `filter()`, and in the best case it is much faster and more efficient. This is especially true for large data where small subsets of the data are returned.
full <- new_tbl(x = rnorm(5e07))
# A worst case scenario
mark(
  fastplyr_filter = full |>
    f_filter(abs(x) > 0),
  dplyr_filter = full |>
    filter(abs(x) > 0)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_filter 736ms 736ms 1.36 1.12GB 1.36
#> 2 dplyr_filter 946ms 946ms 1.06 1.68GB 2.11
# Best case scenario - filter results in small subset
mark(
  fastplyr_filter = full |>
    f_filter(x > 4),
  dplyr_filter = full |>
    filter(x > 4)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_filter 172ms 176ms 5.73 191MB 0
#> 2 dplyr_filter 520ms 520ms 1.92 763MB 1.92
Binding columns is where fastplyr is particularly fast, but binding rows is also substantially faster.
mark(
  fastplyr_bind_cols = f_bind_cols(grouped_flights, grouped_flights),
  dplyr_bind_cols = suppressMessages(
    bind_cols(grouped_flights, grouped_flights)
  )
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_bind_cols 58µs 65.9µs 14533. 48.55KB 5.31
#> 2 dplyr_bind_cols 243ms 243.1ms 4.11 1.31MB 8.23
mark(
  fastplyr_bind_rows = f_bind_rows(grouped_flights, grouped_flights),
  dplyr_bind_rows = bind_rows(grouped_flights, grouped_flights)
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_bind_rows 59ms 59.8ms 16.3 86.6MB 0
#> 2 dplyr_bind_rows 223ms 224ms 4.46 157.6MB 2.23
A typical tidy approach might use a mixture of `reframe()` and `enframe()`, which is a perfectly tidy and neat solution.
probs <- seq(0, 1, 0.25)

mtcars <- as_tbl(mtcars)

mtcars |>
  group_by(cyl) |>
  reframe(enframe(quantile(mpg, probs), "quantile", "mpg"))
#> # A tibble: 15 × 3
#> cyl quantile mpg
#> <dbl> <chr> <dbl>
#> 1 4 0% 21.4
#> 2 4 25% 22.8
#> 3 4 50% 26
#> 4 4 75% 30.4
#> 5 4 100% 33.9
#> # ℹ 10 more rows
fastplyr though has a dedicated function for quantile calculation, `tidy_quantiles()`, which requires less code to type.
# Wide
mtcars |>
  tidy_quantiles(mpg, .by = cyl, pivot = "wide")
#> # A tibble: 3 × 6
#> cyl p0 p25 p50 p75 p100
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 21.4 22.8 26 30.4 33.9
#> 2 6 17.8 18.6 19.7 21 21.4
#> 3 8 10.4 14.4 15.2 16.2 19.2
# Long
mtcars |>
  tidy_quantiles(mpg, .by = cyl, pivot = "long")
#> # A tibble: 15 × 3
#> cyl .quantile mpg
#> <dbl> <fct> <dbl>
#> 1 4 p0 21.4
#> 2 4 p25 22.8
#> 3 4 p50 26
#> 4 4 p75 30.4
#> 5 4 p100 33.9
#> # ℹ 10 more rows
Not only can you choose how to pivot as shown above, you can also calculate quantiles for multiple variables.
multiple_quantiles <- mtcars |>
  tidy_quantiles(across(where(is.numeric)), pivot = "long")

multiple_quantiles
#> # A tibble: 5 × 12
#> .quantile mpg cyl disp hp drat wt qsec vs am gear carb
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 p0 10.4 4 71.1 52 2.76 1.51 14.5 0 0 3 1
#> 2 p25 15.4 4 121. 96.5 3.08 2.58 16.9 0 0 3 2
#> 3 p50 19.2 6 196. 123 3.70 3.32 17.7 0 0 4 2
#> 4 p75 22.8 8 326 180 3.92 3.61 18.9 1 1 4 4
#> 5 p100 33.9 8 472 335 4.93 5.42 22.9 1 1 5 8
# Quantile names are a convenient factor
multiple_quantiles$.quantile
#> [1] p0 p25 p50 p75 p100
#> Levels: p0 p25 p50 p75 p100
`tidy_quantiles()` of course is fast when many groups are involved.
mark(
  fastplyr_quantiles = flights |>
    tidy_quantiles(dep_delay, pivot = "long",
                   .by = c(year, month, day, origin)),
  dplyr_quantiles = flights |>
    group_by(year, month, day, origin) |>
    reframe(enframe(quantile(dep_delay, seq(0, 1, 0.25), na.rm = TRUE))),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_quantiles 22.8ms 23.5ms 42.4 4.22MB 0
#> 2 dplyr_quantiles 214.4ms 214.8ms 4.65 24.98MB 4.65
Let’s run some more benchmarks for fun, this time including tidytable, which fastplyr closely resembles since it also uses a tidy frontend, but with a data.table backend.
n_rows <- 10^7
n_groups <- 10^6

tbl <- new_tbl(x = rnorm(n_rows))
tbl <- tbl |>
  mutate(y = as.character(round(x, 6)),
         g = sample.int(n_groups, n_rows, TRUE))
tbl
#> # A tibble: 10,000,000 × 3
#> x y g
#> <dbl> <chr> <int>
#> 1 1.29 1.285351 433366
#> 2 -1.61 -1.613842 887462
#> 3 -0.787 -0.787209 550879
#> 4 -0.490 -0.489809 875660
#> 5 0.393 0.393453 550619
#> # ℹ 9,999,995 more rows
For this we will be using the `.by` argument from each package. Because fastplyr still sorts the groups by default, here we will set an internal option to use the alternative grouping algorithm that sorts groups by order of first appearance. This will likely be revisited at some point. To read about the differences, see `?collapse::GRP`.
library(tidytable)
#> Warning: tidytable was loaded after dplyr.
#> This can lead to most dplyr functions being overwritten by tidytable functions.
#> Warning: tidytable was loaded after tidyr.
#> This can lead to most tidyr functions being overwritten by tidytable functions.
#>
#> Attaching package: 'tidytable'
#> The following objects are masked from 'package:fastplyr':
#>
#> crossing, desc, nesting
#> The following objects are masked from 'package:dplyr':
#>
#> across, add_count, add_tally, anti_join, arrange, between,
#> bind_cols, bind_rows, c_across, case_match, case_when, coalesce,
#> consecutive_id, count, cross_join, cume_dist, cur_column, cur_data,
#> cur_group_id, cur_group_rows, dense_rank, desc, distinct, filter,
#> first, full_join, group_by, group_cols, group_split, group_vars,
#> if_all, if_any, if_else, inner_join, is_grouped_df, lag, last,
#> lead, left_join, min_rank, mutate, n, n_distinct, na_if, nest_by,
#> nest_join, nth, percent_rank, pick, pull, recode, reframe,
#> relocate, rename, rename_with, right_join, row_number, rowwise,
#> select, semi_join, slice, slice_head, slice_max, slice_min,
#> slice_sample, slice_tail, summarise, summarize, tally, top_n,
#> transmute, tribble, ungroup
#> The following objects are masked from 'package:purrr':
#>
#> map, map_chr, map_dbl, map_df, map_dfc, map_dfr, map_int, map_lgl,
#> map_vec, map2, map2_chr, map2_dbl, map2_df, map2_dfc, map2_dfr,
#> map2_int, map2_lgl, map2_vec, pmap, pmap_chr, pmap_dbl, pmap_df,
#> pmap_dfc, pmap_dfr, pmap_int, pmap_lgl, pmap_vec, walk
#> The following objects are masked from 'package:tidyr':
#>
#> complete, crossing, drop_na, expand, expand_grid, extract, fill,
#> nest, nesting, pivot_longer, pivot_wider, replace_na, separate,
#> separate_longer_delim, separate_rows, separate_wider_delim,
#> separate_wider_regex, tribble, uncount, unite, unnest,
#> unnest_longer, unnest_wider
#> The following objects are masked from 'package:tibble':
#>
#> enframe, tribble
#> The following objects are masked from 'package:stats':
#>
#> dt, filter, lag
#> The following object is masked from 'package:base':
#>
#> %in%
tidy_tbl <- as_tidytable(tbl)

# Setting an internal option to set all grouping to use the non-sorted type
options(.fastplyr.order.groups = FALSE)
mark(
  fastplyr_slice = tbl |>
    f_slice(3:5, .by = g),
  tidytable_slice = tidy_tbl |>
    slice(3:5, .by = g),
  check = FALSE,
  min_iterations = 3
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_slice 780.4ms 921.61ms 1.14 133MB 0.757
#> 2 tidytable_slice 6.2s 6.25s 0.159 176MB 1.80
mark(
  fastplyr_slice_head = tbl |>
    f_slice_head(n = 3, .by = g),
  tidytable_slice_head = tidy_tbl |>
    slice_head(n = 3, .by = g),
  fastplyr_slice_tail = tbl |>
    f_slice_tail(n = 3, .by = g),
  tidytable_slice_tail = tidy_tbl |>
    slice_tail(n = 3, .by = g),
  check = FALSE,
  min_iterations = 3
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_slice_head 722.23ms 817.01ms 1.26 191MB 0
#> 2 tidytable_slice_head 1.6s 1.62s 0.590 175MB 1.38
#> 3 fastplyr_slice_tail 763.4ms 796.16ms 1.26 194MB 0
#> 4 tidytable_slice_tail 3.39s 3.45s 0.281 175MB 1.69
Here we’ll calculate the mean of `x` by each group of `g`. Both tidytable and fastplyr have optimisations for `mean()` when it involves groups. tidytable internally uses data.table’s ‘gforce’ mean function, which is basically a dedicated C function to calculate means for many groups.
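For intuition, here is a hedged sketch of what calling such a dedicated grouped-mean routine looks like directly via collapse, fastplyr’s backend (this direct call is an illustration, not something done in the benchmarks):

library(collapse)

# One mean per group of g, computed in C without splitting the data
fmean(tbl$x, g = tbl$g)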
mark(
  fastplyr_sumarise = tbl |>
    f_summarise(mean = mean(x), .by = g),
  tidytable_sumarise = tidy_tbl |>
    summarise(mean = mean(x), .by = g, .sort = FALSE),
  check = FALSE,
  min_iterations = 3
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_sumarise 275ms 286ms 3.43 57.2MB 0
#> 2 tidytable_sumarise 242ms 244ms 4.09 305.4MB 2.05
Benchmarking more statistical functions
mark(
  fastplyr_sumarise2 = tbl |>
    f_summarise(n = n(), mean = mean(x), min = min(x), max = max(x), .by = g),
  tidytable_sumarise2 = tidy_tbl |>
    summarise(n = n(), mean = mean(x), min = min(x), max = max(x),
              .by = g, .sort = FALSE),
  check = FALSE,
  min_iterations = 3
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_sumarise2 348ms 360ms 2.72 72.5MB 0
#> 2 tidytable_sumarise2 386ms 393ms 2.55 320.7MB 1.27
mark(
  fastplyr_count = tbl |>
    f_count(y, g),
  tidytable_count = tidy_tbl |>
    count(y, g),
  check = FALSE,
  min_iterations = 3
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_count 346.42ms 354.08ms 2.82 229MB 1.41
#> 2 tidytable_count 3.39s 3.39s 0.295 496MB 0.589
It’s clear both fastplyr and tidytable are fast, and each has its strengths and weaknesses.