R is a great tool, but processing data in large text files is cumbersome.
chunked helps you to process large text files with dplyr while loading only a part of the data in memory.
It builds on the excellent R package LaF.
Processing commands are written in dplyr syntax, and
chunked (using LaF) takes care that the file is processed
chunk by chunk, using far less memory than loading it all at once.
chunked is useful for select-ing columns, mutate-ing columns
and filter-ing rows. It is less helpful for group-ing and summarize-ing large text files. It can be used in
'chunked' can be installed with
beta version with:

    install.packages('chunked', repos=c('https://cran.rstudio.com', 'https://edwindj.github.io/drat'))
and the development version with:
Enjoy! Feedback is welcome...
The most common case is processing a large text file: select or add columns, filter the rows, and write the result back to a text file.

    read_chunkwise("./large_file_in.csv", chunk_size=5000) %>%
      select(col1, col2, col5) %>%
      filter(col1 > 10) %>%
      mutate(col6 = col1 + col2) %>%
      write_chunkwise("./large_file_out.csv")
chunked will process the above statement in chunks of 5000 records. This differs from, for example,
read.csv, which reads all data into memory before processing it.
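To make the contrast with read.csv concrete, here is a minimal base R sketch of what "chunk by chunk" means. It is not how chunked or LaF is implemented; `process_in_chunks()` is a hypothetical helper written only for this illustration, and it assumes a simple comma-separated file with a header row and no quoted fields.

```r
# Hypothetical helper (not part of chunked): read `chunk_size` lines at a
# time, apply `fun` to each chunk, and append the result to `outfile`.
process_in_chunks <- function(infile, outfile, chunk_size, fun) {
  con <- file(infile, open = "r")
  on.exit(close(con))

  # The first line holds the column names.
  header <- readLines(con, n = 1)
  col_names <- strsplit(header, ",", fixed = TRUE)[[1]]

  first <- TRUE
  repeat {
    # Only `chunk_size` records are held in memory at any time.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break

    chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
    result <- fun(chunk)

    # Write the header once, then append the following chunks.
    write.table(result, outfile, sep = ",", row.names = FALSE,
                col.names = first, append = !first, quote = FALSE)
    first <- FALSE
  }
  invisible(outfile)
}

# Usage on a small example file built from iris.
infile <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".csv")
write.csv(iris, infile, row.names = FALSE, quote = FALSE)

process_in_chunks(infile, outfile, chunk_size = 30, fun = function(chunk) {
  # filter rows, select columns, then add a column
  chunk <- chunk[chunk$Sepal.Length > 5, c("Sepal.Length", "Sepal.Width", "Species")]
  chunk$Sepal.Sum <- chunk$Sepal.Length + chunk$Sepal.Width
  chunk
})

out <- read.csv(outfile)
```

Row-wise operations such as filtering and adding columns give the same result whether they are applied per chunk or to the whole file, which is why they fit this style of processing well.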
Another option is to use chunked as a preprocessing step before adding the data to a database.

    db <- src_sqlite('test.db', create=TRUE)

    tbl <-
      read_chunkwise("./large_file_in.csv", chunk_size=5000) %>%
      select(col1, col2, col5) %>%
      filter(col1 > 10) %>%
      mutate(col6 = col1 + col2) %>%
      write_chunkwise(db, 'my_large_table')

    # tbl now points to the table in sqlite.
chunked can also be used to export a database table chunkwise to a text file. Note, however, that in that case the processing takes place in the database, and the chunkwise restrictions apply only to the writing.
chunked will not start processing until collect or
write_chunkwise is called.

    data_chunks <-
      read_chunkwise("./large_file_in.csv", chunk_size=5000) %>%
      select(col1, col3)

    # won't start processing until
    collect(data_chunks)
    # or
    write_chunkwise(data_chunks, "test.csv")
    # or
    write_chunkwise(data_chunks, db, "test")
Syntax completion of variables of a chunkwise file in RStudio works like a charm...
chunked implements the following dplyr verbs:
Since data is processed in chunks, some dplyr verbs are not implemented:
summarize and group_by are implemented but generate a warning: they operate on each chunk and
not on the whole data set. However, this makes it easier to process a large file by repeatedly
aggregating the resulting data.

    tmp <- tempfile()
    write.csv(iris, tmp, row.names=FALSE, quote=FALSE)
    iris_cw <- read_chunkwise(tmp, chunk_size = 30) # read in chunks of 30 rows for this example

    iris_cw %>%
      group_by(Species) %>%              # group in each chunk
      summarise( m = mean(Sepal.Width)   # and summarize in each chunk
               , w = n()
               ) %>%
      as.data.frame %>%                  # since each Species has 50 records, results will be in multiple chunks
      group_by(Species) %>%              # group the results from the chunk
      summarise(m = weighted.mean(m, w)) # and summarize it again
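The two-stage aggregation above relies on the fact that a mean over the whole data set can be rebuilt from per-chunk means when each is weighted by its record count. The following sketch checks this in base R only, without chunked or dplyr; the chunking via `split()` is a stand-in chosen for illustration and is an assumption, not the package's mechanism.

```r
# Split iris into consecutive chunks of 30 rows, mimicking chunk_size = 30.
chunk_size <- 30
chunk_id <- ceiling(seq_len(nrow(iris)) / chunk_size)
chunks <- split(iris, chunk_id)

# Stage 1: per chunk, the mean (m) and record count (w) of each Species
# that actually occurs in that chunk.
partial <- do.call(rbind, lapply(chunks, function(chunk) {
  groups <- split(chunk$Sepal.Width, chunk$Species, drop = TRUE)
  data.frame(Species = names(groups),
             m = vapply(groups, mean, numeric(1)),
             w = lengths(groups),
             row.names = NULL)
}))

# Stage 2: combine the partial means per Species, weighted by the counts.
combined <- vapply(split(partial, partial$Species),
                   function(p) weighted.mean(p$m, p$w),
                   numeric(1))

# Reference: the mean computed directly on the whole data set.
direct <- tapply(iris$Sepal.Width, iris$Species, mean)

# TRUE when both routes agree up to floating point tolerance.
all.equal(as.numeric(combined), as.numeric(direct[names(combined)]))
```

The weights matter because a species can be spread unevenly over chunks: with 50 records per species and chunks of 30 rows, setosa contributes 30 records to the first chunk and 20 to the second, so a plain mean of the two partial means would weight them incorrectly.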