
Summary
Lists can be confusing, but they are also useful. They can hold data of different types and lengths, which makes them very versatile. Lists can be named or nested, and their elements can have the same or different lengths. This post deals with converting a list to a dataframe when its elements have unequal lengths.
Create a Named List
First, we’ll create a named, nested list whose elements have different lengths (a list of named lists). This example comes from blogging, where authors assign a category and tags to a post. The categories and tags may have just one value or many.
# my list (my.l)
my.l <- list()
my.l[[1]] <- list(categories = "R", tags = "list")
my.l[[2]] <- list(categories = "R", tags = c("list", "dataframe"))
names(my.l) <- c("post_1", "post_2")
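Before attempting a conversion, it can help to confirm where the lengths differ. The following is a minimal sketch in base R that rebuilds the same list so it runs on its own; the name `tag_counts` is only illustrative and not part of the original example.

```r
# Rebuild the example list in a single call
my.l <- list(
  post_1 = list(categories = "R", tags = "list"),
  post_2 = list(categories = "R", tags = c("list", "dataframe"))
)

# Count how many tags each post carries
tag_counts <- sapply(my.l, function(post) length(post$tags))
tag_counts

# Show the length of every field within each post
lapply(my.l, lengths)
```

A post whose fields report different lengths is the kind of element that trips up the row-binding step in the next section.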
Generate error
Several methods for converting lists to dataframes can be found in this Stack Overflow question. However, when the most popular method is applied to the above list, it generates an error because the ‘tags’ variable has one value in post_1 and two values in post_2. The error reads, “invalid list argument: all variables should have the same length.” This is a common problem when scraping webpages, where the html_nodes function will sometimes capture multiple values from a page. Others have noted the problem when an API call returns a response with missing values.
# Top solution on Stackoverflow
do.call(rbind.data.frame, my.l)
Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors(), : invalid list argument: all variables should have the same length
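In a script that processes many scraped pages, an uncaught error like this would stop the whole run. As a sketch only, the call can be wrapped in base R's `tryCatch` so the failure is captured and reported instead; the variable name `result` is an assumption made for illustration.

```r
# Rebuild the example list so the block is self-contained
my.l <- list(
  post_1 = list(categories = "R", tags = "list"),
  post_2 = list(categories = "R", tags = c("list", "dataframe"))
)

# Attempt the conversion, returning the condition object on failure
result <- tryCatch(
  do.call(rbind.data.frame, my.l),
  error = function(e) e
)

# Report the problem without halting the script
if (inherits(result, "error")) {
  message("Conversion failed: ", conditionMessage(result))
}
```

This does not fix the unequal lengths; it only makes the failure easier to handle and log.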
Simple Solution
Probably the fastest and most direct method is to use the rbindlist function from the data.table package. Note that the list names are omitted from the result.
data.table::rbindlist(my.l, fill = TRUE)
   categories      tags
1:          R      list
2:          R      list
3:          R dataframe
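If the list names are needed, `rbindlist` has an `idcol` argument that stores them in a column of the result. The sketch below assumes the data.table package is installed; the column name "names" is my own choice.

```r
library(data.table)

# Rebuild the example list so the block is self-contained
my.l <- list(
  post_1 = list(categories = "R", tags = "list"),
  post_2 = list(categories = "R", tags = c("list", "dataframe"))
)

# idcol adds a column holding the name of the list element each row came from
dt_named <- rbindlist(my.l, fill = TRUE, idcol = "names")
dt_named
```

Because post_2 contributes two rows, its name appears twice in the id column.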
Not-so-simple Solution
This is the not-so-simple solution. It introduced me to a new apply function, rapply, which recursively applies a function to a list and so works in nested-list situations. Collapsing all of the values into a single column in a data.frame lets me easily inspect the differences between list elements. It also gives me the flexibility to later split that column into rows or columns.
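# collapse each element's values into a single "|"-delimited string
new.l <- rapply(my.l, function(x) paste(x, collapse = "|"), how = "replace")
# row-bind the collapsed lists (fast)
dt <- data.table::rbindlist(new.l)
# restore the list names, which rbindlist drops
dt$names <- names(new.l)
dt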
   categories           tags  names
1:          R           list post_1
2:          R list|dataframe post_2
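To see what rapply does with the nesting, it can be instructive to change its `how` argument. With `how = "unlist"` the result is flattened into a named character vector instead of keeping the list structure. This is a base R sketch; the name `flat` is illustrative.

```r
# Rebuild the example list so the block is self-contained
my.l <- list(
  post_1 = list(categories = "R", tags = "list"),
  post_2 = list(categories = "R", tags = c("list", "dataframe"))
)

# Collapse each leaf to one string, then flatten the whole list
flat <- rapply(my.l, function(x) paste(x, collapse = "|"), how = "unlist")

# Names combine the outer and inner list names, e.g. "post_2.tags"
flat
```

The flattened vector is handy for a quick look, while `how = "replace"` keeps the nested shape that rbindlist needs.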
Separate Rows
Using the example above, separating by row makes more sense, as it allows the dataframe to be filtered by both category and tag.
library(magrittr) # provides the %>% pipe
dt %>% tidyr::separate_rows(tags, sep = "\\|")
# A tibble: 3 x 3
  categories tags      names
  <chr>      <chr>     <chr>
1 R          list      post_1
2 R          list      post_2
3 R          dataframe post_2
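For readers who prefer to avoid extra packages, roughly the same reshaping can be sketched in base R with `strsplit` and `rep`. This is not the tidyr implementation, only an approximation of its effect on this small example; the names `collapsed`, `tag_list` and `long` are hypothetical.

```r
# Start from the collapsed table produced in the previous section
collapsed <- data.frame(
  categories = c("R", "R"),
  tags = c("list", "list|dataframe"),
  names = c("post_1", "post_2"),
  stringsAsFactors = FALSE
)

# Split each tag string on the literal "|" delimiter
tag_list <- strsplit(collapsed$tags, "|", fixed = TRUE)

# Repeat the other columns once per tag to get one row per tag
long <- data.frame(
  categories = rep(collapsed$categories, lengths(tag_list)),
  tags = unlist(tag_list),
  names = rep(collapsed$names, lengths(tag_list)),
  stringsAsFactors = FALSE
)

# With one tag per row, filtering by tag is a plain subset
long[long$tags == "dataframe", ]
```

The final subset shows why the long layout is convenient: each tag can be matched exactly, without pattern matching inside a delimited string.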
Separate Columns
Using the example above, you can also separate the column into multiple columns by a delimiter character. I’m not sure it makes a lot of sense for this example.
dt %>% tidyr::separate(tags, into = c("tag_1", "tag_2"), sep = "\\|")
   categories tag_1     tag_2  names
1:          R  list      <NA> post_1
2:          R  list dataframe post_2
Other Packages
Two other packages offer similar functionality. The first package, purrr, has the functions map_dfr and map_dfc, which return a data frame created by row-binding and column-binding respectively. [1] The second package is rlist, which has the functions list.rbind and list.cbind for the task. [2]
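As a rough illustration of the purrr route, the sketch below converts each post to a tibble and row-binds the results. It assumes the purrr, tibble and dplyr packages are installed, and the `.id` column name "names" is my own choice. I believe recent purrr releases mark `map_dfr` as superseded, so treat this as one possible approach and not a recommendation.

```r
library(purrr)

# Rebuild the example list so the block is self-contained
my.l <- list(
  post_1 = list(categories = "R", tags = "list"),
  post_2 = list(categories = "R", tags = c("list", "dataframe"))
)

# as_tibble recycles the length-one 'categories' value to match 'tags';
# .id stores the list names in a column of the result
posts <- map_dfr(my.l, tibble::as_tibble, .id = "names")
posts
```

Unlike the plain rbindlist call, this keeps the post names without a separate assignment step.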
Acknowledgements
This blog post was made possible thanks to:
References
Disclaimer
The views, analysis and conclusions presented within this paper represent the author’s alone and not of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article was correct, it will nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimed as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author(s) received no financial support for the research, authorship, and/or publication of this article.
Reproducibility
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.6.3 (2020-02-29)
os macOS Catalina 10.15.7
system x86_64, darwin15.6.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Chicago
date 2021-04-04
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
blogdown * 1.2 2021-03-04 [1] CRAN (R 3.6.3)
bookdown 0.21 2020-10-13 [1] CRAN (R 3.6.3)
bslib 0.2.4 2021-01-25 [1] CRAN (R 3.6.2)
cachem 1.0.4 2021-02-13 [1] CRAN (R 3.6.2)
callr 3.5.1 2020-10-13 [1] CRAN (R 3.6.2)
cli 2.3.1 2021-02-23 [1] CRAN (R 3.6.3)
codetools 0.2-18 2020-11-04 [1] CRAN (R 3.6.2)
colorspace 2.0-0 2020-11-11 [1] CRAN (R 3.6.2)
crayon 1.4.1 2021-02-08 [1] CRAN (R 3.6.2)
DBI 1.1.1 2021-01-15 [1] CRAN (R 3.6.2)
desc 1.3.0 2021-03-05 [1] CRAN (R 3.6.3)
devtools * 2.3.2 2020-09-18 [1] CRAN (R 3.6.2)
digest 0.6.27 2020-10-24 [1] CRAN (R 3.6.2)
dplyr 1.0.5 2021-03-05 [1] CRAN (R 3.6.3)
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.2)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
fansi 0.4.2 2021-01-15 [1] CRAN (R 3.6.2)
farver 2.1.0 2021-02-28 [1] CRAN (R 3.6.3)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 3.6.2)
fs 1.5.0 2020-07-31 [1] CRAN (R 3.6.2)
generics 0.1.0 2020-10-31 [1] CRAN (R 3.6.2)
ggplot2 * 3.3.3 2020-12-30 [1] CRAN (R 3.6.2)
ggthemes * 4.2.4 2021-01-20 [1] CRAN (R 3.6.2)
glue 1.4.2 2020-08-27 [1] CRAN (R 3.6.2)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
highr 0.8 2019-03-20 [1] CRAN (R 3.6.0)
htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 3.6.2)
jquerylib 0.1.3 2020-12-17 [1] CRAN (R 3.6.2)
jsonlite 1.7.2 2020-12-09 [1] CRAN (R 3.6.2)
knitr 1.31 2021-01-27 [1] CRAN (R 3.6.2)
labeling 0.4.2 2020-10-20 [1] CRAN (R 3.6.2)
lifecycle 1.0.0 2021-02-15 [1] CRAN (R 3.6.2)
magrittr 2.0.1 2020-11-17 [1] CRAN (R 3.6.2)
memoise 2.0.0 2021-01-26 [1] CRAN (R 3.6.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
pillar 1.5.1 2021-03-05 [1] CRAN (R 3.6.3)
pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 3.6.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
pkgload 1.2.0 2021-02-23 [1] CRAN (R 3.6.3)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.0)
processx 3.4.5 2020-11-30 [1] CRAN (R 3.6.2)
ps 1.6.0 2021-02-28 [1] CRAN (R 3.6.3)
purrr 0.3.4 2020-04-17 [1] CRAN (R 3.6.2)
R6 2.5.0 2020-10-28 [1] CRAN (R 3.6.2)
remotes 2.2.0 2020-07-21 [1] CRAN (R 3.6.2)
rlang 0.4.10 2020-12-30 [1] CRAN (R 3.6.2)
rmarkdown 2.7 2021-02-19 [1] CRAN (R 3.6.3)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 3.6.2)
sass 0.3.1 2021-01-24 [1] CRAN (R 3.6.2)
scales 1.1.1 2020-05-11 [1] CRAN (R 3.6.2)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
stringi 1.5.3 2020-09-09 [1] CRAN (R 3.6.2)
stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
testthat 3.0.2 2021-02-14 [1] CRAN (R 3.6.2)
tibble 3.1.0 2021-02-25 [1] CRAN (R 3.6.3)
tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.2)
usethis * 2.0.1 2021-02-10 [1] CRAN (R 3.6.2)
utf8 1.1.4 2018-05-24 [1] CRAN (R 3.6.0)
vctrs 0.3.6 2020-12-17 [1] CRAN (R 3.6.2)
withr 2.4.1 2021-01-26 [1] CRAN (R 3.6.2)
xfun 0.21 2021-02-10 [1] CRAN (R 3.6.2)
yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.0)
[1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library