
Combining common identifiers in 2 lists of data tables

source link: https://www.codesd.com/item/combining-common-identifiers-in-2-lists-of-data-tables.html


I have two lists, each containing a few thousand data tables. The data tables contain ids, and each id appears only once within each list. Additionally, each data table has a different set of columns, though column names are shared with some other data tables. For example, in the lists created below, id 1 appears in the 1st data table in list1 and the 2nd data table in list2. In the first list, id 1 has data for columns 'a' and 'd'; in the second list it has data for columns 'a' and 'b'.

library(data.table)
# Create 2 lists of data tables
list1 <- list(data.table(id=c(1,3), a=c(0,0), d=c(1,1)),
              data.table(id=c(2,4), b=c(1,0), c=c(2,1), f=c(3,1)),
              data.table(id=c(5,6), a=c(4,0), b=c(2,1)))

list2 <- list(data.table(id=c(2,3,6), c=c(0,0,1), d=c(1,1,0), e=c(0,1,2)),
              data.table(id=c(1,4,5), a=c(1,0,3), b=c(2,1,2)))

What I need to do is find each id in both lists and average its values. For example, for id 1 the relevant rows are:

 list id  a  b  d
list1  1  0 NA  1
list2  1  1  2 NA

NA values are treated as 0, so the result for id 1 should be:

id   a b   d
 1 0.5 1 0.5

Next, the top 3 column names are selected and ordered by their values (largest first), so that the result is:

id    top3
 1   b d a
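Putting those two steps together for a single id, here is a minimal sketch of the per-id computation for id 1 alone, assuming we already know that id 1 sits in the 1st data table of list1 and the 2nd data table of list2:

v1 <- list1[[1]][id == 1]          # row for id 1 in list1
v2 <- list2[[2]][id == 1]          # row for id 1 in list2
bind <- rbind(v1, v2, fill=TRUE)   # align columns; missing ones become NA
for (j in seq_along(bind)){        # treat NA values as 0
  set(bind, which(is.na(bind[[j]])), j, 0)}
means <- colMeans(bind[, !"id"])   # average the two rows per column
paste(head(names(rev(sort(means))), 3), collapse=" ")
# [1] "b d a"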

This needs to be repeated for all ids. I have code that can achieve this (below), but for large lists with thousands of data tables and over a million ids it is very slow.

top3 <- data.table() # accumulator for the results (must exist before the loop)
for (i in 1:6){ # i is the id to be searched for
  for (j in 1:length(list1)){
    if (i %in% list1[[j]]$id){
      listnum1 <- j
      rownum1 <- which(list1[[j]]$id==i)
      break
    }
  }

  for (j in 1:length(list2)){
    if (i %in% list2[[j]]$id){
      listnum2 <- j
      rownum2 <- which(list2[[j]]$id==i)
      break
    }
  }

  v1 <- data.table(setDF(list1[[listnum1]])[rownum1,]) # Converting to data.frame using setDF and extracting the row is faster than using data.table
  v2 <- data.table(setDF(list2[[listnum2]])[rownum2,])
  bind <- rbind(v1, v2, fill=TRUE) # Combines two rows and fills in columns they don't have in common
  for (j in 1:ncol(bind)){ # Convert NAs to 0
    set(bind, which(is.na(bind[[j]])), j, 0)}
  means <- colMeans(bind[,2:ncol(bind),with=F]) # Average the two rows
  col_ids <- as.data.table(t(names(sort(means)[length(means):(length(means)-2)])))

  # select and order the top 3 column names and bind them to the results
  top3 <- rbind(top3, cbind(id=i, top3=data.table(do.call("paste", c(col_ids[,1:min(length(col_ids),3),with=F], sep=" ")))))
}

   id top3.V1
1:  1   b d a
2:  2   f c d
3:  3   d e c
4:  4   f c b
5:  5     a b
6:  6   e c b

When I run this code on my full data set (which has a few million ids), it only gets through about 400 ids in roughly 60 seconds, so it would take days to process everything. Converting each list into one much larger data table is not an option: there are 100,000 possible columns, so it becomes far too large. Is there a faster way to achieve the desired result?


Melt the individual data.tables down to long format and you won't run into the issue of wasted memory:

rbindlist(lapply(c(list1, list2), melt, id.var = 'id', variable.factor = F))[
  # find number of "rows" per id
  , nvals := max(rle(sort(variable))$lengths), by = id][
  # compute the means, assuming that missing values are equal to 0
  , sum(value)/nvals[1], by = .(id, variable)][
  # extract top 3 values
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
#   id    V1
#1:  1 b a d
#2:  2 f c b
#3:  3 d e a
#4:  4 b c f
#5:  5   a b
#6:  6 e b c
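To see why melting avoids the wasted memory, here is a quick look (for illustration only) at the long-format table the first step builds; each id contributes one row per column it actually has, rather than one cell for every one of the 100,000 possible columns. Filtered to id 1 it looks like this:

long <- rbindlist(lapply(c(list1, list2), melt, id.var = 'id', variable.factor = F))
long[id == 1]
#    id variable value
# 1:  1        a     0
# 2:  1        d     1
# 3:  1        a     1
# 4:  1        b     2

Summing value by (id, variable) and ranking those sums within each id is what the chained expression above does.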

Or, instead of rle, you can do:

rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , .(vals = sum(value), nvals = .N), by = .(id, variable)][
  , vals := vals / max(nvals), by = id][
  order(-vals), paste(head(variable, 3), collapse = " "), keyby = id]

Or better yet, as Frank points out, don't even bother with the mean: within an id every sum is divided by the same nvals, so dropping the division doesn't change the ranking:

rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , sum(value), by = .(id, variable)][
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
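A small optional follow-up, assuming the result of the chain above is assigned to a variable (res is my name, not from the original): rename the auto-generated V1 column so the output matches the question's top3 column.

res <- rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , sum(value), by = .(id, variable)][
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
setnames(res, "V1", "top3")   # rename the result column in place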

Tags: data.table
