
Combining common identifiers in 2 lists of data tables

source link: https://www.codesd.com/item/combining-common-identifiers-in-2-lists-of-data-tables.html


I have two lists, each containing a few thousand data tables. The data tables contain ids, and each id appears only once within each list. Additionally, each data table has a different set of columns, though column names are shared with some other data tables. For example, in the lists created below, id 1 appears in the 1st data table in list1 and the 2nd data table in list2. In the first list, id 1 has data for columns 'a' and 'd'; in the second list it has data for columns 'a' and 'b'.

library(data.table)
# Create 2 lists of data tables
list1 <- list(data.table(id=c(1,3), a=c(0,0), d=c(1,1)),
              data.table(id=c(2,4), b=c(1,0), c=c(2,1), f=c(3,1)),
              data.table(id=c(5,6), a=c(4,0), b=c(2,1)))

list2 <- list(data.table(id=c(2,3,6), c=c(0,0,1), d=c(1,1,0), e=c(0,1,2)),
              data.table(id=c(1,4,5), a=c(1,0,3), b=c(2,1,2)))

What I need to do is find each id in both lists and average its values. For example, for id 1 the relevant rows are:

 list id  a  b  d
list1  1  0 NA  1
list2  1  1  2 NA

NA values are treated as 0, so the result for id 1 should be:

id   a b   d
 1 0.5 1 0.5

Next, the top 3 column names are selected and ordered by their values (largest first), so that the result is:

id    top3
 1   b d a
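Putting those two steps together for a single id, here is a minimal sketch of the per-id computation for id 1 alone, assuming we already know that id 1 sits in the 1st data table of list1 and the 2nd data table of list2:

v1 <- list1[[1]][id == 1]          # row for id 1 in list1
v2 <- list2[[2]][id == 1]          # row for id 1 in list2
bind <- rbind(v1, v2, fill=TRUE)   # align columns; missing ones become NA
for (j in seq_along(bind)){        # treat NA values as 0
  set(bind, which(is.na(bind[[j]])), j, 0)}
means <- colMeans(bind[, !"id"])   # average the two rows per column
paste(head(names(rev(sort(means))), 3), collapse=" ")
# [1] "b d a"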

This needs to be repeated for all ids. I have code that can achieve this (below), but for large lists with thousands of data tables and over a million ids it is very slow.

top3 <- data.table() # accumulator for the results (must exist before the loop)
for (i in 1:6){ # i is the id to be searched for
  for (j in 1:length(list1)){
    if (i %in% list1[[j]]$id){
      listnum1 <- j
      rownum1 <- which(list1[[j]]$id==i)
      break
    }
  }

  for (j in 1:length(list2)){
    if (i %in% list2[[j]]$id){
      listnum2 <- j
      rownum2 <- which(list2[[j]]$id==i)
      break
    }
  }

  v1 <- data.table(setDF(list1[[listnum1]])[rownum1,]) # Converting to data.frame using setDF and extracting the row is faster than using data.table
  v2 <- data.table(setDF(list2[[listnum2]])[rownum2,])
  bind <- rbind(v1, v2, fill=TRUE) # Combines two rows and fills in columns they don't have in common
  for (j in 1:ncol(bind)){ # Convert NAs to 0
    set(bind, which(is.na(bind[[j]])), j, 0)}
  means <- colMeans(bind[,2:ncol(bind),with=F]) # Average the two rows
  col_ids <- as.data.table(t(names(sort(means)[length(means):(length(means)-2)])))

  # select and order the top 3 column names and bind them to the results
  top3 <- rbind(top3, cbind(id=i, top3=data.table(do.call("paste", c(col_ids[,1:min(length(col_ids),3),with=F], sep=" ")))))
}

   id top3.V1
1:  1   b d a
2:  2   f c d
3:  3   d e c
4:  4   f c b
5:  5     a b
6:  6   e c b

When I run this code on my full data set (which has a few million ids), it only gets through about 400 ids in roughly 60 seconds, so it would take days to process everything. Converting each list into one much larger data table is not an option: there are 100,000 possible columns, so it becomes far too large. Is there a faster way to achieve the desired result?


Melt the individual data.tables down to long format and you won't run into the issue of wasted memory:

rbindlist(lapply(c(list1, list2), melt, id.var = 'id', variable.factor = F))[
  # find number of "rows" per id
  , nvals := max(rle(sort(variable))$lengths), by = id][
  # compute the means, assuming that missing values are equal to 0
  , sum(value)/nvals[1], by = .(id, variable)][
  # extract top 3 values
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
#   id    V1
#1:  1 b a d
#2:  2 f c b
#3:  3 d e a
#4:  4 b c f
#5:  5   a b
#6:  6 e b c
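To see why melting avoids the wasted memory, here is a quick look (for illustration only) at the long-format table the first step builds; each id contributes one row per column it actually has, rather than one cell for every one of the 100,000 possible columns. Filtered to id 1 it looks like this:

long <- rbindlist(lapply(c(list1, list2), melt, id.var = 'id', variable.factor = F))
long[id == 1]
#    id variable value
# 1:  1        a     0
# 2:  1        d     1
# 3:  1        a     1
# 4:  1        b     2

Summing value by (id, variable) and ranking those sums within each id is what the chained expression above does.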

Or, instead of rle, you can do:

rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , .(vals = sum(value), nvals = .N), by = .(id, variable)][
  , vals := vals / max(nvals), by = id][
  order(-vals), paste(head(variable, 3), collapse = " "), keyby = id]

Or better yet, as Frank points out, don't even bother with the mean: within an id every sum is divided by the same nvals, so dropping the division doesn't change the ranking:

rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , sum(value), by = .(id, variable)][
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
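A small optional follow-up, assuming the result of the chain above is assigned to a variable (res is my name, not from the original): rename the auto-generated V1 column so the output matches the question's top3 column.

res <- rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , sum(value), by = .(id, variable)][
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
setnames(res, "V1", "top3")   # rename the result column in place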

Tags: data.table
