Question: Merge data rich in NA
gravatar for _r_am
4.0 years ago by
Baylor College of Medicine, Houston, TX
_r_am31k wrote:

Hello everyone!

Here are the disclaimers first:

  • I am not sure this is a pure bioinformatics question, but then I've been working on this for 2 weeks now and I'm a bioinformatician so I guess it qualifies
  • It's going to be a (kinda) long question, a typical Ram style rant, so strap in.

I'm trying to integrate data on biological samples from ~35 tabular data files. These could be TXT, CSV or XLS* files. I extracted the content from each file to its own data.frame and then set out merging these data frames to get one huge data frame.

All data frames have an ID column, and that's the column I'd use if I were SQL-JOINing them. However, they can also share other columns. So, to give an extremely simplified example, consider these two files:


id    col1    col2    col3
1    A1    B1    C1
2    A2    B2    <NA>
3    <NA>  B3    C3
3    <NA> <NA> C4


id    col1    col4    col3
1    A1    D1    C1
2    A2    D2    C2
3    <NA>  D3    <NA>

My desired output is:

id    col1    col2    col3 col4
1    A1    B1    C1    D1
2    A2    B2    C2    D2
3    <NA>  B3    C3    D3
3    <NA> <NA> C4    D3

Essentially, for each column, among matching rows, pick the non-NA value if only one such value exists. If multiple non-NA values exist, create a new row for each.

I'm using a merge function with by=intersect(colnames(df1),colnames(df2)),all=T. Is there any way I can use a custom match function in the merge function? Am I misunderstanding something about merge and being overly cautious with the by=intersect? Should I abandon R and try and tackle this with a programming language?

What do you recommend?

data python R • 1.3k views
ADD COMMENTlink modified 4.0 years ago by Liun30 • written 4.0 years ago by _r_am31k
gravatar for WouterDeCoster
4.0 years ago by
WouterDeCoster44k wrote:

Should I abandon R

Yes. All of us should.

In all seriousness, I would tackle this with a python script. Perhaps because I tackle everything with a python script, but that's not relevant right now :) Probably good R solutions exist but I usually can't wrap my head around stuff like this in R. This sounds like quite a mess too.

I would in python create a dictionary in which each key is an identifier of which the value is again a dictionary in which each column is a key and the value is a list containing the non-NA values which are extracted after sequentially looping over the files. After getting data out of all files I would iterate over the dictionary and write line(s).


samplesdict = {}
for file in allmyfiles:
    for line in file:
      linelist = line.split('\t')
      if line[0] in samplesdict.keys(): #check if this sample is already in the dictionary, assuming identifier is always first field
         #add my information to samplesdict[line[0]][col1], samplesdict[line[0]][col2], samplesdict[line[0]][col27] and samplesdict[line[0]][col36]
         #create entry for this sample

#get which cols are available in all samples
allfields = []
for sample in samplesdict.keys():

#after getting all information out of files
for sample in samplesdict.keys():
    if max([len(x) for x in samplesdict[sample].values()]) > 1:
        #create two lines for this


ADD COMMENTlink written 4.0 years ago by WouterDeCoster44k

Rant: Thanks for the primer. I invested a TON of time learning R and I hate Python because of its white space thing - why would I let a programming language dictate my white space usage? I'll probably end up using it anyway or maybe go to Perl - who knows!

ADD REPLYlink written 4.0 years ago by _r_am31k

If you feel more comfortable putting curly braces around everything you do that's fine :-) And yes, Perl is great if you want your code to look the same before and after RSA encryption.

ADD REPLYlink written 4.0 years ago by WouterDeCoster44k

I grew up around C-based languages. Python's whitespacing makes my line wrapping techniques go haywire. Plus it claims to be object oriented while it's a very filthy implementation of OO, much like (albeit better than) Perl. One too many mismatching quirks for me :)

Off topic rant: Best object orientation I've ever seen is C#, Java comes really really close, except for when it allows static members to be accessed by object references. Static members belong to the class, dammit, not to an object! Although, Java's extends BASE_CLASS implements INTERFACE1,INTERFACE2 is definitely more detailed than C#'s class Derived : Base, Isomething1,Isomething2

ADD REPLYlink written 4.0 years ago by _r_am31k

I'll use Python to read in the files with their headers into a Python (pandas?) data frame - does that sound OK? Would I be a terrible person if I ran a "cursor" through a Python (pandas?) data frame? In other words, which would make me more acceptable to society - iterating through a python structure that holds a collection of the tabular data manually or iterating through an R DF manually?

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by _r_am31k

I'm not sure if a python/pandas data frame would make things easier. I'm kinda fond of nested dictionaries for stuff like this, but that's probably also frowned upon by the average man on the street.

ADD REPLYlink written 4.0 years ago by WouterDeCoster44k
gravatar for Liun
4.0 years ago by
Liun30 wrote:
    if(length(naindex)==0) return(v1)
    for(i in naindex){
    for(i in 1:dim(ldf)[1]){
    for(i in 1:dim(rdf)[1]){
ADD COMMENTlink written 4.0 years ago by Liun30

There would be many mistakes in the code if you use it for other format input.since i wrote it by myself.And i am sorry about that.

The function using r package dplyr,and i agree with Mr Wang's opinion, so i use the package.

The function need two matrix or data frame as input.The two matrix must have a unique column which it's 'id' in your example.And not both matrix have duplicated value in the unique column or the function will breakdown.

If the function occurs other strange errors please write me.

ADD REPLYlink modified 4.0 years ago by _r_am31k • written 4.0 years ago by Liun30

Thank you - this looks great. I will work on adapting this for my purposes. There are duplications on the supposed-to-be unique columns on both data frames, so I may have to work around that somehow - I will figure it out.

ADD REPLYlink written 4.0 years ago by _r_am31k
> df1
  id col1 col2 col3
1  1   A1   B1   C1
2  2   A2   B2 <NA>
3  3 <NA>   B3   C3
4  3 <NA> <NA>   C4
> df2
  id col1 col4 col3
1  1   A1   D1   C1
2  2   A2   D2   C2
3  3 <NA>   D3 <NA>
> mergeDIY(df1,df2,"id")
  id col1 col2 col3 col4
1  1   A1   B1   C1   D1
2  2   A2   B2   C2   D2
3  3 <NA>   B3   C3   D3
4  3 <NA> <NA>   C4   D3
ADD REPLYlink written 4.0 years ago by Liun30
gravatar for zjhzwang
4.0 years ago by
zjhzwang180 wrote:
file1_df<-tbl_df(read.table("PATH1",header=T,stringsAsFactors = F,na.strings = "<NA>"))
file2_df<-tbl_df(read.table("PATH2",header=T,stringsAsFactors = F,na.strings = "<NA>"))
  }else {

And the result is :

> final_result
     id  col1 col2 col3 col4
[1,] "1" "A1" "B1" "C1" "D1"
[2,] "2" "A2" "B2" "C2" "D2"
[3,] "3" "NA" "B3" "C3" "D3"
[4,] "3" "NA" "B3" "C4" "D3"
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by zjhzwang180

Thank you. I've been using dplyr where it's appropriate, but did not mention it here because it leads to the same problem I'm having right now, which is that it gives me the data frame you show above instead of the data frame I'm looking for.

Pro-tip: The "Your requirement is wrong" response on these kind of posts doesn't make sense. I know what I want, so this is a case of you misunderstanding my requirement, not me being "wrong".

ADD REPLYlink written 4.0 years ago by _r_am31k

Sorry,maybe I misunderstand your meanings,I will check my answer again.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by zjhzwang180

zjhzwang edited the comment above. I'm adding this line to anyone confused on the conversation flow. The comment above used to read:

There is no such a line like this : 3 <na> B3 C3 D3

My response to it:

There is no such row in your data frame (and my current one), good sir (/madam). There is definitely one such row in my requirement.

I'm saying "I have Y, I want X". You're saying "No, you want X". Either you're incredibly foresightful (which can definitely be the case, for a lot of people including myself are often blinded by immediate details), or there's a communication gap.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by _r_am31k

I have updated my answer,and if my words make you uncomfortable, I appolgize

ADD REPLYlink written 4.0 years ago by zjhzwang180

Not at all - I like having my basic assumptions questioned. If that was your intention, you have my undivided attention :)

ADD REPLYlink written 4.0 years ago by _r_am31k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2282 users visited in the last hour