Question: Excluding columns in a dataframe based on a character string of column names to exclude
5.0 years ago by
confusedious420 wrote:

Hello everyone,

I apologise in advance if the terminology used in the title is misleading; I am not totally familiar with all object type terms, but I believe what I have posted is at least mostly correct.

I have a script for extracting sequences from a phyDat object (see packages 'ape' and 'phangorn') in R that is based on using subset and a character string of the column names I wish to retain. See code below:

newalign <- as.phyDat(subset(aligndf, select = seqkeep))

In this case, 'aligndf' is the complete original alignment that has been transformed into a data frame in an earlier part of the script. Here I use 'subset' and 'select' to generate a new alignment object via 'as.phyDat' that consists only of the sequence names contained in the object 'seqkeep'. As an example, the contents of 'seqkeep' look like the following:

[1] "hominin23"                                                           
[2] "hominin33"                                                                      
[3] "hominin47"

This procedure works well, and from this I gain exactly what I wanted, which is a new alignment that consists only of the sequences given in 'seqkeep'.

When I then try to write a second alignment that consists only of the sequences not in 'seqkeep', I run into a problem. No matter what I have tried, the resulting alignment is the complete original alignment that still includes the 'seqkeep' sequences.

Here are my most recent attempts based on some guides I have seen online:

remainalign <- as.phyDat(subset(aligndf, aligndf =! seqkeep))

remainalign <- as.phyDat(subset(aligndf, !(aligndf == seqkeep)))

Could anyone advise me on how to correctly render this task in R?

Thank you for your help.




From the Warning section of ?subset: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences."
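To make that warning concrete, here is a small sketch of one such consequence. The data frame 'toy' and the vector 'keep' are made up for illustration and do not come from the question. Because subset looks names up inside the data frame first, a variable that shares its name with a column gets shadowed:

```r
# Hypothetical toy data: one column happens to be called "keep".
toy <- data.frame(a = 1:2, keep = 3:4, b = 5:6)

# A character vector of column names, also called "keep".
keep <- c("a", "b")

# subset() evaluates `select` inside the data frame first, so `keep`
# resolves to the column named "keep", not to the vector above.
nse_names <- names(subset(toy, select = keep))

# Standard subsetting with [ uses the vector from the calling environment.
std_names <- names(toy[, keep, drop = FALSE])

print(nse_names)
print(std_names)
```

With [ there is no such ambiguity, which is one reason it is usually the safer choice inside scripts.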

written 5.0 years ago by Michael Dondrup
5.0 years ago by
Bergen, Norway
Michael Dondrup48k wrote:

Don't use the subset function; normal subsetting is much more readable. I am having problems determining the structure of your data. Please post head(seqkeep) and aligndf so we can see the column names. You possibly want something like:

aligndf[ ,!(colnames(aligndf) %in% seqkeep)]

Edit: changed to column selection, as it is unclear what your goal is here.

or even simpler

aligndf[, seqkeep] # keeps the seqkeep columns; works if all of
# seqkeep are in colnames(aligndf)
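As a rough sketch of both directions on a tiny made-up data frame: the column names 'hominin52' and 'hominin60' and the cell contents are invented for illustration, and the real 'aligndf' will be much larger.

```r
# Hypothetical stand-in for the alignment data frame: one column per sequence.
aligndf <- data.frame(
  hominin23 = c("a", "c", "g"),
  hominin33 = c("a", "c", "t"),
  hominin47 = c("a", "t", "g"),
  hominin52 = c("g", "c", "g"),
  hominin60 = c("a", "c", "c"),
  stringsAsFactors = FALSE
)
seqkeep <- c("hominin23", "hominin33", "hominin47")

# Logical mask: TRUE for columns whose name appears in seqkeep.
in_keep <- colnames(aligndf) %in% seqkeep

# drop = FALSE keeps a data frame even if only one column survives.
kept      <- aligndf[, in_keep, drop = FALSE]
remaining <- aligndf[, !in_keep, drop = FALSE]

print(colnames(kept))
print(colnames(remaining))
```

Building the mask once and negating it guarantees that the two results are exact complements of each other.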

Extracting in R is normally very straightforward to code and read;

see ?match, ?Extract and ?Comparison

In my R build the following is less readable but slightly faster than %in%:

aligndf[, !match(colnames(aligndf), seqkeep, nomatch = 0)] # if you need to do that often

You can further speed this up using the package fastmatch.
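For what it is worth, the match-based mask and the %in%-based mask select the same columns. A quick check in base R only (no fastmatch), on made-up column names:

```r
cols    <- c("hominin23", "hominin33", "hominin47", "hominin52")  # made-up names
seqkeep <- c("hominin33", "hominin23")

# match() returns the position of each element of `cols` in `seqkeep`,
# or 0 (because nomatch = 0) when it is absent.
pos <- match(cols, seqkeep, nomatch = 0)

# ! coerces numbers to logical: 0 becomes TRUE, any other value FALSE.
mask_match <- !pos
mask_in    <- !(cols %in% seqkeep)

print(pos)
print(identical(mask_match, mask_in))
```

Any speed difference between the two is likely to be small, so readability is a reasonable tie-breaker unless this runs in a tight loop.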

Also, we are moving far away from bioinformatics here.

subset(aligndf, aligndf =! seqkeep) # what's wrong?

Also, the second argument of subset selects rows, not columns; columns are chosen with the select argument.

You are trying to extract the column aligndf from aligndf and compare it to a smaller vector using the non-existent operator =!. You meant !=, but comparison is not the same as a set operation, and == or != are not the right operators. It did not throw an error only because R reads aligndf =! seqkeep as aligndf = !seqkeep, that is, an argument named aligndf rather than a comparison, so no row or column condition is ever applied and the complete data frame comes back.
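A small sketch of why == is the wrong tool for set membership, using made-up names. The shorter vector is recycled and compared position by position, so names that are present can still come out as FALSE:

```r
cols    <- c("hominin23", "hominin33", "hominin47", "hominin52")  # made-up names
seqkeep <- c("hominin33", "hominin23")

# == compares position by position and recycles the shorter vector:
# cols[1] vs seqkeep[1], cols[2] vs seqkeep[2], cols[3] vs seqkeep[1], ...
elementwise <- cols == seqkeep

# %in% asks, for each element of cols, whether it occurs anywhere in seqkeep.
membership <- cols %in% seqkeep

print(elementwise)
print(membership)
```

Here both names in 'seqkeep' occur in 'cols', yet == finds no match at all because they sit at different positions.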


Also, I apologise if the content drifted a little too far into basic scripting as opposed to bioinformatics proper.

I have always had a much more positive experience getting answers here than on StackOverflow; people here are more understanding about the fact that it can take time for someone from a biology background to fully grasp handling computational and scripting issues effectively.

written 5.0 years ago by confusedious

Thank you Michael.

The below solved my problem nicely:

aligndf[ ,!(colnames(aligndf) %in% seqkeep)]

The object 'seqkeep' was a list of column names. I have now successfully altered the script so that it writes out two new .fasta alignments: the first with only the sequences in 'seqkeep' and the second with only those not in 'seqkeep'.

written 5.0 years ago by confusedious
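For anyone landing here later, the split could be wrapped up roughly as follows. This is only a sketch, not the actual script: 'split_columns', the toy data, the extra sequence names and the output file names are all hypothetical, and the write-out step assumes phangorn's as.phyDat and write.phyDat accept a data frame shaped like this one.

```r
# Hypothetical helper: split a data frame's columns into those named in
# `keep` and all the others, refusing names that do not exist.
split_columns <- function(df, keep) {
  missing_names <- setdiff(keep, colnames(df))
  if (length(missing_names) > 0) {
    stop("not in data frame: ", paste(missing_names, collapse = ", "))
  }
  in_keep <- colnames(df) %in% keep
  list(kept = df[, in_keep, drop = FALSE],
       remaining = df[, !in_keep, drop = FALSE])
}

# Made-up stand-in for the alignment data frame (one column per sequence).
aligndf <- data.frame(
  hominin23 = c("a", "c", "g"),
  hominin33 = c("a", "c", "t"),
  hominin47 = c("a", "t", "g"),
  hominin52 = c("g", "c", "g"),
  hominin60 = c("a", "c", "c"),
  stringsAsFactors = FALSE
)
seqkeep <- c("hominin23", "hominin33", "hominin47")

parts <- split_columns(aligndf, seqkeep)

# Optional write-out, only attempted when phangorn is installed.
# File names are placeholders; files go to the session's temp directory.
if (requireNamespace("phangorn", quietly = TRUE)) {
  phangorn::write.phyDat(phangorn::as.phyDat(parts$kept),
                         file = file.path(tempdir(), "kept.fasta"),
                         format = "fasta")
  phangorn::write.phyDat(phangorn::as.phyDat(parts$remaining),
                         file = file.path(tempdir(), "remaining.fasta"),
                         format = "fasta")
}
```

Checking for unknown names up front catches typos in 'seqkeep' that %in% would otherwise pass over silently.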