Excluding columns in a dataframe based on a character string of column names to exclude
8.3 years ago
confusedious ▴ 470

Hello everyone,

I apologise in advance if the terminology used in the title is misleading; I am not totally familiar with all object type terms, but I believe what I have posted is at least mostly correct.

I have a script for extracting sequences from a phyDat object (see packages 'ape' and 'phangorn') in R that is based on using subset and a character vector of the column names I wish to retain. See code below:

newalign <- as.phyDat(subset(aligndf, select = seqkeep))

In this case, 'aligndf' is the complete original alignment that has been transformed into a data frame in an earlier part of the script. Here I use 'subset' and 'select' to generate a new alignment object via 'as.phyDat' that consists only of the sequence names contained in the object 'seqkeep'. As an example, the contents of 'seqkeep' look like the following:

[1] "hominin23"
[2] "hominin33"
[3] "hominin47"

This procedure works well, and from this I gain exactly what I wanted, which is a new alignment that consists only of the sequences given in 'seqkeep'.

When I then try to write a second alignment that consists only of the sequences not in 'seqkeep', I run into a problem. No matter what I try, the resulting alignment is the complete original alignment, which still includes the 'seqkeep' sequences.

Here are my most recent attempts based on some guides I have seen online:

remainalign <- as.phyDat(subset(aligndf, aligndf =! seqkeep))

remainalign <- as.phyDat(subset(aligndf, !(aligndf == seqkeep)))

Could anyone advise me on how to correctly render this task in R?

Thank you for your help.

R data frame • 3.8k views

?subset

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

8.3 years ago
Michael 54k

Don't use the subset function; normal subsetting is much more readable. I am having trouble determining the structure of your data. Please post head(seqkeep) and aligndf so we can see the column names. You possibly want something like:

aligndf[ ,!(colnames(aligndf) %in% seqkeep)]

Edit: changed to column selection; it is unclear what your goal is here.

or even simpler

aligndf[, seqkeep] # if seqkeep is compatible with the column names
# and all of seqkeep are in colnames(aligndf)
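Both forms above can be tried on a toy data frame. This is only a sketch: the column names and contents are invented, and it assumes 'aligndf' has one column per sequence, as the question describes.

```r
# Toy stand-in for the alignment data frame: one column per sequence.
# Names and contents are invented for illustration.
aligndf <- data.frame(
  hominin23 = c("a", "c", "g"),
  hominin33 = c("a", "c", "t"),
  hominin47 = c("a", "t", "g"),
  hominin52 = c("g", "c", "g"),
  stringsAsFactors = FALSE
)
seqkeep <- c("hominin23", "hominin33", "hominin47")

# Columns named in seqkeep
kept <- aligndf[, seqkeep, drop = FALSE]

# Columns NOT named in seqkeep
remaining <- aligndf[, !(colnames(aligndf) %in% seqkeep), drop = FALSE]

colnames(kept)
colnames(remaining)
```

The drop = FALSE is an addition here: without it, a selection that leaves a single column comes back as a plain vector rather than a data frame.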

Extracting in R is normally very straightforward to code and read.

See ?match ?Extract ?Comparison

In my R build the following is less readable but slightly faster than %in%:

aligndf[, !match(colnames(aligndf), seqkeep, nomatch = 0)] # if you need to do that often

You can further speed this up using the fastmatch package.
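To see why the match form selects the same columns as %in%, here is a small sketch on invented column names. It uses base R only; fastmatch is not loaded here, and my understanding is that its fmatch() is intended as a drop-in replacement for match(), which you should check against its documentation.

```r
cols <- c("hominin23", "hominin33", "hominin47", "hominin52")
seqkeep <- c("hominin47", "hominin23")

# match() returns the position of each column name within seqkeep,
# or 0 (because of nomatch = 0) when the name is absent
pos <- match(cols, seqkeep, nomatch = 0)

# Negating turns 0 into TRUE and any position into FALSE,
# so TRUE marks the columns that are NOT in seqkeep
exclude_by_match <- !pos
exclude_by_in <- !(cols %in% seqkeep)

identical(exclude_by_match, exclude_by_in)
```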

Also, we are moving far away from bioinformatics here.


subset(aligndf, aligndf =! seqkeep) # what's wrong?

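Also, subset works on rows, not columns, by default.

You are trying to extract the column aligndf from aligndf and compare it to a smaller vector using the non-existent operator =!. R reads aligndf =! seqkeep as aligndf = !seqkeep, that is, as an argument named aligndf, so no condition is actually passed to subset. You meant !=, but comparison is not the same as a set operation, and == or != are not the right operators: they compare element by element, whereas you want to test membership. It is just coincidence that it didn't throw an error in the first place.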
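The difference between comparison and a set operation can be shown on plain character vectors. This is an illustrative sketch with invented names, not the original data:

```r
cols <- c("hominin23", "hominin33", "hominin47", "hominin52")
seqkeep <- c("hominin33", "hominin23")

# == compares element by element: position 1 with 1, 2 with 2,
# then recycles seqkeep for positions 3 and 4
elementwise <- cols == seqkeep

# %in% asks, for each element of cols, whether it occurs anywhere in seqkeep
membership <- cols %in% seqkeep

# setdiff() returns the names in cols that are not in seqkeep
excluded <- setdiff(cols, seqkeep)

elementwise
membership
excluded
```

Here both names in seqkeep are present in cols, yet the element-wise comparison misses them because they sit at different positions.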


Also, I apologise if the content drifted a little too far into basic scripting as opposed to bioinformatics proper.

I have always had a much more positive experience getting answers here than on StackOverflow; people here are more understanding about the fact that it can take time for someone from a biology background to fully grasp handling computational and scripting issues effectively.


Thank you Michael.

The below solved my problem nicely:

aligndf[ ,!(colnames(aligndf) %in% seqkeep)]

The object 'seqkeep' was a list of column names. I have now successfully altered the script so that it writes out two new .fasta alignments: the first with only the sequences in 'seqkeep' and the second with only those not in 'seqkeep'.
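For anyone finding this later, the two-way split might look roughly like the sketch below. The helper name split_columns, the toy data, and the output file names are all made up for illustration; the phangorn calls are left as comments because they depend on your real alignment, and their arguments should be checked against the package documentation.

```r
# Hypothetical helper: split a data frame's columns into those named in
# `keep` and everything else. Column order follows the data frame.
split_columns <- function(df, keep) {
  keep <- unlist(keep)  # accepts a list or a character vector of names
  in_keep <- colnames(df) %in% keep
  list(
    kept = df[, in_keep, drop = FALSE],
    remaining = df[, !in_keep, drop = FALSE]
  )
}

# Toy data, invented for illustration
aligndf <- data.frame(
  hominin23 = c("a", "c", "g"),
  hominin33 = c("a", "c", "t"),
  hominin47 = c("a", "t", "g"),
  stringsAsFactors = FALSE
)
parts <- split_columns(aligndf, list("hominin23", "hominin33"))

colnames(parts$kept)
colnames(parts$remaining)

# With phangorn loaded, each part could then be converted and written out,
# for example (assumed usage, file names hypothetical):
# write.phyDat(as.phyDat(parts$kept), file = "kept.fasta", format = "fasta")
# write.phyDat(as.phyDat(parts$remaining), file = "remaining.fasta", format = "fasta")
```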
