Subset FASTA file for taxonomy (phylum name) in R
1
1
Entering edit mode
2.2 years ago
Ellen ▴ 20

Dear all,

I would like to subset a FASTA file so that I get the sequences belonging to a certain phylum (in my case: Nematoda). The headers of the FASTA file start with the phylum name, so I thought this would be straightforward to do this in R, but I don't know how.

The FASTA file I am referring to can be downloaded from: http://www.reference-midori.info/download.php# , and I downloaded the following database: 'download/Databases/GenBank250/DADA2/longest/MIDORI2_LONGEST_NUC_GB250_CO1_DADA2.fasta.gz'.

Any tips?

Thanks! Ellen

FASTA R • 955 views
ADD COMMENT
0
Entering edit mode

Post examples of actual fasta headers since that is critical information to get help.

ADD REPLY
0
Entering edit mode

my apologies! I actually did not appear to get a notification that a reply had been posted, hence the delay.

here are some example headers:

[1] "Discosea_555280;Flabellinia_1485085;order_Vannellidae_95227;Vannellidae_95227;Clydonella_218657;Clydonella"
[2] "Discosea_555280;Flabellinia_1485085;order_Vannellidae_95227;Vannellidae_95227;Paravannella_1443143;Paravannella"

ADD REPLY
2
Entering edit mode
2.2 years ago

Does it need to be in R? Using seqkit, it would be just

seqkit grep -r -p Nematoda  MIDORI2_LONGEST_NUC_GB250_CO1_DADA2.fasta

or

zcat MIDORI2_LONGEST_NUC_GB250_CO1_DADA2.fasta.gz | seqkit grep -r -p Nematoda

and you are done.

ADD COMMENT
1
Entering edit mode

Thank you so much! This worked!

In the meantime I had found another way using bash/shell (see below), but your way seems a bit simpler.

awk '/^>/ { p = ($0 ~ /Nematoda/)} p' ../MIDORI2_LONGEST_NUC_GB250_CO1_DADA2.fasta> MIDORI2_LONGEST_NUC_GB250_CO1_DADA2_Nematoda.fa
ADD REPLY

Login before adding your answer.

Traffic: 1717 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6