Question

Subset FASTA file for taxonomy (phylum name) in R

1

Entering edit mode

21 months ago

Ellen ▴ 20

Dear all,

I would like to subset a FASTA file so that I get the sequences belonging to a certain phylum (in my case: Nematoda). The headers of the FASTA file start with the phylum name, so I thought this would be straightforward to do this in R, but I don't know how.

The FASTA file I am referring to can be downloaded from: http://www.reference-midori.info/download.php# , and I downloaded the following database: 'download/Databases/GenBank250/DADA2/longest/MIDORI2_LONGEST_NUC_GB250_CO1_DADA2.fasta.gz'.

Any tips?

Thanks! Ellen

FASTA R • 806 views

ADD COMMENT • link 11 months ago by Ellen ▴ 20

0

Entering edit mode

Post examples of actual fasta headers since that is critical information to get help.

ADD REPLY • link 21 months ago by GenoMax 142k

0

Entering edit mode

my apologies! I actually did not appear to get a notification that a reply had been posted, hence the delay.

here are some example headers:

[1] "Discosea_555280;Flabellinia_1485085;order_Vannellidae_95227;Vannellidae_95227;Clydonella_218657;Clydonella"
[2] "Discosea_555280;Flabellinia_1485085;order_Vannellidae_95227;Vannellidae_95227;Paravannella_1443143;Paravannella"

ADD REPLY • link 11 months ago by Ellen ▴ 20

score 2 · Accepted Answer · 2022-08-24

2

Entering edit mode

21 months ago

Matthias Zepper 4.6k

Does it need to be in R? Using seqkit, it would be just

seqkit grep -r -p Nematoda  MIDORI2_LONGEST_NUC_GB250_CO1_DADA2.fasta

or

zcat MIDORI2_LONGEST_NUC_GB250_CO1_DADA2.fasta.gz | seqkit grep -r -p Nematoda

and you are done.

ADD COMMENT • link 21 months ago by Matthias Zepper 4.6k

1

Entering edit mode

Thank you so much! This worked!

In the meantime I had found another way using bash/shell (see below), but your way seems a bit simpler.

awk '/^>/ { p = ($0 ~ /Nematoda/)} p' ../MIDORI2_LONGEST_NUC_GB250_CO1_DADA2.fasta> MIDORI2_LONGEST_NUC_GB250_CO1_DADA2_Nematoda.fa

ADD REPLY • link 11 months ago by Ellen ▴ 20