How to remove poor quality sequences from a fasta file?
3.7 years ago
Kumar ▴ 120

I have a FASTA file that consists of thousands of viral genomes. I need to remove poor-quality genomes in which more than 30% of the bases are N. Could you please suggest how to do this?

perl python shell bash fasta • 1.4k views

I need to remove poor-quality genomes in which more than 30% of the bases are N.

You can't assume that those genomes are of poor quality. Perhaps those regions were simply not sequenced. Ns are also used to pad or mark regions that were not sequenced, or that cannot be sequenced with current technologies.


I agree, @genomax, but these sequences cause problems during alignment, which is why I would like to remove them. I have a large number of viral genomes, so I would rather keep the genomes with confidently called bases than those containing many Ns.

3.7 years ago
microfuge ★ 1.9k

The UCSC Genome Browser utilities include a binary that does exactly that, available here: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faFilterN

faFilterN - Get rid of sequences with too many N's

usage: faFilterN in.fa out.fa maxPercentN
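
If running a prebuilt Linux binary is not an option, the same kind of filter can be sketched in plain Python. This is only a minimal sketch and not faFilterN's actual implementation: the names `read_fasta`, `n_percent`, and `filter_fasta` are hypothetical, only `N`/`n` is counted as an unknown base, and a genome at exactly the threshold is kept (an assumption; faFilterN may treat the boundary differently).

```python
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def n_percent(seq):
    """Return the percentage of N/n characters in seq (0.0 if seq is empty)."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)


def filter_fasta(in_path, out_path, max_percent_n=30.0, width=60):
    """Copy records with at most max_percent_n percent N; return (kept, dropped)."""
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in read_fasta(src):
            # Assumption: strictly more than the threshold means "drop".
            if n_percent(seq) > max_percent_n:
                dropped += 1
                continue
            kept += 1
            dst.write(f">{header}\n")
            # Re-wrap the sequence at a fixed line width.
            for i in range(0, len(seq), width):
                dst.write(seq[i:i + width] + "\n")
    return kept, dropped


if __name__ == "__main__" and len(sys.argv) == 4:
    kept, dropped = filter_fasta(sys.argv[1], sys.argv[2], float(sys.argv[3]))
    print(f"kept {kept}, dropped {dropped}", file=sys.stderr)
```

Saved under a hypothetical name such as `filter_n.py`, it would be run as `python filter_n.py in.fa out.fa 30`, with the threshold given as a bare number. Note that the output sequences are re-wrapped at 60 characters per line, so line lengths may differ from the input.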


Thank you, @microfuge. One clarification: on the command line, should maxPercentN be replaced by 30 or 30%?


From what I remember, just the number, without the percent sign.


What do you think? Will % be a valid input for an option?


My rule of thumb: If doing something takes less than 10 seconds and can't hurt anyone, I just do it. Writing a question and waiting for a response is guaranteed to take longer than that.


I apologize for that.
