Removing sequences that contain Ns from a FASTA file
8.1 years ago
kws15 ▴ 40

Hi everyone,

I have a giant FASTA file, but some of the sequences have Ns in them:

GeneID:107003026

AAATTTACTTGTCCTTGTGAT

GeneID:107005138

TATGCACNNNGGTTGC

GeneID:107004481

GATTTTATGTTGCTGAA

So the second one has Ns in it. What can I do to get rid of the whole sequence, so that the outcome looks like this? Thank you very much.

GeneID:107003026

AAATTTACTTGTCCTTGTGAT

GeneID:107004481

GATTTTATGTTGCTGAA

fasta • 7.1k views

Did you search for similar posts on biostars.org? What did you find? What have you tried?


There are many ways to do this correctly, but are you sure you want to? What is your rationale?


I am analysing promoter sequences from two closely related species. I need to align them and assess their similarity, so I may need to use BLAST, but it gives me an error when I try that in R on sequences containing Ns. So I guess I will just have to ignore the sequences that have Ns.


That doesn't look like a FASTA file (no ">").


Yeah, there are '>' characters in my file; they just disappeared when I posted it here for some reason.


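If you want to use only the default Unix tools, you can use grep to drop every line that contains an N (assuming single-line sequences and sequence names without Ns):

grep -v "N" in.fa

Then filter out the empty records (headers whose sequence was removed by grep) by piping the grep output through awk. Note that the record separator must be quoted, otherwise the shell treats '>' as a redirection:

grep -v "N" in.fa | awk '$2{print RS}$2' FS='\n' RS='>' ORS=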


I do not recommend this, as it is unsafe. A good solution should handle all possible Fasta variants, whether they are multi-line, contain 'N' in headers, etc.
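To illustrate the point about multi-line records and 'N' in headers, here is a minimal standard-library Python sketch, not a definitive implementation. The function names `read_fasta` and `drop_n_records` and the script name `drop_n.py` are hypothetical. It assumes that lowercase 'n' should also count as N, and that records with an empty sequence can be dropped.

```python
import sys
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence line belonging to the current record
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def drop_n_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA text for records whose sequence has no N (either case)."""
    for header, seq in read_fasta(lines):
        # Only the sequence is tested, so an 'N' in the header is harmless.
        # Assumption: records with an empty sequence are dropped as well.
        if seq and "N" not in seq.upper():
            yield f">{header}\n{seq}\n"


if __name__ == "__main__":
    # Usage (hypothetical script name): python drop_n.py in.fa > out.fa
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            sys.stdout.writelines(drop_n_records(handle))
```

Because both functions are generators, only one record is held in memory at a time, and the output is written as single-line FASTA.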


It didn't look like any of those sequences were in danger of being multi-line. To be safe, you can convert multi-line fasta to single-line fasta:

awk '/^>/ {printf("%s%s\n", (NR>1 ? "\n" : ""), $0); next} {printf("%s", $0)} END {printf("\n")}' in.fa
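For readers who prefer Python, the same unwrapping idea can be sketched with the standard library only. The function name `unwrap_fasta` is hypothetical, and the sketch assumes that blank lines inside a record can be ignored.

```python
from typing import Iterable, Iterator


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and each record's sequence as one line."""
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record's sequence, if any
            if chunks:
                yield "".join(chunks) + "\n"
                chunks = []
            yield line + "\n"
        elif line:
            # Assumption: blank lines carry no sequence and are skipped
            chunks.append(line)
    if chunks:
        yield "".join(chunks) + "\n"
```

It could be used as `sys.stdout.writelines(unwrap_fasta(handle))` on an open file handle, after which a line-based filter such as grep sees each sequence on a single line.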
8.1 years ago

Python, using Biopython

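import sys
from Bio import SeqIO

# Keep only records whose sequence contains no 'N' (all kept records are held in memory).
# The "rU" mode is no longer accepted by recent Python versions, so use the default text mode.
with open(sys.argv[1]) as handle:
    filtered = [record for record in SeqIO.parse(handle, "fasta") if record.seq.count("N") == 0]

with open("N_removed.fasta", "w") as output_handle:
    SeqIO.write(filtered, output_handle, "fasta")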

Save the script (e.g. removeNfromfas.py) and execute it as python removeNfromfas.py yourfile.fasta

Updated version 19 months later:

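import sys
from Bio import SeqIO

# Stream records one at a time and print only those without an 'N'
for record in SeqIO.parse(sys.argv[1], "fasta"):
    if record.seq.count("N") == 0:
        # format() already ends with a newline, so suppress print's own
        print(record.format("fasta"), end="")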

Save the script (e.g. removeNfromfas.py) and execute it as `python removeNfromfas.py yourfile.fasta > newfile.fasta`. This version is more flexible and can handle enormous files if necessary, because it has lower memory requirements.


Thank you very much, that works!

8.1 years ago

Using pyfaidx and awk:
