Question

Remove Invalid Nucleotides From Fasta File

0

Entering edit mode

11.8 years ago

Justin ▴ 460

Fasta files typically contain invalid characters such as "NNNN", how can I remove those and then make sure I still have e.g. a 60 nucleotides per line?

I feel like there's probably a package that does this already.

fasta nucleotide • 8.7k views

ADD COMMENT • link updated 11.8 years ago by Flashton • 0 • written 11.8 years ago by Justin ▴ 460

2

Entering edit mode

Others have pointed out that "N" is not invalid; it means "base not known". I'll point it out again, because it's important.

ADD REPLY • link 11.8 years ago by Neilfws 49k

score 4 · Answer 1 · 2012-07-03

4

Entering edit mode

11.8 years ago

Leonor Palmeira 3.8k

You could this in R with the seqinR package:

library(seqinr)
x=read.fasta("myoriginal.fasta")
new=lapply(seq(length(x)), function(i) {
      s2c(gsub("N","",c2s(getSequence(x[[i]]))))
})
write.fasta(new,names=names(x),file="newfile.fasta",nbchar=60)

But the N characters do have a meaning, they are not 'invalid nucleotides'. They mean that this base is not fully determined in the sequence you have, so you should make sure taking them out is really what you want to do...

ADD COMMENT • link 11.8 years ago by Leonor Palmeira 3.8k

4

Entering edit mode

Commenting to re-emphasize this point: "you should make sure taking them out is really what you want to do"

ADD REPLY • link 11.8 years ago by Chris Miller 22k

score 2 · Answer 2 · 2012-07-03

this simple C program should discard all the non ATGC nucleotides:

#include <stdio.h>
int main(int argc,char** argv)
    {
    int c;
    int N=0;
    do
        {
        c=fgetc(stdin);
        switch(c)
            {
            case EOF:
            case '>':
                {
                if(N!=0) fputc('\n',stdout);
                if(c==EOF) break;
                fputc(c,stdout);
                while((c=fgetc(stdin))!=EOF && c!='\n')
                    {
                    fputc(c,stdout);
                    }
                fputc('\n',stdout);
                N=0;
                break;
                }
            case 'A': case 'a':
            case 'T': case 't':
            case 'G': case 'g':
            case 'C': case 'c':
                {
                if(N==60) { fputc('\n',stdout); N=0;}
                fputc(c,stdout);
                ++N;
                break;
                }
            default:break;
            }
        } while(c!=EOF);
    return 0;
    }

score 2 · Answer 3 · 2012-07-03

try this kinda ugly python script: (need biopython)

import sys
from Bio import SeqIO

faFile = open('your file.fa','r')

for record in SeqIO.parse(faFile,'fasta'):
  filteredSeq = str(record.seq).lower().replace('n','')
  outputLines = [filteredSeq[i:i+60] for i in range(0, len(filteredSeq), 60)]

  print ">" + record.id
  for outputLine in outputLines:
    print outputLine

score 0 · Answer 4 · 2012-10-24

A colleague helped me do this in quite a neat way using regular expressions in python (biopython required).

from Bio import SeqIO
import re

handle = open("/Users/phil/Documents/Infernal/NCBI_nr_clean.fasta", "rU")

sequences={}

for record in SeqIO.parse(handle, "fasta") :
    #replace any non-GATC characters in your sequences with nothing
    sequence = re.sub('[^GATC]', "", str(record.seq).upper())
    #discard sequences that are less than 60 bp long
    if (len(sequence))>=60:
        sequences[sequence]=record.id

output_file=open("output.fasta","w+")

for sequence in sequences:
    output_file.write(">"+sequences[sequence]+"\n"+sequence+"\n")
output_file.close()
handle.close()