Question: Remove Invalid Nucleotides From Fasta File
0
gravatar for Justin
6.7 years ago by
Justin440
United States
Justin440 wrote:

Fasta files typically contain invalid characters such as "NNNN", how can I remove those and then make sure I still have e.g. a 60 nucleotides per line?

I feel like there's probably a package that does this already.

fasta nucleotide • 5.3k views
ADD COMMENTlink written 6.7 years ago by Justin440
2

Others have pointed out that "N" is not invalid; it means "base not known". I'll point it out again, because it's important.

ADD REPLYlink written 6.7 years ago by Neilfws48k
4
gravatar for Leonor Palmeira
6.7 years ago by
Leonor Palmeira3.7k
Li├Ęge, Belgium
Leonor Palmeira3.7k wrote:

You could this in R with the seqinR package:

library(seqinr)
x=read.fasta("myoriginal.fasta")
new=lapply(seq(length(x)), function(i) {
      s2c(gsub("N","",c2s(getSequence(x[[i]]))))
})
write.fasta(new,names=names(x),file="newfile.fasta",nbchar=60)

But the N characters do have a meaning, they are not 'invalid nucleotides'. They mean that this base is not fully determined in the sequence you have, so you should make sure taking them out is really what you want to do...

ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Leonor Palmeira3.7k
4

Commenting to re-emphasize this point: "you should make sure taking them out is really what you want to do"

ADD REPLYlink written 6.7 years ago by Chris Miller20k
2
gravatar for Pierre Lindenbaum
6.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum117k wrote:

this simple C program should discard all the non ATGC nucleotides:

#include <stdio.h>
int main(int argc,char** argv)
    {
    int c;
    int N=0;
    do
        {
        c=fgetc(stdin);
        switch(c)
            {
            case EOF:
            case '>':
                {
                if(N!=0) fputc('\n',stdout);
                if(c==EOF) break;
                fputc(c,stdout);
                while((c=fgetc(stdin))!=EOF && c!='\n')
                    {
                    fputc(c,stdout);
                    }
                fputc('\n',stdout);
                N=0;
                break;
                }
            case 'A': case 'a':
            case 'T': case 't':
            case 'G': case 'g':
            case 'C': case 'c':
                {
                if(N==60) { fputc('\n',stdout); N=0;}
                fputc(c,stdout);
                ++N;
                break;
                }
            default:break;
            }
        } while(c!=EOF);
    return 0;
    }
ADD COMMENTlink written 6.7 years ago by Pierre Lindenbaum117k
2
gravatar for Damian Kao
6.7 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

try this kinda ugly python script: (need biopython)

import sys
from Bio import SeqIO

faFile = open('your file.fa','r')

for record in SeqIO.parse(faFile,'fasta'):
  filteredSeq = str(record.seq).lower().replace('n','')
  outputLines = [filteredSeq[i:i+60] for i in range(0, len(filteredSeq), 60)]

  print ">" + record.id
  for outputLine in outputLines:
    print outputLine
ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Damian Kao15k
0
gravatar for Flashton
6.4 years ago by
Flashton0
Flashton0 wrote:

A colleague helped me do this in quite a neat way using regular expressions in python (biopython required).

from Bio import SeqIO
import re

handle = open("/Users/phil/Documents/Infernal/NCBI_nr_clean.fasta", "rU")

sequences={}

for record in SeqIO.parse(handle, "fasta") :
    #replace any non-GATC characters in your sequences with nothing
    sequence = re.sub('[^GATC]', "", str(record.seq).upper())
    #discard sequences that are less than 60 bp long
    if (len(sequence))>=60:
        sequences[sequence]=record.id

output_file=open("output.fasta","w+")

for sequence in sequences:
    output_file.write(">"+sequences[sequence]+"\n"+sequence+"\n")
output_file.close()
handle.close()
ADD COMMENTlink written 6.4 years ago by Flashton0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2093 users visited in the last hour