Question: Trim sequences in a FASTA file from Sanger sequencing
5.5 years ago, merajazizmeraj (United States) wrote:

Is there a tool that trims FASTA sequences at the beginning and end?

By trimming I mean taking the input sequences below and generating the outputs shown. I have more than 3000 sequences.

Thanks.

INPUT:

>Seq1
NNNNNNNNNNTTGNNNNGGATNTCCTTTCCGAATATTTTTGGTGCATTTGTAATAAATGTCATTTNTCTCCTTTTTAAAGGAATTGTCTTAGAAGAAAGAAGGCAAGCCACCATTTTACCCACGTAAATATATGAATATATTTCTGACATTGAGGTGTTCCAGAAGATGATAAAGAAATGATAGCAGCTCCAGAAATACCAACTGATTTTAATCTACTACAGTAAGTAAATTATATTCTGATAATTTTTAAATACTTGTTTATTCCACAAAATGGGGAATGCATTAACTTCAGTTAAATTTCCTTCTGCTCGAGAAGATCTAATATATAAAATAGCTTTTATGCTTTGCAAGAGTTTATATCAGNANCNNNNNNNNNNCNGN
>Seq2
NNNANNNNNGNNNNGTATGANGTTTTGGGGAACATCTTAATTACTTATAATGCTAATATGAAGTTTTGTAATGAGTTAACCAAGCCTTTCTTTTAGAAAATATGGCAAAAATTAGAAACTCAATATAAATTTCTAAGGAAGGGTTTTAATTCTTATCTTTCTGTCACAGGGAGTCAGAAACACATTTTTCTTCTGACACAGATTTTGAAGATATCGAAGGAAAAAACCAAAAGCAAGGCAAAGGCAAAGTATGTATCAAATATTTGACTTTATTTTGTTTCCTAAGATCTCACACACACACAGATTTAAGTTATGTCTCAGATAGTTTTATCTTTTAAAAATGGCTTTTTAAGGGGGTGGGAGCTGATTGGTATGGTAANCAN
>Seq3
NNNNNGNNNNNNNNNTNNNTNNNTNNNAAGTGGATGGAATTCTTTAGGGCAAGTTTAAGCATGTTATGTACCCTATCAGCTACTTCTACTGTAGCTGTGTTTTGAACTCTCAAGGATAGTGATATAACTTAACCACCTCGTATTTTTTATGCAGACTTGTAAAAAAGGCAAAAAGGGCCCAGCAGAAAAGGGCAAAGGTGGAAATGGAGGAGGAAAACCTCCTTCTGGTCCAAACCGAATGAATGGTCATCACCAACAGAATGGAGTGGAAAACATGATGTTGTTTGAAGTTGTTAAAATGGGCAAGAGTGCTATGCAGGTAAGATTTATGTTGTTCTTCCCAGTTCATTTGTACATTTTAAACTTTAATGAGTTATATAGAGTGTAGCTCTGNNNNNNNNNNTTGCAA
>Seq4
NNNNNNNCNCNNNNNGNGNNNNCNAAGTGACTATTTGAGAGCTGCTGATTTCAAAATAAATATATCTTACCTTTACAGCCTGAACACTGAATAAAAAAGTTGATAAGGTCAAGAAGTGCTATATCTCGGTCATGCTTGTATGATTCTATCCAATCATCTACCACCGACTACAGCAGAGGGAAAAAAATAAAATCATTAGCTTCTTCTAATTTTCTCAAAATCAATTAAGTCTGATAAAGTCATAAAATTCAAGATTATATAGTATCACATTACTTTAATATAAATACTTATACACTGAAATTTAAAGTTCAATTTTAACAATAATAAAATAGAATCGAATTCAGTAAAACAATTATCTGATAACACAAAATGACCTATCAATCTTCTATTTATTTTGCATTGAAAAGAATGTGGNNN
>Seq5
NNNNNNNNNAANNNNNNNNNNNNNNNNNNNTNNNNANNNNNNNNNNNTAAGTTATCAAAACACTTAAGGTAGTAAGTTACCTCATCGAATTCTTCAGTCATTTTTCGAATTATCTCAGAGTTCTGCATATGTCTAAACATTTCTGCTGTGACAACTCCTGAAATTTGCAAATGTCAGAAGTTAATATATGGTGTGATAAAAAAATAAAGAAAACTTCCAAGTAAGTCTCTAACACTAAGAAGTCTATGGTCACACAATAAAAGGCATACTTCTTCAACCATCATCTAATAATCTTTACCATGATACTCTAATCTATAAATAAAGCACAAACAAATGCTATCTATTCTCAGTATGCACAAGAAAACAGCCCCATACTTCTGACAGATATCTTTTTTCCTAACACAATTAACTTTGGCCATTTCTANNNNNNNNNNTTNNNNAAN

OUTPUT:

>Seq1
TCCTTTCCGAATATTTTTGGTGCATTTGTAATAAATGTCATTTNTCTCCTTTTTAAAGGAATTGTCTTAGAAGAAAGAAGGCAAGCCACCATTTTACCCACGTAAATATATGAATATATTTCTGACATTGAGGTGTTCCAGAAGATGATAAAGAAATGATAGCAGCTCCAGAAATACCAACTGATTTTAATCTACTACAGTAAGTAAATTATATTCTGATAATTTTTAAATACTTGTTTATTCCACAAAATGGGGAATGCATTAACTTCAGTTAAATTTCCTTCTGCTCGAGAAGATCTAATATATAAAATAGCTTTTATGCTTTGCAAGAGTTTATATCAGNANC
>Seq2
GTATGANGTTTTGGGGAACATCTTAATTACTTATAATGCTAATATGAAGTTTTGTAATGAGTTAACCAAGCCTTTCTTTTAGAAAATATGGCAAAAATTAGAAACTCAATATAAATTTCTAAGGAAGGGTTTTAATTCTTATCTTTCTGTCACAGGGAGTCAGAAACACATTTTTCTTCTGACACAGATTTTGAAGATATCGAAGGAAAAAACCAAAAGCAAGGCAAAGGCAAAGTATGTATCAAATATTTGACTTTATTTTGTTTCCTAAGATCTCACACACACACAGATTTAAGTTATGTCTCAGATAGTTTTATCTTTTAAAAATGGCTTTTTAAGGGGGTGGGAGCTGATTGGTATGGTAANCA
>Seq3
AAGTGGATGGAATTCTTTAGGGCAAGTTTAAGCATGTTATGTACCCTATCAGCTACTTCTACTGTAGCTGTGTTTTGAACTCTCAAGGATAGTGATATAACTTAACCACCTCGTATTTTTTATGCAGACTTGTAAAAAAGGCAAAAAGGGCCCAGCAGAAAAGGGCAAAGGTGGAAATGGAGGAGGAAAACCTCCTTCTGGTCCAAACCGAATGAATGGTCATCACCAACAGAATGGAGTGGAAAACATGATGTTGTTTGAAGTTGTTAAAATGGGCAAGAGTGCTATGCAGGTAAGATTTATGTTGTTCTTCCCAGTTCATTTGTACATTTTAAACTTTAATGAGTTATATAGAGTGTAGCTCTG
>Seq4
AAGTGACTATTTGAGAGCTGCTGATTTCAAAATAAATATATCTTACCTTTACAGCCTGAACACTGAATAAAAAAGTTGATAAGGTCAAGAAGTGCTATATCTCGGTCATGCTTGTATGATTCTATCCAATCATCTACCACCGACTACAGCAGAGGGAAAAAAATAAAATCATTAGCTTCTTCTAATTTTCTCAAAATCAATTAAGTCTGATAAAGTCATAAAATTCAAGATTATATAGTATCACATTACTTTAATATAAATACTTATACACTGAAATTTAAAGTTCAATTTTAACAATAATAAAATAGAATCGAATTCAGTAAAACAATTATCTGATAACACAAAATGACCTATCAATCTTCTATTTATTTTGCATTGAAAAGAATGTGG
>Seq5
TAAGTTATCAAAACACTTAAGGTAGTAAGTTACCTCATCGAATTCTTCAGTCATTTTTCGAATTATCTCAGAGTTCTGCATATGTCTAAACATTTCTGCTGTGACAACTCCTGAAATTTGCAAATGTCAGAAGTTAATATATGGTGTGATAAAAAAATAAAGAAAACTTCCAAGTAAGTCTCTAACACTAAGAAGTCTATGGTCACACAATAAAAGGCATACTTCTTCAACCATCATCTAATAATCTTTACCATGATACTCTAATCTATAAATAAAGCACAAACAAATGCTATCTATTCTCAGTATGCACAAGAAAACAGCCCCATACTTCTGACAGATATCTTTTTTCCTAACACAATTAACTTTGGCCATTTCTA

Tags: nnnn, trim, sequencing, sanger

Just to clarify your post: what you want is to trim back from both the 5' and 3' ends to the point where no more N's are visible within a certain distance?

Reply written 5.5 years ago by Istvan Albert

yes...that is correct.

Reply written 5.5 years ago by merajazizmeraj
5.5 years ago, Devon Ryan (Freiburg, Germany) wrote:

Here's an awk solution (because why not). It slides a window of 5 bases in from each end and trims until it reaches a window that contains no N. You can modify this at will, of course. Just change foo.fa to whatever your file is called and redirect the output to a new file.

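awk '{
# Assumes each record is a header line followed by one sequence line.
header=$0;
getline;
# Slide a 5-base window in from the start until it contains no N.
for(five_prime=1;five_prime<length($1)-5;five_prime++) {
    s=substr($1,five_prime,5);
    if(index(s,"N")==0) break;
}
# Do the same from the end, stopping at the first N-free window.
for(three_prime=length($1)-4;three_prime>five_prime;three_prime--) {
    s=substr($1,three_prime,5);
    if(index(s,"N")==0) break;
}
# Print the header, then the bases from the first clean window to the end of the last one.
printf("%s\n%s\n",header,substr($1,five_prime,three_prime-five_prime+5));
}' foo.fa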

 

Edit: Fixed an off-by-one error.
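
For anyone who prefers Python, the same window idea can be sketched roughly as below. This is an illustrative port, not Devon's code: the names `read_fasta` and `trim_ends` and the `window` parameter are made up for this example. Unlike the awk one-liner, it joins sequences that wrap over several lines and returns an empty sequence when no N-free window exists.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_ends(seq, window=5):
    """Keep the span from the first to the last N-free window of `window` bases."""
    clean = [i for i in range(len(seq) - window + 1)
             if "N" not in seq[i:i + window].upper()]
    if not clean:
        # No window without an N: nothing worth keeping.
        return ""
    return seq[clean[0]:clean[-1] + window]


# Small in-memory demo; the record is invented for illustration.
demo = io.StringIO(">demo\nNNANNACGTACGT\nNACGTTNNANN\n")
for header, seq in read_fasta(demo):
    print(header)
    print(trim_ends(seq))
```

To run it on a real file, one option is to replace `demo` with `sys.stdin` (after `import sys`) and call it as `python trim_ends.py < input.fasta > trimmed.fasta`, where the script name is hypothetical.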

5.5 years ago, xb (Chapel Hill, NC, USA) wrote:

How about this,

FASTA/Q Trimmer

 


Thanks. I did look at this and am using it in Galaxy, but it is not trimming the beginning and end of each sequence. It just removes sequences that have lots of Ns in them.

Reply written 5.5 years ago by merajazizmeraj
5.5 years ago, Istvan Albert (University Park, USA) wrote:

A more generic solution for your problem could be to find the longest substring that is bounded by Ns, that is, the longest stretch that contains no N at all. A simple Python script like the one below could do that:

import sys

for line in sys.stdin:
    line = line.strip()
    if line.startswith(">"):
        # Header line: print it unchanged.
        print(line)
        continue
    # Split on N and keep the longest N-free piece.
    pieces = line.split("N")
    sizes = sorted(((len(p), p) for p in pieces), reverse=True)
    longest = sizes[0][1]
    print(longest)

Run it with python trim.py < input.fasta


Won't this split the string if there are a few N's in the middle of the sequence? I would like to trim the ends as much as possible and tolerate some N's in the middle. Thanks.

Reply written 5.5 years ago by merajazizmeraj

Correct. Like I said, this will give you the longest substring that is bounded by Ns.

It is just a different way to think about the problem, and when one does so, one may identify different requirements.

Reply written 5.5 years ago by Istvan Albert
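
A possible middle ground between the two answers, trimming the ends hard while tolerating a few Ns inside, is to allow a small number of Ns per window. The sketch below is only an illustration under stated assumptions, not something proposed in the thread: `trim_tolerant`, `window` and `max_n` are hypothetical names, and the default values are arbitrary.

```python
def trim_tolerant(seq, window=10, max_n=1):
    """Trim both ends until a window of `window` bases holds at most `max_n` Ns.

    Any N left at the very edge of the kept region is stripped as well.
    """
    upper = seq.upper()
    ok = [i for i in range(len(upper) - window + 1)
          if upper[i:i + window].count("N") <= max_n]
    if not ok:
        # No acceptable window: treat the whole read as unusable.
        return ""
    kept = seq[ok[0]:ok[-1] + window]
    return kept.strip("Nn")


# Invented example read with noisy ends and a single N in the middle.
read = "NNNNANNNACGTNACGTACGTACGTNNNNANNN"
print(trim_tolerant(read))             # keeps the internal N
print(max(read.split("N"), key=len))   # longest N-free piece, for comparison
```

With `max_n=0` and `window=5` this behaves much like the 5-base window approach above, while larger values of `max_n` accept noisier ends.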