Question: working with Sanger fasta + qual + clip files
0
gravatar for sullis02
4 weeks ago by
sullis020
sullis020 wrote:

I have an old Sanger fasta file and its associated qual file, as well as a 'clip' file that shows where to trim low-quality sequence for each fasta sequence. e.g. for the first record in the fasta file here are the corresponding data in each file:

fasta:

>gnl|ti|713904976 1047095293492
CCCCAGAAGCCCCGCGCTTTTTCCATTTCACGACGTTGTAAAACGACGGCCAGTGAATTGTAATACGACT
CACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCACGTACGCTTCATCATCGCCCTTTTCAATTT
CTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTC
AACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATT
GATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAG
GAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTC
GGTTTTGAACCTCGGACCAAGTGAATCC

qual:

>gnl|ti|713904976 1047095293492
  9  9  9  8  6  9  6  7 10 10 10 10  9  9  7  9  7 10 10  9
  9 10  9  9  9  9  9  9 12 12 10  9  9  9  9 12 21 12 13 17
 20 21 20 21 28 25 24 30 21 25 21 21 23 23 31 28 25 28 23 23
 29 23 44 33 33 29 34 34 26 29 32 32 32 32 40 40 38 29 45 35
 28 23 29 29 22 22 22 26 32 29 32 29 29 29 30 30 33 38 33 25
 25 35 20 20 22 20 20 29 29 27 24 22 30 24 27 24 34 25 34 38
 29 31 29 23 24 24 23 32 32 40 39 39 27 19 21 14 22 22 22 45
 40 26 44 38 28 28 26 40 32 34 36 40 36 40 39 27 21 23 21 39
 39 32 28 32 40 26 32 26 25 35 34 32 40 40 28 36 40 38 39 39
 36 32 29 28 28 28 28 29 45 40 36 40 34 29 32 24 18 18 23 27
 32 36 47 47 38 33 25 30 24 30 20 19 24 24 27 28 27 26 18 29
 25 23 27 27 21 15 22 23 30 29 35 38 40 31 34 24 21 26 23 18
 17 19 19 22 28 24 24 26 36 36 28 40 32 40 28 23 32 22 24 22
 22 22 22 29 29 26 39 40 29 32 24 28 29 29 40 36 32 28 29 24
 24 24 20 28 31 40 38 38 34 40 38 27 39 24 18 22 22 24 16 26
 23 39 40 28 28 49 49 32 27 27 23 28 29 27 44 34 29 25 20 27
 27 29 23 23 39 32 49 40 28 25 28 39 28 28 23 39 32 32 40 28
 23 27 27 23 27 36 39 17 17 26 49 49 32 39 39 39 28 28 19 18
 21 14 13 18 11 18 27 25 23 17 23 25 27 23 24 17 23 23 23 21
 39 27 25 32 32 36 32 18 16 12 23 19 20 29 26 28 21 26 26 26
 27 29 23 16 12 25 18 23 24 24 18 23 27 23 28 27 23 13 15 13
 13 14 13 20 11 15 25 23 27 24 25 27 24 19 23 27 23 27 23 20
 18 27 27 27 23 10 15 14

clip:

TI      CLIP_LEFT       CLIP_RIGHT
713904976       112     445

I'd like to generate a fastq file consisting of the trimmed fasta sequences + their qual scores. (Or, generate the fastq file of the whole fasta records first, then trim it using the clip data as instructions. )

What tools can do that?

fastq trimming sanger • 126 views
ADD COMMENTlink modified 4 weeks ago by JC7.7k • written 4 weeks ago by sullis020
0
gravatar for JC
4 weeks ago by
JC7.7k
Mexico
JC7.7k wrote:

A simple script can do the trick:

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use convert.pl FASTA QUAL CLIP > FASTQ\n";

my $phred = 33; # Define Phred scale, it can be 33 or 64

my $seq_file  = shift @ARGV;
my $qual_file = shift @ARGV;
my $clip_file = shift @ARGV;

my ($id, $seq, $qual, $c_ini, $c_end);

# Read fasta sequence
open (my $fh, "<", $seq_file) or die "cannot read $seq_file\n";
while (<$fh>) {
    chomp;
    if (/>(.+)/) {
       $id = $1;
    }
    else {
       $seq .= $_;
    }
}
close $fh;

# Read quals
open (my $qh, "<", $qual_file) or die "cannot read $qual_file\n";
while (<$qh>) {
    chomp;
    next if (/>/);
    s/^\s+//; # remove spaces at the begin
    s/\s+$//; # remove spaces at the end
    my @values = split (/\s+/, $_);
    foreach my $v (@values) {
        $qual .= chr($v + $phred);
    }
}
close $qh;

# check is sizes are equal
die "Sequence and quality lenghts are not equal\n" if (length $seq != length $qual);

# Read clip coordinates
open (my $ch, "<", $clip_file) or die "cannot read $clip_file\n";
while (<$ch>) {
    chomp;
    next if (/^TI/); # skip header
    my @elem = split (/\s+/, $_);
    #verify if the id is the same for the sequence
    if ($id =~ /$elem[0]/) {
        $c_ini = $elem[1];
        $c_end = $elem[2];
        # clip sequences
        $seq  = substr($seq,  $c_ini - 1, $c_end - $c_ini);
        $qual = substr($qual, $c_ini - 1, $c_end - $c_ini);
    }
    else {
        warn "clip id doesn't match sequence id, printing sequence without modifications\n";
    }
    last; #only one line is allowed to modify the sequence
}
close $ch;

print "\@$id\n$seq\n+\n$qual\n";

Example:

$ perl convert.pl seq.fasta seq.qual seq.clip
@gnl|ti|713904976 1047095293492
GTACGCTTCATCATCGCCCTTTTCAATTTCTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTCAACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATTGATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAGGAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTCGGTTTTGAACCTCGGACCAAGTGA
+
7?9<9C:CG>@>8998AAIHH<46/777NI;MG==;IACEIEIH<686HHA=AI;A;:DCAII=EIGHHEA>====>NIEIC>A9338<AEPPGB:?9?5499<=<;3>:8<<6078?>DGI@C96;832447=99;EE=IAI=8A797777>>;HI>A9=>>IEA=>9995=@IGGCIG<H937791;8HI==RRA<<8=><MC>:5<<>88HARI=:=H==8HAAI=8<<8<EH22;RRAHHH==436/.3,3<:828:<8928886H<:AAEA31-845>;=6;;;<>81-:389938<8=<8.0../.5,0:8<9:<948<8<853<<<
ADD COMMENTlink written 4 weeks ago by JC7.7k

Thank you very much!!

ADD REPLYlink written 2 days ago by sullis020

Oops, wrote to soon. Have you tried your script with input files that contain more than a single fasta/clip/qual record? With a single record as input (as in your example) it works fine. With normal fasta /qual/clip files (which contain numerous records) it doesn't work it reports ' clip id doesn't match sequence id, printing sequence without modifications' and then concatenates all the fasta and qual text into giant globs.

e.g the input fasta here has two records, as do the qual and clip files

1-2.fasta
>gnl|ti|713904976 1047095293492
CCCCAGAAGCCCCGCGCTT...etc....GTGAATCC
>gnl|ti|713904977 1047095293493
TCGGGAGGCCAGGGGTTTT...etc....ATATCTC

1-2.qual
>gnl|ti|713904976 1047095293492
  9  9  9  8  6  9  6  7 10 10 10 10  9  9  7  9  7 10 10  9  etc...

>gnl|ti|713904977 1047095293493
  9  9 10 10  9  9  9  9 10 12  9 10 11 20 17  7 10 15  9 10 etc....

1-2.clip
TI      CLIP_LEFT       CLIP_RIGHT
713904976       112     445
713904977       108     805

   $ perl convert.pl  001_1-2.fasta 001_1-2.qual 001_1-2.clip
    clip id doesn't match sequence id, printing sequence without modifications
    @713904977 1047095293493
    CCCCAGAAGCCCCGCGCTTTTTCCATTTCACGACGTTGTAAAACGACGGCCAGTGAATTGTAATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCACGTACGCTTCATCATCGCCCTTTTCAATTTCTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTCAACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATTGATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAGGAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTCGGTTTTGAACCTCGGACCAAGTGAATCCTCGGGAGGCCAGGGGTTTTTCCAGTCACGACGTTGTAAACGACGGCCAGTGAATTGTAATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCACGTACGCTTCATCATCGCCCTTTTCAATTTCTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTCAACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATTGATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAGGAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTCGGTTTTGAACCTCGGACCAAGTGAATCCAAACTGAATTTCACTACTGCTCCAACAAATGCACCAGGAGGACTTTCTGGAGTTCACATATTCCTAATTGTACTTGCAGTGATTGTTGTAGTAGTTGTTTGCGTAGTAGTTGTATGCTTAGTTCACAGATCCAACCAATCAAAGGCGGAAATGAACAAAAAGTTCAACTCAGACAAAGGTGATGAACTATCTGATGTAGAGGACAAAGAAACTCAACAGACTGGGTACATGTCTGGAACTCCTTCTCCTCAAAGCTCTGCTGCAACTACTACTACTGGTCATAACTCTGCAACTTCTACTGTGAATAGCACGGGAGTTGTTCCGGCTGATCCAAACTTGAGGGTGTGGAATCCGTACGAGATTTCGCCGACTACAACAACTAATGAAAAATAGTTTTGTTTATTCATGATATTGTTTCGTTAGTTGCTATTGTTAATTGTTCTATTTATAAATATAAGTTCTGTTTGTTTTATTTTTCTTATCGATGCTTCCTTCCTAAATAGAAATGTTTTTTTTTAGATCATAATATGATCAAAATTAAATTATGATTGACATGCGCAATTGACAAATTTTTCTTGGAAACAGGATGTATTTGTAATTATATCTTTAAAGAGCCTGATGCATGATATCTC
    +
    ***)'*'(++++**(*(++**+******--+****-6-.25656=:9?6:6688@=:=88>8MBB>CC;>AAAAIIG>ND=8>>777;A>A>>>??BGB::D55755>><97?9<9C:CG>@>8998AAIHH<46/777NI;MG==;IACEIEIH<686HHA=AI;A;:DCAII=EIGHHEA>====>NIEIC>A9338<AEPPGB:?9?5499<=<;3>:8<<6078?>DGI@C96;832447=99;EE=IAI=8A797777>>;HI>A9=>>IEA=>9995=@IGGCIG<H937791;8HI==RRA<<8=><MC>:5<<>88HARI=:=H==8HAAI=8<<8<EH22;RRAHHH==436/.3,3<:828:<8928886H<:AAEA31-845>;=6;;;<>81-:389938<8=<8.0../.5,0:8<9:<948<8<853<<<8+0/**++****+-*+,52(+0*+*+**--112=EENNEIENENC>>>355333>;:689>8M>?>=CB<::B@DI>>C>9D:<798<<8EA9MB8>;BB?G:;;<:C=>97??B>I79968<:==8AAAEI:887<5>E?9=;9=>AAA=AHAHA8==HAENAH8<<:<9=8=<<<:5<<IIAIAAAAAINPAA:5999>:4GG@>>;8888CADCEHEII=;;59=EEDGG@CDIAIII=;=@>D>EGIRRRI=E=EEH9C9D=;==;<2:99DCCHEAE=IHH@>9;?<<<=::9;<9<AAHE=EAIIE>;9867=HNNIAAN35859=EA=A==ANEAD>>C<9229A9>C>IA=>H>8H:E<>;88;EI=9;9;;<AA>>CIPPPEHA8<==?;=INHA=@9??=CIIA=8II>>9>@>IEEHICAAHEE?=;;;9=IAI;?=;;9EIHIIINAIAAAHD=EA:><>C;=IHRAA8:><8AA<==9C>RNNGHIAIE>;9>>>HIIAAMIEMIAAI=NI=8=AEHEEH<<51>8<H69<<EHGNH=AH63;18IACIIII<><1<HIAEA=H?9>>>>IAIEII=IIREAA>668;<96;C<>>;<5696:583/<:9IIIMA886469==E=G;@;<53:.835>>8870<<<@;=AHAA=HIEMIEDD9:>CEA=II>DB>DCIEAC>975269C<CCCMMB@@>>=;5/+436BADMII>CCIMMPR@@?C<8HHA=IAIIACCCG>C;8<;;=IE><?7;;98AENNN@<>7799=79;2777?>IBB:==754-/*3.9/.+*+-7>ND>C?C=:EEC9??9=7;;;9=A9?C<CCD;B?87843;756;;;/0+*+*+,3//**-63>771.1**0,1/*+4/*,,3-+034-0**++230++1-2+*,**.**++*)-+*+(*-+,//+*1/533,++*..-*****+6-.+***-++++/1*-*+*+*+.**-+*******,+*,**+(++**+**..,*****+/+*********+**''*'**+*)***+*****++++++***'*****************'**
ADD REPLYlink modified 1 day ago • written 1 day ago by sullis020
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1661 users visited in the last hour