Question: working with Sanger fasta + qual + clip files
0
gravatar for sullis02
6 months ago by
sullis020
sullis020 wrote:

I have an old Sanger fasta file and its associated qual file, as well as a 'clip' file that shows where to trim low-quality sequence for each fasta sequence. e.g. for the first record in the fasta file here are the corresponding data in each file:

fasta:

>gnl|ti|713904976 1047095293492
CCCCAGAAGCCCCGCGCTTTTTCCATTTCACGACGTTGTAAAACGACGGCCAGTGAATTGTAATACGACT
CACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCACGTACGCTTCATCATCGCCCTTTTCAATTT
CTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTC
AACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATT
GATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAG
GAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTC
GGTTTTGAACCTCGGACCAAGTGAATCC

qual:

>gnl|ti|713904976 1047095293492
  9  9  9  8  6  9  6  7 10 10 10 10  9  9  7  9  7 10 10  9
  9 10  9  9  9  9  9  9 12 12 10  9  9  9  9 12 21 12 13 17
 20 21 20 21 28 25 24 30 21 25 21 21 23 23 31 28 25 28 23 23
 29 23 44 33 33 29 34 34 26 29 32 32 32 32 40 40 38 29 45 35
 28 23 29 29 22 22 22 26 32 29 32 29 29 29 30 30 33 38 33 25
 25 35 20 20 22 20 20 29 29 27 24 22 30 24 27 24 34 25 34 38
 29 31 29 23 24 24 23 32 32 40 39 39 27 19 21 14 22 22 22 45
 40 26 44 38 28 28 26 40 32 34 36 40 36 40 39 27 21 23 21 39
 39 32 28 32 40 26 32 26 25 35 34 32 40 40 28 36 40 38 39 39
 36 32 29 28 28 28 28 29 45 40 36 40 34 29 32 24 18 18 23 27
 32 36 47 47 38 33 25 30 24 30 20 19 24 24 27 28 27 26 18 29
 25 23 27 27 21 15 22 23 30 29 35 38 40 31 34 24 21 26 23 18
 17 19 19 22 28 24 24 26 36 36 28 40 32 40 28 23 32 22 24 22
 22 22 22 29 29 26 39 40 29 32 24 28 29 29 40 36 32 28 29 24
 24 24 20 28 31 40 38 38 34 40 38 27 39 24 18 22 22 24 16 26
 23 39 40 28 28 49 49 32 27 27 23 28 29 27 44 34 29 25 20 27
 27 29 23 23 39 32 49 40 28 25 28 39 28 28 23 39 32 32 40 28
 23 27 27 23 27 36 39 17 17 26 49 49 32 39 39 39 28 28 19 18
 21 14 13 18 11 18 27 25 23 17 23 25 27 23 24 17 23 23 23 21
 39 27 25 32 32 36 32 18 16 12 23 19 20 29 26 28 21 26 26 26
 27 29 23 16 12 25 18 23 24 24 18 23 27 23 28 27 23 13 15 13
 13 14 13 20 11 15 25 23 27 24 25 27 24 19 23 27 23 27 23 20
 18 27 27 27 23 10 15 14

clip:

TI      CLIP_LEFT       CLIP_RIGHT
713904976       112     445

I'd like to generate a fastq file consisting of the trimmed fasta sequences + their qual scores. (Or, generate the fastq file of the whole fasta records first, then trim it using the clip data as instructions. )

What tools can do that?

fastq trimming sanger • 272 views
ADD COMMENTlink modified 6 months ago by JC8.4k • written 6 months ago by sullis020
0
gravatar for JC
6 months ago by
JC8.4k
Mexico
JC8.4k wrote:

A simple script can do the trick:

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use convert.pl FASTA QUAL CLIP > FASTQ\n";

my $phred = 33; # Define Phred scale, it can be 33 or 64

my $seq_file  = shift @ARGV;
my $qual_file = shift @ARGV;
my $clip_file = shift @ARGV;

my ($id, $seq, $qual, $c_ini, $c_end);

# Read fasta sequence
open (my $fh, "<", $seq_file) or die "cannot read $seq_file\n";
while (<$fh>) {
    chomp;
    if (/>(.+)/) {
       $id = $1;
    }
    else {
       $seq .= $_;
    }
}
close $fh;

# Read quals
open (my $qh, "<", $qual_file) or die "cannot read $qual_file\n";
while (<$qh>) {
    chomp;
    next if (/>/);
    s/^\s+//; # remove spaces at the begin
    s/\s+$//; # remove spaces at the end
    my @values = split (/\s+/, $_);
    foreach my $v (@values) {
        $qual .= chr($v + $phred);
    }
}
close $qh;

# check is sizes are equal
die "Sequence and quality lenghts are not equal\n" if (length $seq != length $qual);

# Read clip coordinates
open (my $ch, "<", $clip_file) or die "cannot read $clip_file\n";
while (<$ch>) {
    chomp;
    next if (/^TI/); # skip header
    my @elem = split (/\s+/, $_);
    #verify if the id is the same for the sequence
    if ($id =~ /$elem[0]/) {
        $c_ini = $elem[1];
        $c_end = $elem[2];
        # clip sequences
        $seq  = substr($seq,  $c_ini - 1, $c_end - $c_ini);
        $qual = substr($qual, $c_ini - 1, $c_end - $c_ini);
    }
    else {
        warn "clip id doesn't match sequence id, printing sequence without modifications\n";
    }
    last; #only one line is allowed to modify the sequence
}
close $ch;

print "\@$id\n$seq\n+\n$qual\n";

Example:

$ perl convert.pl seq.fasta seq.qual seq.clip
@gnl|ti|713904976 1047095293492
GTACGCTTCATCATCGCCCTTTTCAATTTCTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTCAACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATTGATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAGGAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTCGGTTTTGAACCTCGGACCAAGTGA
+
7?9<9C:CG>@>8998AAIHH<46/777NI;MG==;IACEIEIH<686HHA=AI;A;:DCAII=EIGHHEA>====>NIEIC>A9338<AEPPGB:?9?5499<=<;3>:8<<6078?>DGI@C96;832447=99;EE=IAI=8A797777>>;HI>A9=>>IEA=>9995=@IGGCIG<H937791;8HI==RRA<<8=><MC>:5<<>88HARI=:=H==8HAAI=8<<8<EH22;RRAHHH==436/.3,3<:828:<8928886H<:AAEA31-845>;=6;;;<>81-:389938<8=<8.0../.5,0:8<9:<948<8<853<<<
ADD COMMENTlink written 6 months ago by JC8.4k

Thank you very much!!

ADD REPLYlink written 5 months ago by sullis020

Oops, wrote to soon. Have you tried your script with input files that contain more than a single fasta/clip/qual record? With a single record as input (as in your example) it works fine. With normal fasta /qual/clip files (which contain numerous records) it doesn't work it reports ' clip id doesn't match sequence id, printing sequence without modifications' and then concatenates all the fasta and qual text into giant globs.

e.g the input fasta here has two records, as do the qual and clip files

1-2.fasta
>gnl|ti|713904976 1047095293492
CCCCAGAAGCCCCGCGCTT...etc....GTGAATCC
>gnl|ti|713904977 1047095293493
TCGGGAGGCCAGGGGTTTT...etc....ATATCTC

1-2.qual
>gnl|ti|713904976 1047095293492
  9  9  9  8  6  9  6  7 10 10 10 10  9  9  7  9  7 10 10  9  etc...

>gnl|ti|713904977 1047095293493
  9  9 10 10  9  9  9  9 10 12  9 10 11 20 17  7 10 15  9 10 etc....

1-2.clip
TI      CLIP_LEFT       CLIP_RIGHT
713904976       112     445
713904977       108     805

   $ perl convert.pl  001_1-2.fasta 001_1-2.qual 001_1-2.clip
    clip id doesn't match sequence id, printing sequence without modifications
    @713904977 1047095293493
    CCCCAGAAGCCCCGCGCTTTTTCCATTTCACGACGTTGTAAAACGACGGCCAGTGAATTGTAATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCACGTACGCTTCATCATCGCCCTTTTCAATTTCTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTCAACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATTGATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAGGAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTCGGTTTTGAACCTCGGACCAAGTGAATCCTCGGGAGGCCAGGGGTTTTTCCAGTCACGACGTTGTAAACGACGGCCAGTGAATTGTAATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCACGTACGCTTCATCATCGCCCTTTTCAATTTCTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTCAACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATTGATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAGGAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTCGGTTTTGAACCTCGGACCAAGTGAATCCAAACTGAATTTCACTACTGCTCCAACAAATGCACCAGGAGGACTTTCTGGAGTTCACATATTCCTAATTGTACTTGCAGTGATTGTTGTAGTAGTTGTTTGCGTAGTAGTTGTATGCTTAGTTCACAGATCCAACCAATCAAAGGCGGAAATGAACAAAAAGTTCAACTCAGACAAAGGTGATGAACTATCTGATGTAGAGGACAAAGAAACTCAACAGACTGGGTACATGTCTGGAACTCCTTCTCCTCAAAGCTCTGCTGCAACTACTACTACTGGTCATAACTCTGCAACTTCTACTGTGAATAGCACGGGAGTTGTTCCGGCTGATCCAAACTTGAGGGTGTGGAATCCGTACGAGATTTCGCCGACTACAACAACTAATGAAAAATAGTTTTGTTTATTCATGATATTGTTTCGTTAGTTGCTATTGTTAATTGTTCTATTTATAAATATAAGTTCTGTTTGTTTTATTTTTCTTATCGATGCTTCCTTCCTAAATAGAAATGTTTTTTTTTAGATCATAATATGATCAAAATTAAATTATGATTGACATGCGCAATTGACAAATTTTTCTTGGAAACAGGATGTATTTGTAATTATATCTTTAAAGAGCCTGATGCATGATATCTC
    +
    ***)'*'(++++**(*(++**+******--+****-6-.25656=:9?6:6688@=:=88>8MBB>CC;>AAAAIIG>ND=8>>777;A>A>>>??BGB::D55755>><97?9<9C:CG>@>8998AAIHH<46/777NI;MG==;IACEIEIH<686HHA=AI;A;:DCAII=EIGHHEA>====>NIEIC>A9338<AEPPGB:?9?5499<=<;3>:8<<6078?>DGI@C96;832447=99;EE=IAI=8A797777>>;HI>A9=>>IEA=>9995=@IGGCIG<H937791;8HI==RRA<<8=><MC>:5<<>88HARI=:=H==8HAAI=8<<8<EH22;RRAHHH==436/.3,3<:828:<8928886H<:AAEA31-845>;=6;;;<>81-:389938<8=<8.0../.5,0:8<9:<948<8<853<<<8+0/**++****+-*+,52(+0*+*+**--112=EENNEIENENC>>>355333>;:689>8M>?>=CB<::B@DI>>C>9D:<798<<8EA9MB8>;BB?G:;;<:C=>97??B>I79968<:==8AAAEI:887<5>E?9=;9=>AAA=AHAHA8==HAENAH8<<:<9=8=<<<:5<<IIAIAAAAAINPAA:5999>:4GG@>>;8888CADCEHEII=;;59=EEDGG@CDIAIII=;=@>D>EGIRRRI=E=EEH9C9D=;==;<2:99DCCHEAE=IHH@>9;?<<<=::9;<9<AAHE=EAIIE>;9867=HNNIAAN35859=EA=A==ANEAD>>C<9229A9>C>IA=>H>8H:E<>;88;EI=9;9;;<AA>>CIPPPEHA8<==?;=INHA=@9??=CIIA=8II>>9>@>IEEHICAAHEE?=;;;9=IAI;?=;;9EIHIIINAIAAAHD=EA:><>C;=IHRAA8:><8AA<==9C>RNNGHIAIE>;9>>>HIIAAMIEMIAAI=NI=8=AEHEEH<<51>8<H69<<EHGNH=AH63;18IACIIII<><1<HIAEA=H?9>>>>IAIEII=IIREAA>668;<96;C<>>;<5696:583/<:9IIIMA886469==E=G;@;<53:.835>>8870<<<@;=AHAA=HIEMIEDD9:>CEA=II>DB>DCIEAC>975269C<CCCMMB@@>>=;5/+436BADMII>CCIMMPR@@?C<8HHA=IAIIACCCG>C;8<;;=IE><?7;;98AENNN@<>7799=79;2777?>IBB:==754-/*3.9/.+*+-7>ND>C?C=:EEC9??9=7;;;9=A9?C<CCD;B?87843;756;;;/0+*+*+,3//**-63>771.1**0,1/*+4/*,,3-+034-0**++230++1-2+*,**.**++*)-+*+(*-+,//+*1/533,++*..-*****+6-.+***-++++/1*-*+*+*+.**-+*******,+*,**+(++**+**..,*****+/+*********+**''*'**+*)***+*****++++++***'*****************'**
ADD REPLYlink modified 5 months ago • written 5 months ago by sullis020

The script is intended to work on one sequence per file, to use with a multifasta, the data needs to be saved in memory in a Hash.

ADD REPLYlink written 5 months ago by JC8.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 809 users visited in the last hour