How to generate reverse complement for the fasta sequences specified in the given csv file?
1
0
Entering edit mode
3.6 years ago
Kumar ▴ 120

I have a list.csv file contains series of coordinates and the concerned coordinates are belongs to different fasta files namely ocean.fasta, lake.fasta, river.fasta which are present in a target folder. The list.csv and ocean.fasta, lake.fasta, river.fasta files are shown below,

.

/target/
 list.csv
 Contig3,15,7
 Contig2,5,10
 Xantho1,12,3

 ocean.fasta 
>Contig2 contig1 Bacillus 985, ocean [298]
ACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCC
>Contig85 Bacillus 956, ocean [895]
ATGCNNNGCTAT

lake.fasta
>Xantho1 [Pseudomonas] cissicola strain CCUG 18839 contig_0000001
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAG
>Contig8 [Pseudomonas] cissicola strain CCUG 18839 contig_0000008
ATGCTTAGCTGATGC
>Contig20 [Pseudomonas] cissicola strain CCUG 18839 contig_0000020
ATGCTTAGCTGATGCTAGTA

river.fasta
>Contig3 8954 e.coli [856]
GCTGCGGCGCTGATCCTGGCGGCCCGCGCCGAG
>Contig8 8954 e.coli [859] 
TAGTGCGTATAT

Contig3,15,7 and Xantho1,12,3 are not in right order, I mean the $2>$3, therefore I need to order those coordinates as $2<$3. Further I need to reverse complement those sequences extracted from Xantho1,3,12 and Contig3,7,15. In addition to that, I need to save those extracted sequences as new_ocean.fasta, new_lake.fasta, new_river.fast in a fresh folder namely target_sequences. The expected output as follows,

./target/target_sequences

new_river.fasta
>Contig3    
GATCAGCGC

new_ocean.fasta 
>Contig2
AGCGCT

 new_lake.fasta
 >Xantho1
 CTTGATGGCC

I have used following script but end up with an error,

for file in *.fasta
do
fastaexplode "$file" &&
awk -F[:-] '{ 
                if($2>$3){
                    start=$3-1; 
                    len=$2-start" -"
                }
                else{
                    start=$2-1; 
                    len=$3-start
                } 
                print $1,start,len}' list.csv &&
tmpFile=$(mktemp); 
    > subseqs.fa
    awk -F[:-] '{ 
                if($2>$3){
                    start=$3-1; 
                    len=$2-start" -"
                }
                else{
                    start=$2-1; 
                    len=$3-start
                } 
                print $1,start,len}' list.csv |
    while read cont start len rev; do 
        fastasubseq "$cont".fa $start $len > $tmpFile; 
        if [[ -n $rev ]]; then 
            fastarevcomp "$tmpFile" >> subseqs.fa; 
        else 
            cat "$tmpFile" >> subseqs.fa; 
    fi && cp subseqs.fa target_sequences/"new_${file}" 
done

Please help me to do the same.

python perl bash shell fasta • 2.0k views
ADD COMMENT
0
Entering edit mode

Did you write the awk script from scratch? At what stage of writing the script did it stop working and give you the error?

ADD REPLY
0
Entering edit mode

@RamRS It shows error at last stage. It extracts the sequence and reverse complementing where ever is required, but failed to create new_file. I know that I messed up the script while adopting my purpose. if it possible could you please help me to correct it.

ADD REPLY
2
Entering edit mode
3.6 years ago
JC 13k

Perl:

#!/usr/bin/perl

# getFileSeq.pl

$ARGV[1] or die "use getFileSeq.pl <CSV> <FASTA> ... <FASTA_n>\n";

my $csv_file = shift @ARGV;
my @file_list = @ARGV;

my %seqs = ();
warn "reading CSV file $csv_file\n";
open (my $csv, "<", "$csv_file") or die "cannot read $csv_file\n";
while (<$csv>) {
    chomp;
    my ($id, $ini, $end) = split (/,/, $_);
    $seqs{$id}{'ini'} = $ini;
    $seqs{$id}{'end'} = $end;
}
close $csv;

foreach my $fasta (@file_list) {
    warn "processing Fasta $fasta\n";
    open (my $new, ">", "new_$fasta") or die "cannot write new_$fasta\n";
    open (my $src, "<",     "$fasta") or die "cannot read $fasta";
    $/="\n>";
    while (<$src>) {
        s/>//g;
        my ($name, @seq) = split(/\n/, $_);
        if ($name =~ /^(\w+)/) {
            my $id = $1;
            if (defined $seqs{$id}) {
                my $seq = join "", @seq;
                my $ini = $seqs{$id}{'ini'};
                my $end = $seqs{$id}{'end'};
                my $new_seq = '';
                if ($ini > $end) {
                    $seq = reverse $seq;
                    $seq =~ tr/ACGTacgt/TGCAtgca/;
                    $new_seq = substr($seq, $end-1, $ini-$end);
                }
                else {
                    $new_seq = substr($seq, $ini-1, $end-$ini);
                }
                print $new ">$id\n$new_seq\n";
            }
        }
    }
    close $new;
    close $src;
}

Testing it (btw, check your reverse complement):

$ perl getFileSeq.pl list.csv lake.fasta ocean.fasta river.fasta
reading CSV file list.csv
processing Fasta lake.fasta
processing Fasta ocean.fasta
processing Fasta river.fasta
$ cat new_ocean.fasta
>Contig2
AGCGC
$ cat new_river.fasta
>Contig3
GCGGGCCG
$ cat new_lake.fasta
>Xantho1
TGACGACCT
ADD COMMENT
0
Entering edit mode

@JC Thank you for your help. Your script extracting the ordered coordinates perfectly, but in terms of reverse complement its doing inverse manner. Instead of extracting the sequence from 5' to 3' direction, its extracting from 3' to 5' direction and performing reverse complement for the same. Is it possible to correct the same.

ADD REPLY
1
Entering edit mode

the script is doing reverse complement for the sequence, what do you want? reverse or complement?

ADD REPLY
0
Entering edit mode

Sorry for not making you @JC understand properly. I need to reverse complement the right ordered corresponding coordinates only. As I mentioned in my question Contig2,5,10 are in right order, so the sequences to be extracted from 5' to 3' direction and saved as it is without making reverse complement. But in terms of Contig3,15,7 and Xantho1,12,3, first the coordinates should be changed/corrected as Contig3,7,15 and Xantho1,3,12 and then the sequences to be extracted from 5' to 3' direction further the extracted sequences to be reverse complemented and saved in new_file. The same I have mentioned in my expected output file.

ADD REPLY
0
Entering edit mode

First the sequences to be extracted as GCGCTGATC from Contig3 from river.fasta for the corrected coordinates Contig3,7,15 (after correction/ordering) then the extracted sequences to be reverse complemented as GATCAGCGC. Same way it should be for Xantho1,3,12. If I failed to make it clear, please let me know.

ADD REPLY
1
Entering edit mode

if you change this

            if ($ini > $end) {
                $seq = reverse $seq;
                $seq =~ tr/ACGTacgt/TGCAtgca/;
                $new_seq = substr($seq, $end-1, $ini-$end);
            }

to:

            if ($ini > $end) {
                $new_seq = substr($seq, $end-1, $ini-$end);
                $new_seq = reverse $new_seq;
                $new_seq =~ tr/ACGTacgt/TGCAtgca/;
            }
ADD REPLY
0
Entering edit mode

@JC The revised script is extracting the sequences with one less base for Contig3,15,7 and Xantho1,12,3 and reverse complementing the extracted sequences, saving as new_. files. I have fixed this small issue by modifying your script like this $new_seq = substr($seq, $end-1, $ini-$end+1);

But, its doing nothing for Contig2,5,10 (which is supposed to be extracted as it is without reverse complementing), the script printing Contig2 only without any sequences in new_ocean.fasta file. Please help me to fix this issue.

ADD REPLY
1
Entering edit mode

did you remove this?

               else {
                    $new_seq = substr($seq, $ini-1, $end-$ini);
                }

that is the part that works when the sequence doesn't need any change.

Please share how your code looks now.

ADD REPLY
0
Entering edit mode

@JC As you mentioned I have deleted this else { $new_seq = substr($seq, $ini-1, $end-$ini); } while changing the script. Now it works perfectly fine for my test data, which is shown in my question.

However, when I use this script for my real data, I am facing another problem. Whichever fasta header is having . it fails to extract the sequence. I mean, the fasta header is like this >18541.3 the script doing nothing, which print empty output file. Unfortunately, my real fasta files contain . as part of the header. Once again sorry to mention this, is it possible to fix this issue.

ADD REPLY
0
Entering edit mode

@ JC. My revised code is shown below.

#!/usr/bin/perl

    # getFileSeq.pl

    $ARGV[1] or die "use getFileSeq.pl <CSV> <FASTA> ... <FASTA_n>\n";

    my $csv_file = shift @ARGV;
    my @file_list = @ARGV;

    my %seqs = ();
    warn "reading CSV file $csv_file\n";
    open (my $csv, "<", "$csv_file") or die "cannot read $csv_file\n";
    while (<$csv>) {
        chomp;
        my ($id, $ini, $end) = split (/,/, $_);
        $seqs{$id}{'ini'} = $ini;
        $seqs{$id}{'end'} = $end;
    }
    close $csv;

    foreach my $fasta (@file_list) {
        warn "processing Fasta $fasta\n";
        open (my $new, ">", "new_$fasta") or die "cannot write new_$fasta\n";
        open (my $src, "<",     "$fasta") or die "cannot read $fasta";
        $/="\n>";
        while (<$src>) {
            s/>//g;
            my ($name, @seq) = split(/\n/, $_);
            if ($name =~ /^(\w+)/) {
                my $id = $1;
                if (defined $seqs{$id}) {
                    my $seq = join "", @seq;
                    my $ini = $seqs{$id}{'ini'};
                    my $end = $seqs{$id}{'end'};
                    my $new_seq = '';
                    if ($ini > $end) {
                    $new_seq = substr($seq, $end-1, $ini-$end+1);
                    $new_seq = reverse $new_seq;
                    $new_seq =~ tr/ACGTacgt/TGCAtgca/;
                }
                    else {
                        $new_seq = substr($seq, $ini-1, $end-$ini);
                    }
                    print $new ">$id\n$new_seq\n";
                }
            }
        }
        close $new;
        close $src;
    }
ADD REPLY
1
Entering edit mode

change

        if ($name =~ /^(\w+)/) {

to

        if ($name =~ /^(.+)\s/) {
ADD REPLY
0
Entering edit mode

@JC now my script is given below by adopting changes as you mentioned

#!/usr/bin/perl

# getFileSeq.pl

$ARGV[1] or die "use getFileSeq.pl <CSV> <FASTA> ... <FASTA_n>\n";

my $csv_file = shift @ARGV;
my @file_list = @ARGV;

my %seqs = ();
warn "reading CSV file $csv_file\n";
open (my $csv, "<", "$csv_file") or die "cannot read $csv_file\n";
while (<$csv>) {
    chomp;
    my ($id, $ini, $end) = split (/,/, $_);
    $seqs{$id}{'ini'} = $ini;
    $seqs{$id}{'end'} = $end;
}
close $csv;

foreach my $fasta (@file_list) {
    warn "processing Fasta $fasta\n";
    open (my $new, ">", "new_$fasta") or die "cannot write new_$fasta\n";
    open (my $src, "<",     "$fasta") or die "cannot read $fasta";
    $/="\n>";
    while (<$src>) {
        s/>//g;
        my ($name, @seq) = split(/\n/, $_);
        if ($name =~ /^(.+)\s/) {
            my $id = $1;
            if (defined $seqs{$id}) {
                my $seq = join "", @seq;
                my $ini = $seqs{$id}{'ini'};
                my $end = $seqs{$id}{'end'};
                my $new_seq = '';
                if ($ini > $end) {
                $new_seq = substr($seq, $end-1, $ini-$end+1);
                $new_seq = reverse $new_seq;
                $new_seq =~ tr/ACGTacgt/TGCAtgca/;
            }
                else {
                    $new_seq = substr($seq, $ini-1, $end-$ini);
                }
                print $new ">$id\n$new_seq\n";
            }
        }
    }
    close $new;
    close $src;
}

However, the revised script also printing empty file without any sequence. There is no improvement.

ADD REPLY
0
Entering edit mode
cor.csv
3.1,15,7
2.1,5,10
1.1,12,3

lake.fasta
>1.1 [Pseudomonas] cissicola strain CCUG 18839 contig_0000001
AAGGCCATCAAGGACGT
GGATGAGGTCGTCAAG
>8.1 [Pseudomonas] cissicola strain CCUG 18839 contig_0000008
ATGCTTAGCTGATGC
>20.1 [Pseudomonas] cissicola strain CCUG 18839 contig_0000020
ATGCTTAGCTGATGCTAGTA

ocean.fasta
>2.1 contig1 Bacillus 985, ocean [298]
ACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCC
>85.1 Bacillus 956, ocean [895]
ATGCNNNGCTAT

    river.fasta
    >3.1 8954 e.coli [856]
    GCTGCGGCGCTGATCCTGGCGGCCCGCGCCGAG
    >8.1 8954 e.coli [859] 
    TAGTGCGTATAT

Just for an example I am providing the above test data.

ADD REPLY
1
Entering edit mode

The line:

    if ($name =~ /^(.+)\s/) {

should be:

    if ($name =~ /^(.+?)\s/) {

Testing:

$ perl getFileSeq.pl cor.csv *fasta
reading CSV file cor.csv
processing Fasta lake.fasta
processing Fasta ocean.fasta
processing Fasta river.fasta
$ more new*
::::::::::::::
new_lake.fasta
::::::::::::::
>1.1
CTTGATGGCC
::::::::::::::
new_ocean.fasta
::::::::::::::
>2.1
AGCGC
::::::::::::::
new_river.fasta
::::::::::::::
>3.1
GATCAGCGC
ADD REPLY
0
Entering edit mode

@JC Thank you so much for your time, consideration and help. Now the script works perfectly. Once again thank you.

ADD REPLY

Login before adding your answer.

Traffic: 1511 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6