Question: how to subset sequences based on IDs, but order shall remain preserved?
0
gravatar for majeedaasim
7 months ago by
majeedaasim20
United States
majeedaasim20 wrote:

I have a fasta file and an ID list file. I try to subset fasta sequences from the fasta file based on ID list using

seqtk subseq <fastafile> ID_list > out_file

However in the output file the order of the sequences is not preserved. I need the sequences in the outfile in the same order as in the ID list . Thanks

seqtk • 403 views
ADD COMMENTlink modified 7 months ago by genomax55k • written 7 months ago by majeedaasim20
0
gravatar for prakash
7 months ago by
prakash480
prakash480 wrote:

I hope this perl script will solve your purpose Usage : perl script.pl Id_file file.fasta Output.fasta

#!/usr/bin/perl

@file1 = `cat $ARGV[0]`;
@file2 = `cat $ARGV[1]`;
%hash;
$id ;
foreach $fasta(@file2){
        chomp($fasta);
        if($fasta =~ '>'){
                $id = $fasta;}
        if($fasta !~ '>'){
                push @{$hash{$id}},$fasta;      }
}
open(OUT,">$ARGV[2]") or die;
foreach $id(@file1){
        chomp($id);
        $id =~s/^/>/;
        #print "$id\n";
        @seq;
        print  OUT "$id\n";
        foreach $key(keys %hash){
                if($id eq $key){
                        @seq = @{$hash{$key}};}}
                for($i =0; $i<scalar(@seq);$i++){
                        print OUT "$seq[$i]\n";}
}
ADD COMMENTlink modified 7 months ago • written 7 months ago by prakash480

Thanks prakash but it produces an unusual result It doesn't subset properly

ADD REPLYlink written 7 months ago by majeedaasim20

If I am not wrong you need fasta sequence for selected Id ? right?

ADD REPLYlink written 7 months ago by prakash480

yes I have fasta sequence file and an ID file. THe sequences need to be extracted from the sequence file according to the entries in the ID file but the order of sequences in the outfile shall be same as ID file entries.

ADD REPLYlink written 7 months ago by majeedaasim20

can you please provide few example of Id ?

ADD REPLYlink written 7 months ago by prakash480

sequence file

>abc
atgcggatacagataca
>def
atgcggccaaaatt
>ghi
atggggaattcacaac
>jkl
aggatattatatccg

Now filter according to Ids list like

abc
jkl
ADD REPLYlink modified 7 months ago • written 7 months ago by majeedaasim20

I have edited the script, I hope its working fine now

ADD REPLYlink written 7 months ago by prakash480

I tried it but it produces the same contents as ID file. NO sequences are extracted.

ADD REPLYlink modified 7 months ago • written 7 months ago by majeedaasim20

I think your Id file contains sequence id as >abc , >jkl. So accordingly i have edited the script. If that is the case , then the script should give intended result.

ADD REPLYlink written 7 months ago by prakash480

My ID file entries are like

Tw_Wn_DN127031_c0_g1_i5
Tw_Wn_DN118882_c0_g1_i2
Tw_Wn_DN121684_c0_g1_i2
Tw_Wn_DN127037_c0_g1_i1
Tw_Wn_DN134530_c2_g4_i7
Tw_Wn_DN117797_c0_g1_i1

the script says Died at script.pl line 14.

ADD REPLYlink written 7 months ago by majeedaasim20

The above script should work for input file provided below:

Fasta_file

>EG01
AATTCAGAAAAGTGCCATAAAACAGGCAAATATAGCACCTATCTTCGATTCCCGATTGAGAAAGCTCGAATCATTAAGAA
GATGTAGGTTGGGTTGAGGAACGAGACCCAGCAATTCGATTCCATGTTGGATTAAACAATATCAACCAAGAGTTTGATGT
>EG02
CAGGATAACTCCAATTTTTCTTTTAGAGTCAAGCAGCCATTACAAGGAAACACAATTAGATTGGGCTAGTCCTTACCTTA
AAAAGAATTGTCACAAGTTCCTTGGCTGTAAAAGGTTTTAATAGGAATGCTGCAACTCCACCACTTTTTTCTACTTCTGC

Sequence_id_file

EG01

Output_filename

>EG01
AATTCAGAAAAGTGCCATAAAACAGGCAAATATAGCACCTATCTTCGATTCCCGATTGAGAAAGCTCGAATCATTAAGAA
GATGTAGGTTGGGTTGAGGAACGAGACCCAGCAATTCGATTCCATGTTGGATTAAACAATATCAACCAAGAGTTTGATGT

Follow below usage to avoid any error

perl script usage : perl script.pl Sequence_id_file  Fasta_file Output_filename
ADD REPLYlink modified 7 months ago • written 7 months ago by prakash480
0
gravatar for Hugo
7 months ago by
Hugo130
Universidade de Vigo, Ourense (Spain)
Hugo130 wrote:

Hello,

we are currently developing SEDA http://www.sing-group.org/seda, a multi-platform application for processing and filtering FASTA files. I guess that what you need can be achieved with the "Pattern filtering" option, section 3.2 of the manual: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-1.0.0.pdf. In this option, you should select "Header" and then load your ID list using the "Import patterns" button. This way only sequences that contains a ID in your list will be kept.

Regards,

Hugo.

ADD COMMENTlink modified 7 months ago • written 7 months ago by Hugo130

Thanks, but link is not working

ADD REPLYlink written 7 months ago by majeedaasim20

They should be working now.

ADD REPLYlink written 7 months ago by Hugo130
0
gravatar for genomax
7 months ago by
genomax55k
United States
genomax55k wrote:

I am certain faSomeRecords utility from Jim Kent (UCSC) will do what you need.

ADD COMMENTlink written 7 months ago by genomax55k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 606 users visited in the last hour