Question: how to subset sequences based on IDs, but order shall remain preserved?
0
gravatar for majeedaasim
2.4 years ago by
majeedaasim40
United States
majeedaasim40 wrote:

I have a fasta file and an ID list file. I try to subset fasta sequences from the fasta file based on ID list using

seqtk subseq <fastafile> ID_list > out_file

However in the output file the order of the sequences is not preserved. I need the sequences in the outfile in the same order as in the ID list . Thanks

seqtk • 1.1k views
ADD COMMENTlink modified 2.4 years ago by genomax85k • written 2.4 years ago by majeedaasim40
0
gravatar for Prakash
2.4 years ago by
Prakash1.9k
India
Prakash1.9k wrote:

I hope this perl script will solve your purpose Usage : perl script.pl Id_file file.fasta Output.fasta

#!/usr/bin/perl

@file1 = `cat $ARGV[0]`;
@file2 = `cat $ARGV[1]`;
%hash;
$id ;
foreach $fasta(@file2){
        chomp($fasta);
        if($fasta =~ '>'){
                $id = $fasta;}
        if($fasta !~ '>'){
                push @{$hash{$id}},$fasta;      }
}
open(OUT,">$ARGV[2]") or die;
foreach $id(@file1){
        chomp($id);
        $id =~s/^/>/;
        #print "$id\n";
        @seq;
        print  OUT "$id\n";
        foreach $key(keys %hash){
                if($id eq $key){
                        @seq = @{$hash{$key}};}}
                for($i =0; $i<scalar(@seq);$i++){
                        print OUT "$seq[$i]\n";}
}
ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Prakash1.9k

Thanks prakash but it produces an unusual result It doesn't subset properly

ADD REPLYlink written 2.4 years ago by majeedaasim40

If I am not wrong you need fasta sequence for selected Id ? right?

ADD REPLYlink written 2.4 years ago by Prakash1.9k

yes I have fasta sequence file and an ID file. THe sequences need to be extracted from the sequence file according to the entries in the ID file but the order of sequences in the outfile shall be same as ID file entries.

ADD REPLYlink written 2.4 years ago by majeedaasim40

can you please provide few example of Id ?

ADD REPLYlink written 2.4 years ago by Prakash1.9k

sequence file

>abc
atgcggatacagataca
>def
atgcggccaaaatt
>ghi
atggggaattcacaac
>jkl
aggatattatatccg

Now filter according to Ids list like

abc
jkl
ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by majeedaasim40

I have edited the script, I hope its working fine now

ADD REPLYlink written 2.4 years ago by Prakash1.9k

I tried it but it produces the same contents as ID file. NO sequences are extracted.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by majeedaasim40

I think your Id file contains sequence id as >abc , >jkl. So accordingly i have edited the script. If that is the case , then the script should give intended result.

ADD REPLYlink written 2.4 years ago by Prakash1.9k

My ID file entries are like

Tw_Wn_DN127031_c0_g1_i5
Tw_Wn_DN118882_c0_g1_i2
Tw_Wn_DN121684_c0_g1_i2
Tw_Wn_DN127037_c0_g1_i1
Tw_Wn_DN134530_c2_g4_i7
Tw_Wn_DN117797_c0_g1_i1

the script says Died at script.pl line 14.

ADD REPLYlink written 2.4 years ago by majeedaasim40

The above script should work for input file provided below:

Fasta_file

>EG01
AATTCAGAAAAGTGCCATAAAACAGGCAAATATAGCACCTATCTTCGATTCCCGATTGAGAAAGCTCGAATCATTAAGAA
GATGTAGGTTGGGTTGAGGAACGAGACCCAGCAATTCGATTCCATGTTGGATTAAACAATATCAACCAAGAGTTTGATGT
>EG02
CAGGATAACTCCAATTTTTCTTTTAGAGTCAAGCAGCCATTACAAGGAAACACAATTAGATTGGGCTAGTCCTTACCTTA
AAAAGAATTGTCACAAGTTCCTTGGCTGTAAAAGGTTTTAATAGGAATGCTGCAACTCCACCACTTTTTTCTACTTCTGC

Sequence_id_file

EG01

Output_filename

>EG01
AATTCAGAAAAGTGCCATAAAACAGGCAAATATAGCACCTATCTTCGATTCCCGATTGAGAAAGCTCGAATCATTAAGAA
GATGTAGGTTGGGTTGAGGAACGAGACCCAGCAATTCGATTCCATGTTGGATTAAACAATATCAACCAAGAGTTTGATGT

Follow below usage to avoid any error

perl script usage : perl script.pl Sequence_id_file  Fasta_file Output_filename
ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Prakash1.9k
0
gravatar for Hugo
2.4 years ago by
Hugo290
Universidade de Vigo, Ourense (Spain)
Hugo290 wrote:

Hello,

we are currently developing SEDA http://www.sing-group.org/seda, a multi-platform application for processing and filtering FASTA files. I guess that what you need can be achieved with the "Pattern filtering" option, section 3.2 of the manual: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-1.0.0.pdf. In this option, you should select "Header" and then load your ID list using the "Import patterns" button. This way only sequences that contains a ID in your list will be kept.

Regards,

Hugo.

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Hugo290

Thanks, but link is not working

ADD REPLYlink written 2.4 years ago by majeedaasim40

They should be working now.

ADD REPLYlink written 2.4 years ago by Hugo290
0
gravatar for genomax
2.4 years ago by
genomax85k
United States
genomax85k wrote:

I am certain faSomeRecords utility from Jim Kent (UCSC) will do what you need.

ADD COMMENTlink written 2.4 years ago by genomax85k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1115 users visited in the last hour