Question

how to subset sequences based on IDs, but order shall remain preserved?

1

Entering edit mode

6.2 years ago

majeedaasim ▴ 60

I have a fasta file and an ID list file. I try to subset fasta sequences from the fasta file based on ID list using

seqtk subseq <fastafile> ID_list > out_file

However in the output file the order of the sequences is not preserved. I need the sequences in the outfile in the same order as in the ID list . Thanks

seqtk • 3.1k views

ADD COMMENT • link updated 6.2 years ago by GenoMax 141k • written 6.2 years ago by majeedaasim ▴ 60

score 0 · Answer 1 · 2018-01-30

0

Entering edit mode

6.2 years ago

Prakash ★ 2.2k

I hope this perl script will solve your purpose Usage : perl script.pl Id_file file.fasta Output.fasta

#!/usr/bin/perl

@file1 = `cat $ARGV[0]`;
@file2 = `cat $ARGV[1]`;
%hash;
$id ;
foreach $fasta(@file2){
        chomp($fasta);
        if($fasta =~ '>'){
                $id = $fasta;}
        if($fasta !~ '>'){
                push @{$hash{$id}},$fasta;      }
}
open(OUT,">$ARGV[2]") or die;
foreach $id(@file1){
        chomp($id);
        $id =~s/^/>/;
        #print "$id\n";
        @seq;
        print  OUT "$id\n";
        foreach $key(keys %hash){
                if($id eq $key){
                        @seq = @{$hash{$key}};}}
                for($i =0; $i<scalar(@seq);$i++){
                        print OUT "$seq[$i]\n";}
}

ADD COMMENT • link 6.2 years ago by Prakash ★ 2.2k

0

Entering edit mode

Thanks prakash but it produces an unusual result It doesn't subset properly

ADD REPLY • link 6.2 years ago by majeedaasim ▴ 60

0

Entering edit mode

If I am not wrong you need fasta sequence for selected Id ? right?

ADD REPLY • link 6.2 years ago by Prakash ★ 2.2k

0

Entering edit mode

yes I have fasta sequence file and an ID file. THe sequences need to be extracted from the sequence file according to the entries in the ID file but the order of sequences in the outfile shall be same as ID file entries.

ADD REPLY • link 6.2 years ago by majeedaasim ▴ 60

0

Entering edit mode

can you please provide few example of Id ?

ADD REPLY • link 6.2 years ago by Prakash ★ 2.2k

0

Entering edit mode

sequence file

>abc
atgcggatacagataca
>def
atgcggccaaaatt
>ghi
atggggaattcacaac
>jkl
aggatattatatccg

Now filter according to Ids list like

abc
jkl

ADD REPLY • link 6.2 years ago by majeedaasim ▴ 60

0

Entering edit mode

I have edited the script, I hope its working fine now

ADD REPLY • link 6.2 years ago by Prakash ★ 2.2k

0

Entering edit mode

I tried it but it produces the same contents as ID file. NO sequences are extracted.

ADD REPLY • link 6.2 years ago by majeedaasim ▴ 60

0

Entering edit mode

I think your Id file contains sequence id as >abc , >jkl. So accordingly i have edited the script. If that is the case , then the script should give intended result.

ADD REPLY • link 6.2 years ago by Prakash ★ 2.2k

0

Entering edit mode

My ID file entries are like

Tw_Wn_DN127031_c0_g1_i5
Tw_Wn_DN118882_c0_g1_i2
Tw_Wn_DN121684_c0_g1_i2
Tw_Wn_DN127037_c0_g1_i1
Tw_Wn_DN134530_c2_g4_i7
Tw_Wn_DN117797_c0_g1_i1

the script says Died at script.pl line 14.

ADD REPLY • link 6.2 years ago by majeedaasim ▴ 60

0

Entering edit mode

The above script should work for input file provided below:

Fasta_file

>EG01
AATTCAGAAAAGTGCCATAAAACAGGCAAATATAGCACCTATCTTCGATTCCCGATTGAGAAAGCTCGAATCATTAAGAA
GATGTAGGTTGGGTTGAGGAACGAGACCCAGCAATTCGATTCCATGTTGGATTAAACAATATCAACCAAGAGTTTGATGT
>EG02
CAGGATAACTCCAATTTTTCTTTTAGAGTCAAGCAGCCATTACAAGGAAACACAATTAGATTGGGCTAGTCCTTACCTTA
AAAAGAATTGTCACAAGTTCCTTGGCTGTAAAAGGTTTTAATAGGAATGCTGCAACTCCACCACTTTTTTCTACTTCTGC

Sequence_id_file

EG01

Output_filename

>EG01
AATTCAGAAAAGTGCCATAAAACAGGCAAATATAGCACCTATCTTCGATTCCCGATTGAGAAAGCTCGAATCATTAAGAA
GATGTAGGTTGGGTTGAGGAACGAGACCCAGCAATTCGATTCCATGTTGGATTAAACAATATCAACCAAGAGTTTGATGT

Follow below usage to avoid any error

perl script usage : perl script.pl Sequence_id_file  Fasta_file Output_filename

ADD REPLY • link 6.2 years ago by Prakash ★ 2.2k

score 0 · Answer 2 · 2018-01-30

0

Entering edit mode

6.2 years ago

Hugo ▴ 380

Hello,

we are currently developing SEDA http://www.sing-group.org/seda, a multi-platform application for processing and filtering FASTA files. I guess that what you need can be achieved with the "Pattern filtering" option, section 3.2 of the manual: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-1.0.0.pdf. In this option, you should select "Header" and then load your ID list using the "Import patterns" button. This way only sequences that contains a ID in your list will be kept.

Regards,

Hugo.

ADD COMMENT • link 6.2 years ago by Hugo ▴ 380

0

Entering edit mode

Thanks, but link is not working

ADD REPLY • link 6.2 years ago by majeedaasim ▴ 60

0

Entering edit mode

They should be working now.

ADD REPLY • link 6.2 years ago by Hugo ▴ 380

score 0 · Answer 3 · 2018-01-30

0

Entering edit mode

6.2 years ago

GenoMax 141k

I am certain faSomeRecords utility from Jim Kent (UCSC) will do what you need.

ADD COMMENT • link 6.2 years ago by GenoMax 141k