Question: Sort Sequences In A Fasta File According To The Sequence List In Another Fasta/Txt File
0
gravatar for biostar
5.3 years ago by
biostar150
biostar150 wrote:

Hi Guys, I have a fasta file with thousands of sequences and another text file for the sequence ids ( each id in a new line and they are the list of same sequences but in different order than the sequences in the fasta file). Could your guys please help me sort the sequences in fasta file according to the ID list in different text file? Thanks!!

.fasta file looks like this:

>Seq3
ACTTTTGATACAATTAACAGGACGAAAATAATAGAAAAGCTAAAGCATCTTAGAATCCCA
>Seq4
AATCCCAGACAAATTAAGACATATTCTAACAGTGAGTCTACAGAACACAGAACACTATAG
>Seq1
AGTTTTGCAATGGTAAATTATTTTGAAGAGTTTATAGGTCGTGTCTGGAACTGCAATTAT
>Seq2
TGGAATATTAGACGAATTCCATACACAGCACCTATTGTAATATTCATAGATTTCAAAAGC

The .txt file for Seq IDs looks like this:

Seq1
Seq2
Seq3
Seq4

The expected result should be:

>Seq1
AGTTTTGCAATGGTAAATTATTTTGAAGAGTTTATAGGTCGTGTCTGGAACTGCAATTAT
>Seq2
TGGAATATTAGACGAATTCCATACACAGCACCTATTGTAATATTCATAGATTTCAAAAGC
>Seq3
ACTTTTGATACAATTAACAGGACGAAAATAATAGAAAAGCTAAAGCATCTTAGAATCCCA
>Seq4
AATCCCAGACAAATTAAGACATATTCTAACAGTGAGTCTACAGAACACAGAACACTATAG
fasta sequence • 8.3k views
ADD COMMENTlink modified 4.5 years ago by Biostar ♦♦ 20 • written 5.3 years ago by biostar150

Just to be clear: the IDs in the fasta file are integers, but in the related text file they are prefixed with "Seq"?

ADD REPLYlink written 5.3 years ago by Neilfws48k

Sorry, the IDs are both alpha-numeric and I corrected that above.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by biostar150
1
gravatar for Pavel Senin
5.3 years ago by
Pavel Senin1.9k
Los Alamos, NM
Pavel Senin1.9k wrote:

That was answered before, Extracting Sequence From A 3Gb Fasta File, but anyway, why not to use an index for thousands of sequences?

use strict;
use Bio::Index::Fasta;

# file names
#
my $In_Fasta_File_Name = "test.fa";
my $List_File_Name     = "list.txt";

#
# make index
#
my $Index_File_Name = "tmp.idx";
my $idx             = Bio::Index::Fasta->new(
 '-filename'   => $Index_File_Name,
 '-write_flag' => 1
);
$idx->make_index($In_Fasta_File_Name);

#
# open the list
#
open( my $list, $List_File_Name ) or die "Could not open $List_File_Name !";

#
# write to stdout using list and index
#
my $out = Bio::SeqIO->new( '-format' => 'Fasta', '-fh' => \*STDOUT );
while ( my $id = <$list> ) {
 chomp $id;
 my $seq = $idx->fetch($id); 
 $out->write_seq($seq);
}

and the output

>Seq1
AGTTTTGCAATGGTAAATTATTTTGAAGAGTTTATAGGTCGTGTCTGGAACTGCAATTAT
>Seq2
TGGAATATTAGACGAATTCCATACACAGCACCTATTGTAATATTCATAGATTTCAAAAGC
>Seq3
ACTTTTGATACAATTAACAGGACGAAAATAATAGAAAAGCTAAAGCATCTTAGAATCCCA
>Seq4
AATCCCAGACAAATTAAGACATATTCTAACAGTGAGTCTACAGAACACAGAACACTATAG
ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by Pavel Senin1.9k

Thanks Pavel! Thanks to everyone who contributed.

ADD REPLYlink written 5.3 years ago by biostar150
0
gravatar for AndreiR
5.3 years ago by
AndreiR240
São Paulo
AndreiR240 wrote:

A bash for loop?

for i in cat .txt|sed 's/Seq//'; do cat .fasta| grep ">$i" -A1 >> .sorted_fasta; done

ADD COMMENTlink written 5.3 years ago by AndreiR240
2

It might be easier to
while read ID; do grep -A1 ">$ID" FASTA ; done < ID_FILE

ADD REPLYlink written 5.3 years ago by PoGibas4.8k
2

Tip: You can substantially speed up "grep" by using "LC_ALL=C grep" instead. See for example here.

ADD REPLYlink written 5.3 years ago by Philipp Bayer6.0k
2

and by using -m 1 to avoid searching the rest of the file once the first (and only hit?) has been found. Notice also the ^ symbol to limit the search to the beginning of the line: while read ID ; do grep -m 1 -A 1 "^>$ID" FASTA ; done < ID_FILE

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Frédéric Mahé2.9k

There is a mistake in your grep : "^>$ID$" You need to do an exact match for the ID otherwise it will match with "22" when searching "2".

Moreover none of your solutions in bash work with multiline sequences. Here is a small working solution :

cat $fasta | grep "^>" | sort | while read ID ; do awk 'BEGIN{RS=">"; ORS="";} /^'${ID:1}'/{print ">" $0; exit(0);}' $fasta ; done

It sorts the ids and for each id, it extracts the sequence from fasta file.

ADD REPLYlink modified 13 months ago • written 13 months ago by veyssier0

A faster version : cat $fasta | awk 'BEGIN{RS=">"; FS="\n"; curl=1;} NR>1{print $1 " " curl " " curl+NF-2; curl=curl+NF-1}' | sort -n -k 1,1 | while IFS=' ' read -ra TAB; do beg=${TAB[1]}; end=${TAB[2]}; sed -n $beg','$end'p' $fasta; done It first generated lines like that : ID line_begin line_end. It sorts them by first column (ID) and for each of those sorted lines, it prints corresponding sequence with ID.

I also found out there was an amazingly fast tool named fastasort in exonerate suite. It's available in Debian/Ubuntu repositories or there : https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate

fastasort runs approximately 100x faster than my script.

ADD REPLYlink modified 13 months ago • written 13 months ago by veyssier0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1165 users visited in the last hour