Sort Sequences In A Fasta File According To The peg List
6
1
Entering edit mode
8.7 years ago
k.kathirvel93 ▴ 310

Hi, Guys I have a fasta file with thousands of sequences with peg IDs. Can anyone please help me to sort out the sequences in fasta file according to their peg ID....... Thanks in advance !!

.fasta file looks...

>fig|6666666.167416.peg.1
MYVAGHEGIELQPLSAADDAEARRLADEYFSRIDPAGR
>fig|6666666.167416.peg.3
MIAHASTPYSRKRGEPGPPHGPGLRRNEPHSGSTPFSL
>fig|6666666.167416.peg.2
MTHNQCLDLLESAEDTLDFLKSSLTYLGSPQKTENKAR
>fig|6666666.167416.peg.7
MHELQQALANLNTVLRRLDRNPAQYLLGGENIEETKP
>fig|6666666.167416.peg.4
MLLAAGRNRSARAAAREATGQDRCGGKRTGRKSGFPE
>fig|6666666.167416.peg.8
MLGHGRLQGSVRWRGAARKAVGGFLPSGLRPHHEISE
>fig|6666666.167416.peg.5
MDVRAPRAAPTGSDWRCRGRFSFFPRRSSFRSGARRL
>fig|6666666.167416.peg.6
MQIMVIEAMKEGADPEALLSSAQKVIDERTKELDKLD

The expected result should be like this :

>fig|6666666.167416.peg.1
MYVAGHEGIELQPLSAADDAEARRLADEYFSRIDPAGR
>fig|6666666.167416.peg.2
MTHNQCLDLLESAEDTLDFLKSSLTYLGSPQKTENKAR
>fig|6666666.167416.peg.3
MIAHASTPYSRKRGEPGPPHGPGLRRNEPHSGSTPFSL
>fig|6666666.167416.peg.4
MLLAAGRNRSARAAAREATGQDRCGGKRTGRKSGFPE
>fig|6666666.167416.peg.5
MDVRAPRAAPTGSDWRCRGRFSFFPRRSSFRSGARRL
>fig|6666666.167416.peg.6
MQIMVIEAMKEGADPEALLSSAQKVIDERTKELDKLD
>fig|6666666.167416.peg.7
MHELQQALANLNTVLRRLDRNPAQYLLGGENIEETKP
>fig|6666666.167416.peg.8
MLGHGRLQGSVRWRGAARKAVGGFLPSGLRPHHEISE
RNA-Seq sequence next-gen • 4.1k views
ADD COMMENT
0
Entering edit mode

Thanks, Pierre.... I got linearised fasta with help of linearizefasta.awk. I am sry to say, Am new to awk. let me know what should I do next to get ouput like this....

The expected result should be like this :

>fig|6666666.167416.peg.1
MYVAGHEGIELQPLSAADDAEARRLADEYFSRIDPAGR
>fig|6666666.167416.peg.2
MTHNQCLDLLESAEDTLDFLKSSLTYLGSPQKTENKAR
>fig|6666666.167416.peg.3
MIAHASTPYSRKRGEPGPPHGPGLRRNEPHSGSTPFSL
>fig|6666666.167416.peg.4
MLLAAGRNRSARAAAREATGQDRCGGKRTGRKSGFPE
>fig|6666666.167416.peg.5
MDVRAPRAAPTGSDWRCRGRFSFFPRRSSFRSGARRL
>fig|6666666.167416.peg.6
MQIMVIEAMKEGADPEALLSSAQKVIDERTKELDKLD
>fig|6666666.167416.peg.7
MHELQQALANLNTVLRRLDRNPAQYLLGGENIEETKP
>fig|6666666.167416.peg.8
MLGHGRLQGSVRWRGAARKAVGGFLPSGLRPHHEISE

And I can't understand this awk command. Where should I mention my input file in this command.......Thanks again.....looking forward......

ADD REPLY
0
Entering edit mode

You are giving your input file in the provided link input.fa. After replacing input.fa with your file name add the Pierre's answer (now you understand 3 dots --> linearizing fasta | answer below)

ADD REPLY
0
Entering edit mode

Dear Venu,

Thanks for your idea, but it's making problem for my data......it's not working proberly......will you suggest something else...

ADD REPLY
0
Entering edit mode

it's not working properly- What is it showing? Any error on the screen?

ADD REPLY
0
Entering edit mode

Yes,

It's Running.....but the output.fa file is empty...

ADD REPLY
0
Entering edit mode

But it's working properly for me. You are using linux right?

ADD REPLY
0
Entering edit mode

yes, am using ubuntu 14.04

ADD REPLY
1
Entering edit mode
8.7 years ago

linearize fasta

Use awk to extract the peg id (split by '.', get the last item) , sort , remove the first column containing the id and convert back to fasta

... | awk '{N=split($1,a,/\./); printf("%s\t%s\n",a[N],$0);}' | sort -t '(insert-a-tab-here)' -k1,1n | cut -f 2,3 | tr "\t" "\n"
ADD COMMENT
1
Entering edit mode
8.7 years ago
venu 7.1k

Extract ids and sort them

grep '^>' file.fa | sort > ids.txt

Then a simple bash loop

while read id
do
    grep -A1 "$id" file.fa >> sorted_fasta.fa
done < ids.txt

And use -m 1 with grep if it is taking a long time (It stops searching entire file if the id is found once).

ADD COMMENT
1
Entering edit mode
8.7 years ago

This should also work:

grep '^>' file.fa | sed 's/^>//' | sort > ids.txt
samtools faidx file.fa `cat ids.txt` > new.fa

It should be very fast.

ADD COMMENT
1
Entering edit mode
8.7 years ago

ADD COMMENT
0
Entering edit mode

Dear Matt,

Here the Error, Pls check this out...

sorted_fasta.write('>' + peg_fasta[id].longname)
AttributeError: 'FastaRecord' object has no attribute 'longname'

What should I mention here in the longname position... Thanks again..

ADD REPLY
0
Entering edit mode

Sorry - I forgot the underscore (I shouldn't have given two separate interfaces in the packages "longname" and "long_name"...) and the updated code should work. The FastaRecord.long_name property actually reads the full sequence name from the FASTA file, instead of the FastaRecord.name attribute which is the record's name from the .fai index file, which will lose anything after the first whitespace character.

ADD REPLY
1
Entering edit mode
8.7 years ago
5heikki 11k

Assuming the data is as in your example (no line breaks in sequences and all headers are in fig|6666666.167416.peg.$number format)

for next in $(grep "^>" file.fa | sort -k4,4g -t "."); do grep -m 1 -wA 1 "$next" file.fa; done > sorted.fa

If you have proper GNU sort tool then:

for next in $(grep "^>" file.fa | sort --parallel $NumCoresYouHave -k4,4g -t "."); do grep -m 1 -wA 1 "$next" file.fa; done > sorted.fa
ADD COMMENT
1
Entering edit mode
8.7 years ago
John 13k

Just for fun, here's a superfast in-memory sort using numpy. It uses 16bytes per FASTA/FASTQ entry, and i'm confident you could get that down to just 12 or even 8 (depends on the input) with a minor tweak:

import numpy
import subprocess
inFile = "./someFile.fasta"
linesInFile = int(subprocess.check_output('wc -l ' + inFile, shell=True).strip().split()[0])
a = numpy.arange(linesInFile)                       # Struct to store line & ID
with open(inFile,'rb') as f:
    ID_ROW = False
    digits = str.isdigit                            # reduce lookups later on
    for lineNo,line in enumerate(f):
        ID_ROW = not ID_ROW                         # because its opposites day
        if ID_ROW:
            a[lineNo+1] = int(filter(digits, line)) # Keep the IDs numbers only
a.view('i8,i8').sort(order=['f1'], axis=0) # In-place sort rows by ID
a.view('i8,i8')['f0']                      # Order of rows in new file

On your demo data:

array([ 0,  4,  2,  8, 12, 14,  6, 10])

You don't really need the wc -l, since you can grow numpy arrays as you add data to them (manually or with MMU).

I haven't included a way to write out the sorted file as a new FASTA file because there are many ways to do that depending on the size of the data, what kind of hard drive you have, etc. Personally I would store the DNA in the numpy array too (in either 2bit or up2bit format), and then from the table you could write a new FASTA file at once. If that takes up too much RAM, one could find a tool (or write one) that takes an input file,a list of the order of the rows for the new file (which is what the above prints), and outputs the shuffled file.

ADD COMMENT

Login before adding your answer.

Traffic: 953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6