Question: Sort Sequences In A Fasta File According To The peg List
1
gravatar for k.kathirvel93
3.8 years ago by
k.kathirvel93210
India
k.kathirvel93210 wrote:

Hi, Guys I have a fasta file with thousands of sequences with peg IDs. Can anyone please help me to sort out the sequences in fasta file according to their peg ID....... Thanks in advance !!

.fasta file looks...

>fig|6666666.167416.peg.1
MYVAGHEGIELQPLSAADDAEARRLADEYFSRIDPAGR
>fig|6666666.167416.peg.3
MIAHASTPYSRKRGEPGPPHGPGLRRNEPHSGSTPFSL
>fig|6666666.167416.peg.2
MTHNQCLDLLESAEDTLDFLKSSLTYLGSPQKTENKAR
>fig|6666666.167416.peg.7
MHELQQALANLNTVLRRLDRNPAQYLLGGENIEETKP
>fig|6666666.167416.peg.4
MLLAAGRNRSARAAAREATGQDRCGGKRTGRKSGFPE
>fig|6666666.167416.peg.8
MLGHGRLQGSVRWRGAARKAVGGFLPSGLRPHHEISE
>fig|6666666.167416.peg.5
MDVRAPRAAPTGSDWRCRGRFSFFPRRSSFRSGARRL
>fig|6666666.167416.peg.6
MQIMVIEAMKEGADPEALLSSAQKVIDERTKELDKLD

The expected result should be like this :

>fig|6666666.167416.peg.1
MYVAGHEGIELQPLSAADDAEARRLADEYFSRIDPAGR
>fig|6666666.167416.peg.2
MTHNQCLDLLESAEDTLDFLKSSLTYLGSPQKTENKAR
>fig|6666666.167416.peg.3
MIAHASTPYSRKRGEPGPPHGPGLRRNEPHSGSTPFSL
>fig|6666666.167416.peg.4
MLLAAGRNRSARAAAREATGQDRCGGKRTGRKSGFPE
>fig|6666666.167416.peg.5
MDVRAPRAAPTGSDWRCRGRFSFFPRRSSFRSGARRL
>fig|6666666.167416.peg.6
MQIMVIEAMKEGADPEALLSSAQKVIDERTKELDKLD
>fig|6666666.167416.peg.7
MHELQQALANLNTVLRRLDRNPAQYLLGGENIEETKP
>fig|6666666.167416.peg.8
MLGHGRLQGSVRWRGAARKAVGGFLPSGLRPHHEISE
rna-seq next-gen sequence • 1.7k views
ADD COMMENTlink modified 14 months ago by RamRS25k • written 3.8 years ago by k.kathirvel93210

Thanks, Pierre.... I got linearised fasta with help of linearizefasta.awk. I am sry to say, Am new to awk. let me know what should I do next to get ouput like this....

The expected result should be like this :

>fig|6666666.167416.peg.1
MYVAGHEGIELQPLSAADDAEARRLADEYFSRIDPAGR
>fig|6666666.167416.peg.2
MTHNQCLDLLESAEDTLDFLKSSLTYLGSPQKTENKAR
>fig|6666666.167416.peg.3
MIAHASTPYSRKRGEPGPPHGPGLRRNEPHSGSTPFSL
>fig|6666666.167416.peg.4
MLLAAGRNRSARAAAREATGQDRCGGKRTGRKSGFPE
>fig|6666666.167416.peg.5
MDVRAPRAAPTGSDWRCRGRFSFFPRRSSFRSGARRL
>fig|6666666.167416.peg.6
MQIMVIEAMKEGADPEALLSSAQKVIDERTKELDKLD
>fig|6666666.167416.peg.7
MHELQQALANLNTVLRRLDRNPAQYLLGGENIEETKP
>fig|6666666.167416.peg.8
MLGHGRLQGSVRWRGAARKAVGGFLPSGLRPHHEISE

And I can't understand this awk command. Where should I mention my input file in this command.......Thanks again.....looking forward......

ADD REPLYlink modified 14 months ago by RamRS25k • written 3.8 years ago by k.kathirvel93210

You are giving your input file in the provided link input.fa. After replacing input.fa with your file name add the Pierre's answer (now you understand 3 dots --> linearizing fasta | answer below)

ADD REPLYlink modified 14 months ago by RamRS25k • written 3.8 years ago by venu6.3k

Dear Venu,

Thanks for your idea, but it's making problem for my data......it's not working proberly......will you suggest something else...

ADD REPLYlink modified 14 months ago by RamRS25k • written 3.8 years ago by k.kathirvel93210

it's not working properly- What is it showing? Any error on the screen?

ADD REPLYlink modified 14 months ago by RamRS25k • written 3.8 years ago by venu6.3k

Yes,

It's Running.....but the output.fa file is empty...

ADD REPLYlink written 3.8 years ago by k.kathirvel93210

But its working properly for me. You are using linux right? 

ADD REPLYlink written 3.8 years ago by venu6.3k

yes, am using ubuntu 14.04

ADD REPLYlink written 3.8 years ago by k.kathirvel93210
1
gravatar for Pierre Lindenbaum
3.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

linearize fasta

Use awk to extract the peg id (split by '.', get the last item) , sort , remove the first column containing the id and convert back to fasta

... | awk '{N=split($1,a,/\./); printf("%s\t%s\n",a[N],$0);}' | sort -t '(insert-a-tab-here)' -k1,1n | cut -f 2,3 | tr "\t" "\n"
ADD COMMENTlink modified 14 months ago by RamRS25k • written 3.8 years ago by Pierre Lindenbaum124k
1
gravatar for venu
3.8 years ago by
venu6.3k
Germany
venu6.3k wrote:

Extract ids and sort them

grep '^>' file.fa | sort > ids.txt

Then a simple bash loop

while read id
do
    grep -A1 "$id" file.fa >> sorted_fasta.fa
done < ids.txt

And use -m 1 with grep if it is taking a long time (It stops searching entire file if the id is found once).

ADD COMMENTlink modified 14 months ago by RamRS25k • written 3.8 years ago by venu6.3k
1
gravatar for geek_y
3.8 years ago by
geek_y10k
Barcelona
geek_y10k wrote:

This should also work:

grep '^>' file.fa | sed 's/^>//' | sort > ids.txt
samtools faidx file.fa `cat ids.txt` > new.fa

It should be very fast.

ADD COMMENTlink modified 14 months ago by RamRS25k • written 3.8 years ago by geek_y10k
1
gravatar for Matt Shirley
3.8 years ago by
Matt Shirley9.2k
Cambridge, MA
Matt Shirley9.2k wrote:

ADD COMMENTlink written 3.8 years ago by Matt Shirley9.2k

Dear Matt,

Here the Error, Pls check this out...

sorted_fasta.write('>' + peg_fasta[id].longname)
AttributeError: 'FastaRecord' object has no attribute 'longname'

What should I mention here in the longname position... Thanks again..

ADD REPLYlink modified 14 months ago by RamRS25k • written 3.8 years ago by k.kathirvel93210

Sorry - I forgot the underscore (I shouldn't have given two separate interfaces in the packages "longname" and "long_name"...) and the updated code should work. The FastaRecord.long_name property actually reads the full sequence name from the FASTA file, instead of the FastaRecord.name attribute which is the record's name from the .fai index file, which will lose anything after the first whitespace character.

ADD REPLYlink modified 14 months ago by RamRS25k • written 3.8 years ago by Matt Shirley9.2k
1
gravatar for 5heikki
3.8 years ago by
5heikki8.6k
Finland
5heikki8.6k wrote:

Assuming the data is as in your example (no line breaks in sequences and all headers are in fig|6666666.167416.peg.$number format)

for next in $(grep "^>" file.fa | sort -k4,4g -t "."); do grep -m 1 -wA 1 "$next" file.fa; done > sorted.fa

If you have proper GNU sort tool then:

for next in $(grep "^>" file.fa | sort --parallel $NumCoresYouHave -k4,4g -t "."); do grep -m 1 -wA 1 "$next" file.fa; done > sorted.fa
ADD COMMENTlink modified 14 months ago by RamRS25k • written 3.8 years ago by 5heikki8.6k
1
gravatar for John
3.8 years ago by
John12k
Germany
John12k wrote:

Just for fun, here's a superfast in-memory sort using numpy. It uses 16bytes per FASTA/FASTQ entry, and i'm confident you could get that down to just 12 or even 8 (depends on the input) with a minor tweak:

import numpy
import subprocess
inFile = "./someFile.fasta"
linesInFile = int(subprocess.check_output('wc -l ' + inFile, shell=True).strip().split()[0])
a = numpy.arange(linesInFile)                       # Struct to store line & ID
with open(inFile,'rb') as f:
    ID_ROW = False
    digits = str.isdigit                            # reduce lookups later on
    for lineNo,line in enumerate(f):
        ID_ROW = not ID_ROW                         # because its opposites day
        if ID_ROW:
            a[lineNo+1] = int(filter(digits, line)) # Keep the IDs numbers only
a.view('i8,i8').sort(order=['f1'], axis=0) # In-place sort rows by ID
a.view('i8,i8')['f0']                      # Order of rows in new file

On your demo data:

array([ 0,  4,  2,  8, 12, 14,  6, 10])

You don't really need the wc -l, since you can grow numpy arrays as you add data to them (manually or with MMU).

I haven't included a way to write out the sorted file as a new FASTA file because there are many ways to do that depending on the size of the data, what kind of hard drive you have, etc. Personally I would store the DNA in the numpy array too (in either 2bit or up2bit format), and then from the table you could write a new FASTA file at once. If that takes up too much RAM, one could find a tool (or write one) that takes an input file,a list of the order of the rows for the new file (which is what the above prints), and outputs the shuffled file.

ADD COMMENTlink modified 14 months ago by RamRS25k • written 3.8 years ago by John12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1044 users visited in the last hour