Question: How to find the longest sequence for each cluster of sequences in a FASTA file using Python?
2.8 years ago by
grayapply2009160
United States
grayapply2009160 wrote:

I have a FASTA file in which sequences are clustered and sorted by ID. I want to find the longest sequence in each cluster and write those sequences to a new file. How can I do this with Python?

Here is the format of my FASTA file:

>abc var1

kdfafaljflasjfalsjfaljfs

>abc var2

lasuowiejwaljflaj

>abc var3

lajflasjfowijflasjfopiefjjkfldfjqop

>dce var1

owiepqfpufaplddfpqoiwejlkdf

>dce var2

qopwelsmdfljfaldjfaopif

>red var1

alsdfowejfsladfjojflsdfjsdfjaslfjk

>red var2

lsdfjjqowjelsaflasflfnkdaflasfj

>red var3

kahfiqwuefkasdnkashdfiqfkasjdfh

>red var4

akhqioweadhauisydklsdfksdyiofjasldfhihladfni

longest common python fasta • 1.4k views
modified 2.8 years ago • written 2.8 years ago by grayapply2009160
2.8 years ago by
dbrowne.up60
United States
dbrowne.up60 wrote:

Check out the Python module pyfaidx: https://github.com/mdshw5/pyfaidx

It makes this sort of task much easier. You may have to experiment a bit to get exactly what you want, but pyfaidx gives you a convenient interface for accessing each sequence in your file and its properties, e.g. length and name.

written 2.8 years ago by dbrowne.up60

It looks like a lot of work. I'm trying it. Thank you for your advice.

written 2.8 years ago by grayapply2009160

pyfaidx will not work on this type of FASTA because the indexing process splits each sequence name on whitespace, so you'd end up with non-unique identifiers. This was a design decision to match the samtools behavior.
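
To illustrate the point, here is a minimal sketch using some of the headers from the question. It does not call pyfaidx; the helper `index_key` is a hypothetical stand-in that only mimics the behavior described above by keeping the text before the first whitespace.

```python
from collections import Counter

# Headers taken from the example file in the question (without the leading ">").
headers = ["abc var1", "abc var2", "abc var3", "dce var1", "dce var2"]


def index_key(header):
    # Assumption: imitate name splitting on whitespace by keeping only
    # the first whitespace-delimited token of the header.
    return header.split()[0]


# Count how many records would share each truncated identifier.
key_counts = Counter(index_key(h) for h in headers)
duplicates = {key: n for key, n in key_counts.items() if n > 1}
print(duplicates)
```

Every cluster ID maps to more than one record here, so an index keyed on the truncated name cannot tell the variants apart.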

written 2.8 years ago by Matt Shirley8.7k

Thanks for pointing it out, Matt. I noticed that too. However, the integrated faidx command-line tool is really handy for doing other things with your FASTA file.

written 2.8 years ago by grayapply2009160
2.8 years ago by
grayapply2009160
United States
grayapply2009160 wrote:

Hey folks,

I found a solution in another post. Here is the link for those who are in the same boat as me.

How to extract the longest isoform from multi fasta file
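
For readers who want something self-contained, the general idea can be sketched in plain Python with no extra dependencies. This is not the code from the linked post. It assumes the cluster ID is the first whitespace-delimited word of each header (as in the example file above), and the function names `read_fasta`, `longest_per_cluster` and `write_fasta` are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def longest_per_cluster(records):
    """Return {cluster_id: (header, sequence)} with the longest record per cluster."""
    best = {}
    for header, seq in records:
        parts = header.split()
        # Assumption: the cluster ID is the first word of the header.
        cluster = parts[0] if parts else ""
        # Strict ">" keeps the first record seen when lengths are tied.
        if cluster not in best or len(seq) > len(best[cluster][1]):
            best[cluster] = (header, seq)
    return best


def write_fasta(best, out):
    """Write the selected records to an open text handle, one sequence per line."""
    for header, seq in best.values():
        out.write(f">{header}\n{seq}\n")
```

With an input handle `fin` and an output handle `fout` opened on your own files, `write_fasta(longest_per_cluster(read_fasta(fin)), fout)` writes one record per cluster, in the order the clusters first appear in the input.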

written 2.8 years ago by grayapply2009160