Question: Efficient Sequence Length Comparison
Asked 11 months ago by alexander.trouern-trend

Hello BioStars,

An obstacle on the road to completing my transcriptomics project has presented me with a fun learning opportunity. The underlying question is: how do you EFFICIENTLY find the longest sequence in each of a large set of orthologous gene family groups (Orthogroups)?

Two disclaimers: 1.) I am a coding monolinguist (for now, just py!) 2.) I have a little computer science background

To start off, after running OrthoFinder to cluster a reference transcriptome into Orthogroups, I have two files that are important to this task.

    orthogroups.csv - Column 0 has the name of the Orthogroup; Column 1 has the list of transcripts from the reference transcriptome that clustered together to form that Orthogroup. There are roughly 64,000 Orthogroups.

    reference_transcriptome.fasta - Roughly 439,000 fasta sequences from several related species.

FIRST APPROACH:

    For each orthogroup in orthogroups.csv:
        get the list of transcripts that need to be compared
        for each transcript:
            go into reference_transcriptome.fasta and find the sequence
            measure its length using SeqIO from Biopython and store it for comparison
        print the name of the orthogroup and the longest transcript, found by comparing the stored sequence/length pairs

After watching this brute force approach go for a day or two with little progress I rethought the problem...

SECOND APPROACH: 1.) Using SeqIO from Biopython, I created a table with the name of each header in reference_transcriptome.fasta and its sequence length.

2.) For each Orthogroup:
         longestseq_len = 0
         longestseq_name = ""
         for each transcript:
             look up the transcript header in the table
                 if length of transcript >= longestseq_len:
                     longestseq_len = length
                     longestseq_name = header
         print name_of_orthogroup + longestseq_name
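For anyone who wants to try it, here is a minimal Python sketch of the two steps above. It is not my exact script. To keep it free of dependencies it reads the FASTA with a plain loop instead of Biopython (with Biopython, the same table could be built from `SeqIO.parse(path, "fasta")` using `record.id` and `len(record.seq)`). The function names `fasta_lengths` and `longest_per_orthogroup` are made up for illustration, and the sketch assumes that orthogroups.csv is tab-delimited with the transcripts in column 1 separated by commas, and that each transcript ID matches the first word of its FASTA header.

```python
import csv
import io


def fasta_lengths(handle):
    """Return {sequence_id: length} from a single pass over a FASTA stream."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited word of the header.
            words = line[1:].split()
            current = words[0] if words else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def longest_per_orthogroup(handle, lengths):
    """Yield (orthogroup_name, longest_transcript_id) for each row."""
    # Assumption: tab-delimited rows, column 0 = name, column 1 = "t1, t2, ...".
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue
        transcripts = [t.strip() for t in row[1].split(",") if t.strip()]
        known = [t for t in transcripts if t in lengths]
        if known:
            # max() keeps the first transcript on ties; the pseudocode's >= keeps the last.
            yield row[0], max(known, key=lengths.__getitem__)


# Tiny in-memory demo standing in for the real files.
demo_fasta = io.StringIO(">t1 some description\nACGT\nAC\n>t2\nACG\n")
demo_groups = io.StringIO("OG1\tt1, t2\nOG2\tt2\n")
demo_lengths = fasta_lengths(demo_fasta)
for name, longest in longest_per_orthogroup(demo_groups, demo_lengths):
    print(name, longest)
```

With the real files, the two `io.StringIO` objects would be replaced by `open("reference_transcriptome.fasta")` and `open("orthogroups.csv", newline="")`. The FASTA is read only once, and every later lookup is a dictionary access rather than a search through the file.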

This second approach works in a "reasonable" amount of time, but I'd bet that there is a method out there that can accomplish this in minutes if not seconds.

If anybody has insight about how to do this faster, please let me know!

Additionally, any tips (on posting, on communicating code ideas, or on necessary info that is missing or unnecessary info that is present) would be greatly appreciated.

Thanks, Alex
