Question: Post Your Prefered Bioinformatics Short Code
10
gravatar for Eric Normandeau
9.7 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

Hi,

I'm interested in seeing your favorite bit of code that you programmed yourself for bioinformatics purposes.

It can be a function, class, parser, you name it! Mention the language you coded it in.

Cheers to you bioinformaticians!

code • 3.4k views
ADD COMMENTlink modified 4.6 years ago by michael.ante3.5k • written 9.7 years ago by Eric Normandeau10k
1

you should ask for any neat code not just Python

ADD REPLYlink written 9.7 years ago by Istvan Albert ♦♦ 81k

You are probably right. This is a bit narrow for a forum of bioinformaticiant regrouping 130 people :)

ADD REPLYlink written 9.7 years ago by Eric Normandeau10k

I thought this would be about preferred acronyms for bioinformatics; mine's bfx

ADD REPLYlink written 4.6 years ago by russhh4.8k
9
gravatar for Paulo Nuin
9.7 years ago by
Paulo Nuin3.7k
Canada
Paulo Nuin3.7k wrote:
grep -c "^>" [fasta_file]
ADD COMMENTlink modified 15 months ago by RamRS25k • written 9.7 years ago by Paulo Nuin3.7k
6

dont forget: grep ">" some.fasta shows you the headers in your fasta file. While grep > some.fasta overwrites your fasta file. (not that i ever made that mistake, ahem).

ADD REPLYlink modified 15 months ago by RamRS25k • written 9.7 years ago by brentp23k
1

Well, this certainly qualifies for the "short" portion of the question :P

ADD REPLYlink written 9.7 years ago by Eric Normandeau10k
1

I was hoping there was a "like" button.

ADD REPLYlink written 9.7 years ago by Zhaorong1.2k
1

This almost sounds as bad as a bash script labeled "Unix Russian Roulette" I came across some time ago. The script was basically a short one liner that gave you a one in six chance (if you can call it that) of erasing your disk. I was linux-naive enough to try it and, luckylly, I didn't put "sudo" in front of it to gain admin rights... [sic]

ADD REPLYlink written 9.7 years ago by Eric Normandeau10k

Too much Facebook does that! To express the fact that you 'like' this post (but you probably know this :) press the grey arrow on top of the digit displaying on the left side of the question :) Cheers!

ADD REPLYlink written 9.7 years ago by Eric Normandeau10k
3
gravatar for Istvan Albert
9.7 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

My favorite codes are those that allow me to use powerful algorithms directly from my favorite language without any penalty. Here is such a code using the bx-python intersecter (a module written by Brent Pedersen). It queries 100,000 times a prebuilt data structure containing 100,000 reference intervals. Each interval is of a random size of up to 10,000 spanning a region of 10 million. And it only takes ... drumroll ... 7 seconds, thus performing at around 50 ms per query!

import time, random
from itertools import imap
from bx.intervals.intersection import Intersecter, Interval

def random_interval(x):
    start = random.randint(1, 10**7) 
    end = start + random.randint(1, 10**4) 
    return start, end

# create 100K reference intervals
source_intervals = imap(random_interval, xrange(10**5))

print "Building:",
tick = time.time()
inter = Intersecter()
for i, ( start, end ) in enumerate( source_intervals ):
    inter.add_interval( Interval( start, end, i ) )
print time.time( ) - tick

# create 100K query intervals
query_intervals = imap(random_interval, xrange(10**5))

print "Using:",
tick = time.time()
for start, end in query_intervals:
    inter.find( start, end )
print time.time( ) - tick

The output is:

---------- Python 2.5 ----------
Building: 2.11999988556
Using: 5.25200009346

Output completed (7 sec consumed) - Normal Termination
ADD COMMENTlink modified 15 months ago by RamRS25k • written 9.7 years ago by Istvan Albert ♦♦ 81k

cool! glad others are using that. i should say that someone else came up with the initial pure python version in bx-python, i just tuned it up and cython-ized it (and added a lot of tests).

ADD REPLYlink written 9.7 years ago by brentp23k
2
gravatar for Khader Shameer
9.7 years ago by
Manhattan, NY
Khader Shameer18k wrote:
chomp (@a = <cat list.txt>); 
foreach $a(@a) 
{
<whatever> system commands*; 
}

Most of my project starts with a list of IDs, first step will be mostly a BLAST or hmmpfam search or to run a custom-program using files in list.txt

ADD COMMENTlink modified 15 months ago by RamRS25k • written 9.7 years ago by Khader Shameer18k
2

This is perl, right? You have identical features with bash.

ADD REPLYlink written 9.7 years ago by Paulo Nuin3.7k

Yes Nuin, this is probably a piece of code that I have used in all my projects.

ADD REPLYlink written 9.7 years ago by Khader Shameer18k
2
gravatar for Nico
9.7 years ago by
Nico180
France
Nico180 wrote:

I think clear help function is really important for users. So, even if it is not a bioinformatic code, this is a useful code for me. It is a small code which reproduce the "man" aspect.

import sys

print """
test(1)                           User Commands                         test(1)

\033[1mNAME\033[0m
\t%s - small description

\033[1mSYNOPSIS\033[0m
\t\033[1mpython %s \033[0m[\033[4mOPTION\033[0m]... [\033[4mFILE\033[0m]...

\033[1mDESCRIPTION\033[0m
\tLong description

\twith differents paragraphs

\t\033[1m-f, --file\033[0m
\t\tThe file...

\t\033[1m-h, --help\033[0m
\t\tDisplay the man of this program

\033[1mAUTHOR\033[0m
\tWritten by Nico.
""" % (sys.argv[0],sys.argv[0])

A screenshot to show result:

alt text

It is definitely NOT a bioinformatic code... ^^'

ADD COMMENTlink modified 15 months ago by RamRS25k • written 9.7 years ago by Nico180
1
gravatar for Eric Normandeau
9.7 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

Here is my own example. Although simplistic, I have used the following FASTA parser numerous times. Coded quick and dirty in Python, it produces a list of lists. Each of the inner lists contains two strings, one with the sequence name and the other with the sequence itself. Here it is:

def read_fasta(path):

    """
    Read FASTA file into a list of lists.
    Each inner list contains a name and a sequence.

    """
    out = []
    last_record = -1
    with open(path) as file_:
        for line in file_:
            if line.startswith("&gt;"):
                contig_name = line.split()[0]
                contig_seq = ""
                out.append([contig_name, contig_seq])
            else:
                out[last_record][1] += line.rstrip()
    return out

Cheers!

ADD COMMENTlink modified 15 months ago by RamRS25k • written 9.7 years ago by Eric Normandeau10k
2

FYI: for a high performance python based Fasta library see PyFasta, pypi.python.org/pypi/pyfasta
I wish I would have heard of this package sooner

ADD REPLYlink written 9.7 years ago by Istvan Albert ♦♦ 81k
1

Python string concatenation is very inefficient and this is especially so for long strings. You would be much better off storing a list of lines and then joining them at the end.

ADD REPLYlink written 8.1 years ago by Gww2.7k

This code is elegant in being able to return full records while avoiding an additional conditional or having to flush at the end. It would help its readability it if you renamed line_counter to last_record (or maybe not even name it at all just keep it -1).

ADD REPLYlink written 9.7 years ago by Istvan Albert ♦♦ 81k

You are right. Line counter is an artifact from an earlier attempt with a different twist. I changed it to last_record as you suggest, since this is exactly what it stands for.

ADD REPLYlink written 9.7 years ago by Eric Normandeau10k

FYI: for a high performance python based Fasta library see PyFasta, http://pypi.python.org/pypi/pyfasta/ I for one wish I heard of this package sooner.

ADD REPLYlink written 9.7 years ago by Istvan Albert ♦♦ 81k

I don't like it very much: there are a lot of fasta format parser already and I feel it is not good to write a new one, if not for didactic purposes. Moreover, you should write at least a few tests to demonstrate that your code is working correctly.

ADD REPLYlink written 9.7 years ago by Giovanni M Dall'Olio26k

We can be fairly sure the code is tested - only that this is not the right place to show/distribute them. As for FASTA parsing, indeed there are many choices, though the vast majority of them have substantial limitations. Although a seemingly simple task, the problem of parsing a FASTA file not nearly as easy to solve as most people assume. Take your FASTA reader; use it to extract the last 100 bases of chrom1 of the human genome. Now use the same FASTA reader to count 50 million short reads with it. You will soon see that most implementation fail spectacularly at one or both of these tasks.

ADD REPLYlink written 9.7 years ago by Istvan Albert ♦♦ 81k

I see very justified critics concerning my FASTA script. Maybe I can offer an explanation as to why it figures here :) The main reason why this is my FAVORITE (which is the question) is that this specific script has been at the core of my Python learning process. Python has opened tons of interesting applications for me in my everyday work. It is also an example of how doing complex things (here at best 'lightly' complex) is rendered easy and even beautiful with some scripting in your favorite language. I do not even suggest you use this. As mentioned, there are dependable parsers out there :)

ADD REPLYlink written 9.7 years ago by Eric Normandeau10k

If you want to learn more, check out this post on using itertools.groupby for FASTA loading.

ADD REPLYlink modified 3 months ago by RamRS25k • written 9.7 years ago by Perry290

@Perry Wow, your discussion is pretty interesting. Although I feel pretty confident whenever I do some Python coding, I admit that my knowledge of the standard library is too limited. This gives me one more good reason do dive into itertools! Thanks

ADD REPLYlink written 9.7 years ago by Eric Normandeau10k
1
gravatar for michael.ante
4.6 years ago by
michael.ante3.5k
Austria/Vienna
michael.ante3.5k wrote:

For batch processing I like to use GNU parallel, especially for single-thread tools. For instance:

parallel --results gene_counts -j 8 htseq-count -f bam -r pos {} my.gtf ::: *bam

It stores in gene_counts/1 for each bam-file the htseq-count sdtout and stderr output.

ADD COMMENTlink modified 15 months ago by RamRS25k • written 4.6 years ago by michael.ante3.5k
0
gravatar for Prakki Rama
4.6 years ago by
Prakki Rama2.3k
Singapore
Prakki Rama2.3k wrote:
sed -i 's/>chr/>Chr/g' genome.fasta
ADD COMMENTlink modified 15 months ago by RamRS25k • written 4.6 years ago by Prakki Rama2.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 731 users visited in the last hour