Post Your Preferred Bioinformatics Short Code
7
10
Entering edit mode
13.0 years ago

Hi,

I'm interested in seeing your favorite bit of code that you programmed yourself for bioinformatics purposes.

It can be a function, class, parser, you name it! Mention the language you coded it in.

Cheers to you bioinformaticians!

code • 4.5k views
1
Entering edit mode

you should ask for any neat code not just Python

0
Entering edit mode

You are probably right. This is a bit narrow for a forum of bioinformaticians comprising 130 people :)

0
Entering edit mode

I thought this would be about preferred acronyms for bioinformatics; mine's bfx

9
Entering edit mode
13.0 years ago
Paulo Nuin ★ 3.7k
grep -c "^>" [fasta_file]

6
Entering edit mode

Don't forget: grep ">" some.fasta shows you the headers in your FASTA file, while grep > some.fasta overwrites your FASTA file. (Not that I ever made that mistake, ahem.)

1
Entering edit mode

Well, this certainly qualifies for the "short" portion of the question :P

1
Entering edit mode

I was hoping there was a "like" button.

1
Entering edit mode

This almost sounds as bad as a bash script labeled "Unix Russian Roulette" I came across some time ago. The script was basically a short one-liner that gave you a one-in-six chance (if you can call it that) of erasing your disk. I was Linux-naive enough to try it and, luckily, I didn't put "sudo" in front of it to gain admin rights...

0
Entering edit mode

Too much Facebook does that! To express the fact that you 'like' this post (but you probably know this :), press the grey arrow above the digit displayed on the left side of the question :) Cheers!

3
Entering edit mode
13.0 years ago

My favorite code is the kind that lets me use powerful algorithms directly from my favorite language without any penalty. Here is one such example using the bx-python intersecter (a module written by Brent Pedersen). It runs 100,000 queries against a prebuilt data structure containing 100,000 reference intervals. Each interval has a random size of up to 10,000 and lies within a region of 10 million. And it only takes ... drumroll ... 7 seconds, thus performing at around 50 µs per query!

import time, random
from itertools import imap
from bx.intervals.intersection import Intersecter, Interval

def random_interval(x):
    start = random.randint(1, 10**7)
    end = start + random.randint(1, 10**4)
    return start, end

# create 100K reference intervals
source_intervals = imap(random_interval, xrange(10**5))

print "Building:",
tick = time.time()
inter = Intersecter()
for i, (start, end) in enumerate(source_intervals):
    inter.add_interval(Interval(start, end, i))
print time.time() - tick

# create 100K query intervals
query_intervals = imap(random_interval, xrange(10**5))

print "Using:",
tick = time.time()
for start, end in query_intervals:
    inter.find(start, end)
print time.time() - tick


The output is:

---------- Python 2.5 ----------
Building: 2.11999988556
Using: 5.25200009346

Output completed (7 sec consumed) - Normal Termination

0
Entering edit mode

Cool! Glad others are using that. I should say that someone else came up with the initial pure Python version in bx-python; I just tuned it up and Cython-ized it (and added a lot of tests).

2
Entering edit mode
12.9 years ago
chomp(@a = `cat list.txt`);
foreach $a (@a)
{
    system("whatever_command $a");   # BLAST, hmmpfam, a custom program, ...
}


Most of my projects start with a list of IDs; the first step is usually a BLAST or hmmpfam search, or running a custom program over the files named in list.txt.

2
Entering edit mode

This is Perl, right? Bash has the same features.

0
Entering edit mode

Yes Nuin, this is probably a piece of code that I have used in all my projects.

2
Entering edit mode
12.9 years ago
Nico ▴ 180

I think a clear help function is really important for users. So, even if it is not bioinformatics code, this is a useful script for me: a small piece of code that reproduces the look of a "man" page.

import sys

print """
test(1)                           User Commands                         test(1)

\033[1mNAME\033[0m
\t%s - small description

\033[1mSYNOPSIS\033[0m
\t\033[1mpython %s \033[0m[\033[4mOPTION\033[0m]... [\033[4mFILE\033[0m]...

\033[1mDESCRIPTION\033[0m
\tLong description

\twith differents paragraphs

\t\033[1m-f, --file\033[0m
\t\tThe file...

\t\033[1m-h, --help\033[0m
\t\tDisplay the man of this program

\033[1mAUTHOR\033[0m
\tWritten by Nico.
""" % (sys.argv[0],sys.argv[0])


A screenshot showing the result:

It is definitely NOT a bioinformatic code... ^^'
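For comparison, the standard library's argparse module generates similar help text automatically from the declared options; a minimal sketch (the option names mirror the hand-written example above and are otherwise made up):

```python
import argparse

# argparse derives the usage line, option list, and descriptions
# for --help automatically from the declared arguments.
parser = argparse.ArgumentParser(
    prog="test",
    description="Long description, with different paragraphs.",
)
parser.add_argument("-f", "--file", help="The file...")

# Parsing a sample command line instead of sys.argv for the demo:
args = parser.parse_args(["-f", "input.txt"])
print(args.file)  # input.txt
```

Running the script with -h then prints the generated help page, so the manual string formatting (and the `\033[1m` escape codes) is only needed if you want the exact bold man-page look.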

1
Entering edit mode
13.0 years ago

Here is my own example. Although simplistic, I have used the following FASTA parser numerous times. Coded quick and dirty in Python, it produces a list of lists: each inner list contains two strings, the sequence name and the sequence itself. Here it is:

def read_fasta(path):
    """
    Read a FASTA file into a list of lists.
    Each inner list contains a name and a sequence.
    """
    out = []
    last_record = -1
    with open(path) as file_:
        for line in file_:
            if line.startswith(">"):
                contig_name = line.split()[0]
                contig_seq = ""
                out.append([contig_name, contig_seq])
            else:
                out[last_record][1] += line.rstrip()
    return out


Cheers!

2
Entering edit mode

FYI: for a high-performance Python-based FASTA library see pyfasta, pypi.python.org/pypi/pyfasta
I wish I had heard of this package sooner.

1
Entering edit mode

Python string concatenation is very inefficient, especially for long strings. You would be much better off storing a list of lines and joining them at the end.
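To illustrate the suggestion (with made-up sequence data), a minimal sketch of the accumulate-then-join pattern:

```python
# Hypothetical input: 1000 sequence lines of 4 bases each.
lines = ["ACGT"] * 1000

# Recommended pattern: collect the lines in a list, join once at the end.
chunks = []
for line in lines:
    chunks.append(line)    # O(1) amortized append
seq = "".join(chunks)      # one O(n) join at the end

# Concatenation version: copies the growing string on each +=,
# which is what makes it slow for long sequences.
seq2 = ""
for line in lines:
    seq2 += line

assert seq == seq2
print(len(seq))            # 4000
```

Both loops build the same string; only the time complexity differs.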

0
Entering edit mode

This code is elegant in being able to return full records while avoiding an additional conditional or having to flush at the end. It would help its readability if you renamed line_counter to last_record (or maybe did not name it at all and just kept -1).

0
Entering edit mode

You are right. The line counter is an artifact from an earlier attempt with a different twist. I changed it to last_record as you suggest, since that is exactly what it stands for.

0
Entering edit mode

I don't like it very much: there are a lot of FASTA format parsers already, and I feel it is not good to write a new one, except for didactic purposes. Moreover, you should write at least a few tests to demonstrate that your code works correctly.

0
Entering edit mode

We can be fairly sure the code is tested; this is just not the right place to show or distribute the tests. As for FASTA parsing, there are indeed many choices, though the vast majority of them have substantial limitations. Although a seemingly simple task, parsing a FASTA file is not nearly as easy as most people assume. Take your FASTA reader and use it to extract the last 100 bases of chromosome 1 of the human genome. Now use the same reader to count 50 million short reads. You will soon see that most implementations fail spectacularly at one or both of these tasks.
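A hypothetical sketch of the streaming side of that trade-off (the function name and demo input are invented for illustration): a generator-based reader yields one record at a time, so counting 50 million reads never holds more than one record in memory.

```python
import io

def iter_fasta(handle):
    """Yield (name, sequence) tuples one record at a time.

    Streaming keeps memory flat for huge read sets; joining the
    collected lines once avoids quadratic string concatenation.
    """
    name, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:          # flush the final record
        yield name, "".join(chunks)

# Demo on an in-memory file:
demo = io.StringIO(">read1 description\nACGT\nTTGA\n>read2\nGGCC\n")
for name, seq in iter_fasta(demo):
    print(name, len(seq))
```

Fast random access (the chromosome-slicing case) needs a different design entirely, typically an on-disk index in the style of samtools faidx, which is why one parser rarely serves both tasks well.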

0
Entering edit mode

I see very justified criticism of my FASTA script. Maybe I can offer an explanation as to why it figures here :) The main reason this is my FAVORITE (which is the question) is that this specific script was at the core of my Python learning process. Python has opened up tons of interesting applications for me in my everyday work. It is also an example of how doing complex things (here, at best 'lightly' complex) is made easy and even beautiful with some scripting in your favorite language. I do not even suggest you use it; as mentioned, there are dependable parsers out there :)

0
Entering edit mode

@Perry Wow, your discussion is pretty interesting. Although I feel pretty confident whenever I do some Python coding, I admit that my knowledge of the standard library is too limited. This gives me one more good reason to dive into itertools! Thanks

1
Entering edit mode
7.9 years ago
michael.ante ★ 3.8k

For batch processing I like to use GNU parallel, especially for single-threaded tools. For instance:

parallel --results gene_counts -j 8 htseq-count -f bam -r pos {} my.gtf ::: *bam


For each BAM file, it stores the htseq-count stdout and stderr output under gene_counts/1/.

0
Entering edit mode
7.9 years ago
Prakki Rama ★ 2.6k
sed -i 's/>chr/>Chr/g' genome.fasta