Question: Bioinformatics "Cheat Sheet"
 
36
 
 

Inspired by Keith Robison's post on "cheat sheets", what would you put on a cheat sheet for bioinformatics? This might include one-line scripts, conversion factors, handy rules of thumb, etc.

Some of Keith's suggestions, which have a biology slant:

  • IUPAC ambiguity codes for nucleotides.
  • Amino acid single letter codes.
  • SI prefixes in order.
  • Powers of 2.
  • Tm calculation estimation using G+C and A+T counts.
  • 1 human genome ~= 7 pg of DNA
  • 1 bp = 660 daltons
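The Tm item above is commonly estimated with the Wallace rule, Tm ≈ 2 °C × (A+T) + 4 °C × (G+C), which is only a rough guide for short oligos. A minimal sketch follows; the helper name `tm_wallace` is made up for illustration and is not from Keith's post.

```shell
# Wallace rule estimate of melting temperature in degrees C.
# tm_wallace is a hypothetical helper name; the rule is only a rough guide for short oligos.
tm_wallace() {
    seq=$(printf '%s' "$1" | tr 'acgt' 'ACGT')       # normalise case
    at=$(printf '%s' "$seq" | tr -cd 'AT' | wc -c)   # count A and T
    gc=$(printf '%s' "$seq" | tr -cd 'GC' | wc -c)   # count G and C
    echo $(( 2 * at + 4 * gc ))
}

tm_wallace ACGTACGTACGT   # prints 36
```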
 
 
 
1

Could you please collect the answers and put them on a cheat sheet blog somewhere?

written 14 months ago by Chris Evelo
 

And/or incorporating answers here would be nice.

written 14 months ago by Michael Schubert
 

Hey! Brilliant idea to have a cheat sheet, but this list will grow endlessly unless you add subcategories, such as cheat sheets for researchers working in bioalgorithm development, genomics, data analysis, etc. That would make it more organised.

written 14 months ago by Dataminer
 
1

Instead of a blog post, maybe a GitHub repo with Markdown/LaTeX would be better?

written 14 months ago by Piotr Byzia

22 answers

 
17
 
 
  • 5' : left
  • 3' : right


 
 
 
1

+1 for the smile while reading ;)

written 14 months ago by Michael Schubert
 
4

-1 my apologies to Pierre as my objection is rather pedantic; if you are looking at coordinates relative to the forward strand (e.g. Refgene), then a gene on the reverse strand would be 5' right and 3' left.

written 14 months ago by Ian
 

@Ian fair enough :-)

written 14 months ago by Pierre Lindenbaum
 

I actually have a post it note on my cubicle wall with a little picture of genes on each strand and 5' and 3' with little arrows.

written 12 months ago by Madelaine Gogol
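For the strand discussion in this thread, a reverse-complement one-liner is a handy companion: a sequence read 5' to 3' on the reverse strand is the reverse complement of the forward-strand text. A small sketch, assuming plain A/C/G/T input and that the `rev` utility is installed; the function name `revcomp` is illustrative.

```shell
# Reverse-complement a DNA string (A/C/G/T only; ambiguity codes are left untouched).
# revcomp is a hypothetical helper name.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp ATGC   # prints GCAT
```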
 
 
15
 
 

Not completely bioinformatics oriented, but some things I've found handy.

#subtract a small file from a bigger file
grep -vf filesmall filebig

#use awk to rearrange columns
awk '{print $2 " " $1}' file.txt

#sort a bed file by chrom, position
sort -k1,1 -k2,2n file.bed > file.sort.bed

#strip header
tail -n +2 file > file.nh

#find and replace over multiple files
perl -pi -w -e 's/255,165,0/255,69,0/g' *.wig

#print line 83 from a file
sed -n '83p' file

#insert a header line
sed -i -e '1itrack name=test type=bedGraph' file.bed

#sum column one from a file
awk '{s+=$1} END {print s}' mydatafile
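One caveat with the "subtract" one-liner in this list: `grep -f` treats each line of the small file as a regular expression and matches substrings, so `chr1` also removes `chr10`. A sketch of the difference, using throwaway files in a temporary directory (the file contents are invented for the demo):

```shell
tmp=$(mktemp -d)
printf 'chr1\nchr10\nchr2\n' > "$tmp/filebig"
printf 'chr1\n' > "$tmp/filesmall"

# Pattern matching: "chr1" also matches "chr10", so only chr2 survives.
grep -vf "$tmp/filesmall" "$tmp/filebig"

# -F = fixed strings, -x = whole-line match: chr10 and chr2 survive.
grep -vFxf "$tmp/filesmall" "$tmp/filebig"
```

Whether substring or whole-line matching is wanted depends on the files, so this is a variation to consider rather than a replacement.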
 
 
 
1

one of my awk aliases is the mean and sd of column 1:

awk '{s+=$1; s2+=($1*$1)} END {print s/NR, sqrt((NR*s2-s*s)/(NR*(NR-1)))}'

written 6 months ago by Chris Penkett
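The mean and sample standard deviation of column 1 can also be wrapped in a shell function. This is a sketch using the usual computational formula, sd = sqrt((n·Σx² − (Σx)²) / (n(n−1))); the name `mean_sd` and the two-decimal output format are choices made here, not part of the comment above.

```shell
# Mean and sample standard deviation of column 1 (needs at least two rows).
# mean_sd is a hypothetical helper name.
mean_sd() {
    awk '{ s += $1; s2 += $1 * $1 }
         END { if (NR > 1) printf "%.2f %.2f\n", s / NR, sqrt((NR * s2 - s * s) / (NR * (NR - 1))) }' "$@"
}

printf '1\n2\n3\n4\n5\n' | mean_sd   # prints 3.00 1.58
```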
 

Oh, that could be useful too, thanks. awk is still a dark interesting rabbit hole to me.

written 6 months ago by Madelaine Gogol
 
 
14
 
 

I have a vision of this cheat sheet being an extensive, very convenient set of environment variables and man pages. It should be versioned and should be on something like GitHub.

For example:

####################
# HG19
####################
CHR1_SIZE=249250621
CHR2_SIZE=243199373
...

####################
# Shortcuts
####################
SUMCOL='awk '\''{ SUM += $1} END { print SUM}'\'

Other informational stats should be rolled into "man" entries. For example,

man dna
man iupac
man 2_powers
man log_examples

This may be utterly harebrained, but it seems useful to me. A community-based, focused wikipedia and shortcut library on the command line.
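As a rough sketch of how such a versioned file might be used: the variables could live in a file that is sourced from `~/.bashrc`, after which they are available for quick arithmetic. The file name `cheatsheet.sh` is hypothetical; the two chromosome lengths are the hg19 values quoted above.

```shell
# cheatsheet.sh (hypothetical file name), meant to be sourced from ~/.bashrc
export CHR1_SIZE=249250621   # hg19 chr1 length, as quoted in the post
export CHR2_SIZE=243199373   # hg19 chr2 length, as quoted in the post

# Example use: combined length of chr1 and chr2 in bp.
echo $(( CHR1_SIZE + CHR2_SIZE ))   # prints 492449994
```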

 
 
 

Dotfiles can be intensely personal things. That said, I'd love to have a big central repository of useful stuff to pick and choose from.

written 14 months ago by Chris Miller
 

Fair point, yeah a repo that is organized by type would be more useful.

written 14 months ago by Aaronquinlan
 
1

+1 for very clever idea. I like this a lot.

written 14 months ago by Casey Bergman
 

Eh, shortcuts isn't a cheat sheet, it's a .bashrc file. So:

sumcol() { awk '{ SUM += $1 } END { print SUM }' "$@"; }

But +1 for the manual pages suggestion, one of the man pages I constantly return to is 'man ascii'. Can we create a 'b' section?

written 13 months ago by Ketil
 
 
8
 
 

missing from the list

  • 1 nucleosome = 147bp
  • Crude amino acid count to kilodalton conversion: number of AAs × 0.11 = kDa
  • Perl one-liners for text conversion:
  • s/\015\012/\012/ # Windows -> Unix
  • s/\012/\015\012/ # Unix -> Windows
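The crude mass rule in this list (about 0.11 kDa per residue) is easy to wrap for quick use. A minimal sketch; the function name `aa_to_kda` is illustrative and the result is only an approximation of a real protein's mass.

```shell
# Rough protein mass in kDa from residue count, using ~0.11 kDa per amino acid.
# aa_to_kda is a hypothetical helper name.
aa_to_kda() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 0.11 }'
}

aa_to_kda 300   # prints 33.0
```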
 
 
 
1

or perl -pi -e 's/\r\n/\n/g' input.file

written 8 months ago by Ying W
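Where Perl is not at hand, `tr` can do the Windows-to-Unix direction on a stream. Note that this sketch deletes every carriage return, not only those directly before a newline, which is usually fine for sequence and tabular files; `strip_cr` is a made-up name.

```shell
# Remove all carriage returns from stdin (Windows -> Unix line endings).
# strip_cr is a hypothetical helper name.
strip_cr() {
    tr -d '\r'
}

printf 'ACGT\r\nTTGA\r\n' | strip_cr
```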
 
 
5
 
 

This reminds me a little bit of BioNumbers: http://bionumbers.hms.harvard.edu

 
 
 
 
4
 
 

My cheat sheet would contain the length of the human chromosomes.

 
 
 
5

For which assembly? :-P

written 14 months ago by Chris Miller
 
 
4
 
 
  • AUG = Initiation

  • UAA, UGA, UAG= Termination

  • AT = 2 hydrogen bonds, GC = 3 hydrogen bonds, and adjacent bases are separated by 3.4 Å

  • Purines = adenine & guanine; pyrimidines = cytosine, uracil & thymine

  • DNA replication is semi-conservative

More coming... :D
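The termination codons listed above can be put to work in a small scan. The sketch below reads a sequence in frame 1, accepts DNA or RNA, and reports the 1-based position of the first in-frame stop codon, or 0 if none is found; the function name `first_stop` and the output convention are assumptions made for this example.

```shell
# Position (1-based) of the first in-frame stop codon in frame 1, or 0 if none.
# first_stop is a hypothetical helper name.
first_stop() {
    printf '%s\n' "$1" | awk '{
        s = toupper($0)
        gsub(/T/, "U", s)                       # accept DNA or RNA input
        for (i = 1; i + 2 <= length(s); i += 3) {
            c = substr(s, i, 3)
            if (c == "UAA" || c == "UGA" || c == "UAG") { print i; exit }
        }
        print 0
    }'
}

first_stop AUGGCCUAA   # prints 7
```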

 
 
 
 
4
 
 

The cheat sheet for programming in R would be what you are looking for.

Here are good manuals that my advisor, Thomas Girke, wrote:

The HT Sequence Analysis manual was also recommended in http://biostar.stackexchange.com/questions/539/recommend-your-favorite-introductory-r-in-bioinformatics-books-and-resources

 
 
 
 
4
 
 

Correspondence between genome version nomenclatures: hg19 (UCSC) = GRCh37 (NCBI)

 
 
 
 
3
 
 

I'll start off with a few of my own:

  • an alpha-helix has 3.6 residues per turn

  • A haploid human genome has a little over 3 billion bases and contains around 20,000 genes

  • A handy alias for summing up a column of numbers from the command line:

    alias sumcol='awk '\''{ SUM += $1 } END { print SUM }'\'

 
 
 
 
3
 
 

I would like a cheat sheet of arguments for common bioinformatics executables (e.g. blast, clustal, bowtie, bwa, fastx-toolkit), the popular bioperl scripts (like bp_seqfeature_load.pl), as well as the most common bioinformatics things in bash (e.g. mass renaming: for f in *.fasta; do mv "$f" "$(echo "$f" | sed -e 's/\.fasta$/.fa/')"; done)

 
 
 

I find running the executable without arguments usually reminds me what they are ;-)

written 14 months ago by Neilfws
 
3

Check out the 'rename' program that comes with perl; it is so much better. In this case:

rename 's/fasta$/fa/' *fasta

(I assume it comes with perl, as it was written by Larry Wall; it is standard on all the latest Ubuntu systems.)

written 14 months ago by Alastair Kerr
 

Mass renaming is fun until you have to do it on someone else's directory and the file names are full of spaces and accents...

written 14 months ago by Eric Normandeau
 

With bash, you can use pattern substitution: for f in *.fasta ; do mv "$f" "${f/fasta/fa}" ; done. It is more than twice as fast as calling sed (on a set of 1000 files).

written 4 months ago by Frédéric Mahé
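For the file names with spaces mentioned in this thread, quoting the variable and stripping the suffix with `${f%.fasta}` keeps the rename safe. A sketch under the assumption that only the trailing `.fasta` should change; the function name `rename_fasta` and its directory argument are illustrative.

```shell
# Rename every *.fasta file in a directory to *.fa, tolerating spaces in names.
# rename_fasta is a hypothetical helper name.
rename_fasta() {
    dir=$1
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue            # skip if the glob matched nothing
        mv -- "$f" "${f%.fasta}.fa"        # strip only the trailing suffix
    done
}
```

Called as `rename_fasta .`, it would act on the current directory.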
 
 
3
 
 

building up my list here.. a blog post would be a good record for myself when i change computers or move office where i usually lose my printed copies.

http://kevin-gattaca.blogspot.com/2011/03/cheat-sheets-galore-bioinformatics.html

 
 
 
 
3
 
 

Very cool question, here's mine, which probably isn't all that relevant to most biostar members but is popular in our lab:

A table of nucleotide substitution models, and how to set them in the most commonly used programs

Still working on it (you can implement the exotic models in most of the software, but not easily)

 
 
 
 
3
 
 

My cheat sheet would be

Amino acid structures with their properties

And I would also consider Biostar because it is in fact more than google.

 
 
 
 
2
 
 

Great ideas! I'd add a Blosum62 substitution matrix to the list.

 
 
 
 
2
 
 
  • Amino acid weights, IEPs
  • some FASTA statistics one-liners
  • quick overview of possible cli BLAST inputs/outputs (reading -help takes so long as they are all over the place)
  • BLAST tabular output column names
  • Karlin-Altschul formula
  • definition of PAM and BLOSUM
  • order of AAs in a substitution matrix/PSSM
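As one possible "FASTA statistics one-liner" for this list, the sketch below counts sequences and total residues, treating any line starting with `>` as a header. The function name `fasta_stats` and the output wording are invented for the example.

```shell
# Count sequences and total residues in a FASTA file or stream.
# fasta_stats is a hypothetical helper name.
fasta_stats() {
    awk '/^>/ { n++; next }               # header line: one more sequence
         { len += length($0) }            # sequence line: add its length
         END { printf "%d sequences, %d residues\n", n, len }' "$@"
}

printf '>a\nACGT\nAC\n>b\nGG\n' | fasta_stats   # prints 2 sequences, 8 residues
```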
 
 
 
 
2
 
 

List of most used file formats (.pdb, .bam, .fastq, etc etc), what information they contain, and what they can be used for? (and possibly the most well-known software(s) that reads them)

 
 
 
 
2
 
 

great idea!

a bit of biology:

the citric acid cycle! http://student.ccbcmd.edu/~gkaiser/biotutorials/cellresp/images/u4fg35.jpg

or here you can find it also along other must-knows http://www.dummies.com/how-to/content/molecular-cell-biology-for-dummies-cheat-sheet.html

For R & Regex I already have separate cheat sheets on my desk. One thing I am missing, though, is a Regex cheat sheet that shows which characters have to be escaped in which environment, and how backreferences are written (\ or $).

 
 
 

i really like this website for regex http://www.sarand.com/td/ref_perl_pattern.html

written 8 months ago by Ying W
 
 
1
 
 

I like this question, so at the risk of sounding trite: my cheat sheet = a Google search. I store very little information these days; it's as quick and easy to search for it as and when required.

 
 
 

3

that has an uneven success rate. which google query will lead you directly to the answer of, for example, what is the percentage of the human genome contained in transcription units?

written 14 months ago by Jeremy Leipzig
 

1

Did you try this one? "human genome percentage transcribed". It gives this as the first hit: http://bionumbers.hms.harvard.edu/bionumber.aspx?s=y&id=103746&ver=2. That is a nice bonus, but the second hit, http://www.genome.gov/25521554, tells you what you asked for (1.5-2%).

written 14 months ago by Chris Evelo
 
1

"At present, about one-third of the human genome appears to be transcribed" http://bit.ly/ga2YFU. Just the amount of surfing I had to do, and still not find that number, is evidence enough that a genomics cheat sheet would be a handy thing.

written 14 months ago by Jeremy Leipzig
 
 
1
 
 
  • No. seqs(fasta):

    grep \> file_name | wc -l

 
 
 
2

there's even a shorter solution : grep -c ">" filename :-)

written 13 months ago by Pierre Lindenbaum
 

"grep > file_name" will just truncate your file. needs quotes around the ">" as per @Pierre.

written 13 months ago by brentp
 

the editor seems to escape my > so had to write \>

written 13 months ago by Biorelated
 

grep -c "^>" your_fasta (the ">" sign has to be the first character on the line)

written 13 months ago by Darked89
 
1

@Pierre, I always liked the piped version more. It's only 3 or 5 symbols longer, but it is easy to swap wc with less or another grep à la LEGO.

written 13 months ago by Aleksandr Levchuk
 
 
1
 
 

Biology by the numbers www.rpgroup.caltech.edu/publications/SnapShot2010.pdf

 
 
 
 
1
 
 

A useful addition would be a landscape or flowchart of how to get from one file (format) to the next... bioinformatics is about parsing it right... :)

 
 
 

Interesting idea, but off the top of my head I can only think of fastq (1) -> bam (2) -> vcf (3) -> annotated SNP list, where the path taken depends on the software used to (1) map/align, (2) call SNPs, etc. Are there file formats that you are thinking about?

written 12 weeks ago by Kevin
 

Maybe some simple ones are the interconversion of fastq and fasta+qual; fastq (Solexa quality) to fastq (Sanger quality) and so forth; conversion of annotation files like EMBL and GBK into each other and/or GFF; conversion of all sorts of IDs (but there are some good tools for that)... and maybe some more.

written 11 weeks ago by ALchEmiXt
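The simplest of these conversions, fastq to fasta (dropping qualities), can be sketched in awk. It assumes strict four-line FASTQ records with no wrapped sequence lines; the function name `fastq_to_fasta` is illustrative, and dedicated tools are a safer choice for real data.

```shell
# Convert 4-line-per-record FASTQ to FASTA, discarding quality strings.
# fastq_to_fasta is a hypothetical helper name; wrapped records are not handled.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # @name -> >name
         NR % 4 == 2 { print }' "$@"               # sequence line as-is
}

printf '@read1\nACGT\n+\nIIII\n' | fastq_to_fasta
```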
 