Forum: Bioinformatics "Cheat Sheet"
24
110
13.1 years ago

Inspired by Keith Robison's post on 'cheat sheets', what would you put on a cheat sheet for bioinformatics? This might include one-line scripts, conversion factors, handy rules of thumb, etc.

Some of Keith's suggestions, which have a biology slant:

  • IUPAC ambiguity codes for nucleotides.
  • Amino acid single letter codes.
  • SI prefixes in order.
  • Powers of 2.
  • Tm calculation estimation using G+C and A+T counts.
  • 1 diploid human genome ~= 7 pg of DNA
  • 1 bp ~= 660 daltons (average for double-stranded DNA)
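
The Tm item in the list above is often done with the Wallace rule, Tm ≈ 2 °C × (A+T) + 4 °C × (G+C), which is only a rough guide for short oligos (roughly under 14 nt). A minimal sketch, assuming that rule is what is meant; the function name `tm_wallace` is made up for illustration:

```shell
# Wallace-rule melting temperature estimate for a short oligo.
# tm_wallace is a hypothetical helper name, not an existing tool.
tm_wallace() {
    seq=$(printf '%s' "$1" | tr 'acgt' 'ACGT')        # normalise case
    at=$(printf '%s' "$seq" | tr -cd 'AT' | wc -c)    # count A and T
    gc=$(printf '%s' "$seq" | tr -cd 'GC' | wc -c)    # count G and C
    echo $(( 2 * at + 4 * gc ))                       # degrees Celsius
}

tm_wallace ACGTACGTAC    # 5 A/T and 5 G/C, so this prints 30
```

Ambiguity codes and any other non-ACGT characters are simply ignored by this sketch.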
cheat sheet • 20k views
ADD COMMENT
3

Could you please collect the answers and put them on a cheat sheet blog somewhere?

ADD REPLY
1

Instead of a blog post, maybe a GitHub repo with Markdown/LaTeX would be better?

ADD REPLY
0

And/or incorporating answers here would be nice.

ADD REPLY
0

Hey! Brilliant idea to have a cheat sheet, but this list will be endless unless you add subcategories, such as cheat sheets for researchers working in bioalgorithm development, genomics, data analysis, etc. That would make it more organised.

ADD REPLY
42
13.1 years ago
  • 5' : left
  • 3' : right

;-)

ADD COMMENT
10

-1 my apologies to Pierre as my objection is rather pedantic; if you are looking at coordinates relative to the forward strand (e.g. Refgene), then a gene on the reverse strand would be 5' right and 3' left.

ADD REPLY
8

+1 for the smile while reading ;)

ADD REPLY
1

I actually have a post it note on my cubicle wall with a little picture of genes on each strand and 5' and 3' with little arrows.

ADD REPLY
0

@Ian fair enough :-)

ADD REPLY
26
13.0 years ago

Not completely bioinformatics oriented, but some things I've found handy.

# subtract a small file from a bigger file (-F: fixed strings, -x: whole-line matches)
grep -vFxf filesmall filebig

# use awk to rearrange columns
awk '{print $2 " " $1}' file.txt

# sort a bed file by chrom, position
sort -k1,1 -k2,2n file.bed > file.sort.bed

# strip header
tail -n +2 file > file.nh

# find and replace over multiple files
perl -pi -w -e 's/255,165,0/255,69,0/g' *.wig

# print line 83 from a file
sed -n '83p' file.txt

# insert a header line (GNU sed)
sed -i -e '1i track name=test type=bedGraph' file.bed

# sum column one from a file
awk '{s+=$1} END {print s}' mydatafile
ADD COMMENT
2

One of my awk aliases is the mean and sd of column 1:

awk '{s+=$1;s2+=($1*$1)} END {print s/NR,sqrt((NR*s2-s*s)/(NR*(NR-1)))}'
ADD REPLY
1

Oh, that could be useful too, thanks. awk is still a dark interesting rabbit hole to me.

ADD REPLY
18
13.1 years ago

I have a vision of this cheat sheet being an extensive, very convenient set of environment variables and man pages. It should be versioned and should be on something like GitHub.

For example:

####################
# HG19
####################
CHR1_SIZE=249250621
CHR2_SIZE=243199373
...

####################
# Shortcuts
####################
SUMCOL='awk '\''{ SUM += $1} END { print SUM}'\'

Other informational stats should be rolled into "man" entries. For example,

man dna
man iupac
man 2_powers
man log_examples

This may be utterly harebrained, but it seems useful to me. A community-based, focused wikipedia and shortcut library on the command line.

ADD COMMENT
3

Hi Aaron, I started this today :-) https://github.com/lindenb/bioman

ADD REPLY
2

+1 for very clever idea. I like this a lot.

ADD REPLY
0

Dotfiles can be intensely personal things. That said, I'd love to have a big central repository of useful stuff to pick and choose from.

ADD REPLY
0

Fair point, yeah a repo that is organized by type would be more useful.

ADD REPLY
0

Eh, shortcuts isn't a cheat sheet, it's a .bashrc file. So:

sumcol(){
   awk '{SUM += $1} END { print SUM }'
}

But +1 for the manual pages suggestion, one of the man pages I constantly return to is man ascii. Can we create a b section?

ADD REPLY
8
13.1 years ago

missing from the list

  • 1 nucleosome = 147 bp
  • Crude amino acid count to kilodalton conversion: number of residues × 0.11 = kDa
  • Perl one-liners for text conversion:
  • s/\015\012/\012/ # Windows -> Unix
  • s/\012/\015\012/ # Unix -> Windows
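
The crude mass conversion above amounts to about 110 Da per residue. A quick sketch of it as a shell function (the name `aa_to_kda` is an assumption for illustration, not an existing tool):

```shell
# Rough protein mass from residue count, using ~0.11 kDa per amino acid.
# aa_to_kda is a hypothetical helper name.
aa_to_kda() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 0.11 }'
}

aa_to_kda 300    # a 300-residue protein comes out at roughly 33 kDa
```

This is only a rule of thumb; real masses depend on composition and modifications.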
ADD COMMENT
1

or

perl -pi -e 's/\r\n/\n/g' input.file
ADD REPLY
8
10.0 years ago

I have a collection of handpicked reference cards. It helps every now and then.

I prefer to call it a Bioinformatician's Pocket Reference!!

ADD COMMENT
7
13.1 years ago

Correspondence between genome version nomenclatures: hg19 (UCSC) = GRCh37 (NCBI)

ADD COMMENT
0

The UCSC Assembly Releases and Versions FAQ does a great job of summarizing a lot of these. Each genome build in the table lists: species, UCSC version, release date, release name/id, and status.

ADD REPLY
0

...except for chrM/MT where UCSC have a different sequence than the accepted correct one.

ADD REPLY
0

Warning: there is a small difference between hg19 and GRCh37 that has a significant effect on downstream analysis:

in GRCh37, the chromosome names are 1, 2, 3, 4, 5, 6, 7, 8, 9, ..., X, Y

in hg19, the chromosome names are chr1, chr2, chr3, chr4, ..., chrX, chrY

So results mapped to hg19 cannot be used directly with GRCh37.

Hope others can avoid the trap I fell into.
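
For tab-separated files such as BED, the naming difference can be patched with a small filter. This is a sketch only: the function name `ucsc_to_grch37` is hypothetical, it assumes the chromosome name is in column 1, and it only renames. It does not address the chrM/MT sequence difference mentioned earlier in this thread, so mitochondrial coordinates still need care.

```shell
# Rewrite hg19-style names (chr1, chrX, chrM) in column 1 to
# GRCh37-style names (1, X, MT). Reads stdin, writes stdout.
# ucsc_to_grch37 is a hypothetical helper name.
ucsc_to_grch37() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             if ($1 == "chrM") $1 = "MT"      # mitochondrion is named differently
             else sub(/^chr/, "", $1)         # drop the chr prefix
             print
         }'
}

printf 'chr1\t100\t200\nchrM\t5\t50\n' | ucsc_to_grch37
```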

ADD REPLY
0

and some degenerate bases have been replaced by 'N' for chr3 and chrY. see: http://plindenbaum.blogspot.fr/2013/07/g1kv37-vs-hg19.html

ADD REPLY
6
13.1 years ago
Mary 11k

This reminds me a little bit of BioNumbers: http://bionumbers.hms.harvard.edu

ADD COMMENT
6
13.1 years ago
Thaman ★ 3.3k
  • AUG = Initiation

  • UAA, UGA, UAG = Termination

  • A-T = 2 hydrogen bonds, G-C = 3 hydrogen bonds; adjacent base pairs are separated by 3.4 Å

  • Purines = Adenine & Guanine; Pyrimidines = Cytosine, Uracil & Thymine

  • DNA replication is semi-conservative

More coming.... :D

ADD COMMENT
5
13.1 years ago
  • Amino acid weights, IEPs
  • some FASTA statistics one-liners
  • quick overview of possible cli BLAST inputs/outputs (reading -help takes so long as they are all over the place)
  • BLAST tabular output column names
  • Karlin-Altschul formula
  • definition of PAM and BLOSUM
  • order of AAs in a substitution matrix/PSSM
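
As one possible take on the FASTA statistics item above, here is a sketch that reports the number of sequences and the total number of residues. The function name `fasta_stats` and the tab-separated output layout are assumptions made for illustration.

```shell
# Print "<number of sequences><TAB><total residues>" for FASTA input.
# Reads the files given as arguments, or stdin if none are given.
# fasta_stats is a hypothetical helper name.
fasta_stats() {
    awk '/^>/ { n++; next }                       # header line: count it
         { gsub(/[ \t\r]/, ""); total += length($0) }   # sequence line
         END { printf "%d\t%d\n", n, total }' "$@"
}

printf '>a\nACGT\nAC\n>b\nGG\n' | fasta_stats
```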
ADD COMMENT
5
13.1 years ago

The cheat sheet for programming in R would be what you are looking for.

Here are good manuals that my advisor, Thomas Girke, wrote:

The HT Sequence Analysis manual was recommended in Recommend Your Favorite Introductory "R In Bioinformatics" Books And Resources

ADD COMMENT
4
13.1 years ago

I'll start off with a few of my own:

  • an alpha-helix has 3.6 residues per turn

  • A haploid human genome has a little over 3 billion bases and contains around 20,000 genes

  • A handy alias for summing up a column of numbers from the command line:

    alias sumcol='awk '\''{ SUM += $1} END { print SUM}'\'

ADD COMMENT
4
13.1 years ago

I would like a cheat sheet of arguments for common bioinformatics executables (e.g. blast, clustal, bowtie, bwa, fastx-toolkit), the popular bioperl scripts (like bp_seqfeature_load.pl), as well as the most common bioinformatics things in bash (e.g. mass renaming: for f in *.fasta; do mv "$f" "$(echo "$f" | sed -e 's/\.fasta$/.fa/')"; done)

ADD COMMENT
4

Check out the 'rename' program that comes with Perl; it is so much better. In this case

rename 's/fasta$/fa/' *fasta

(I assume it comes with Perl as it was written by Larry Wall; it is standard on all the latest Ubuntu systems)

ADD REPLY
1

With bash, you can use pattern substitution: for f in *.fasta ; do mv "$f" "${f/%.fasta/.fa}" ; done. It is more than twice as fast as calling sed (on a set of 1000 files).

ADD REPLY
0

I find running the executable without arguments usually reminds me what they are ;-)

ADD REPLY
0

Mass renaming is fun until you have to do it on someone else's directory and the file names are full of spaces and accents...
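
Quoting every expansion goes a long way towards surviving such file names. A sketch of the same .fasta to .fa rename, written as a function (the name `rename_fasta_to_fa` is hypothetical) that takes the directory as an optional argument:

```shell
# Rename every *.fasta in a directory to *.fa, tolerating spaces
# and non-ASCII characters in the names.
# rename_fasta_to_fa is a hypothetical helper name.
rename_fasta_to_fa() {
    dir=${1:-.}                          # default to the current directory
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue          # the glob matched nothing
        mv -- "$f" "${f%.fasta}.fa"      # strip the suffix, append .fa
    done
}

# usage: rename_fasta_to_fa /path/to/someone/elses/directory
```

It uses the `${f%.fasta}` suffix-stripping expansion, so no sed or external rename program is involved.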

ADD REPLY
4
13.1 years ago

My cheat sheet would contain the length of the human chromosomes.

ADD COMMENT
8

For which assembly? :-P
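
One way around the assembly question is to compute the lengths from whichever FASTA file you are actually using. A sketch, where the function name `chrom_sizes` is an assumption and the sequence name is taken as the first word of each header:

```shell
# Print "<name><TAB><length>" for every sequence in a FASTA file.
# Reads the files given as arguments, or stdin if none are given.
# chrom_sizes is a hypothetical helper name.
chrom_sizes() {
    awk '{ sub(/\r$/, "") }                              # tolerate Windows line ends
         /^>/ { if (seen) print name "\t" len            # flush previous record
                name = substr($1, 2); len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print name "\t" len }' "$@"
}

printf '>chr1 test\nACGT\nAC\n>chr2\nG\n' | chrom_sizes
```

If a samtools `.fai` index already exists for the file, its first two columns hold the same name and length table.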

ADD REPLY
4
13.1 years ago
Kevin ▴ 640

building up my list here.. a blog post would be a good record for myself when i change computers or move office where i usually lose my printed copies.

http://kevin-gattaca.blogspot.com/2011/03/cheat-sheets-galore-bioinformatics.html

ADD COMMENT
3
13.1 years ago
David W 4.9k

Very cool question, here's mine, which probably isn't all that relevant to most biostar members but is popular in our lab:

A table of nucleotide substitution models, and how to set them in the most commonly used programs

Still working on it (you can implement the exotic models in most of the software, but not easily)

ADD COMMENT
3
13.1 years ago
Pals ★ 1.3k

My cheat sheet would be

Amino acid structures with their properties

And I would also consider Biostar because it is in fact more than google.

ADD COMMENT
3
12.6 years ago
ALchEmiXt ★ 1.9k

A useful addition would be a landscape or flowchart of how to get from one file (format) into the next... bioinformatics is about parsing it right... :)

ADD COMMENT
0

interesting idea... but off the top of my head I can only think of fastq (1) -> bam (2) -> vcf (3) -> annotated SNP list, where the path taken depends on the software used to (1) map/align, (2) call SNPs, etc. Are there file formats that you are thinking about?

ADD REPLY
0

Some simple ones might be the interconversion of fastq and fasta+qual; fastq (Solexa qualities) to fastq (Sanger, and so forth); conversion of annotation files like EMBL and GBK into each other and/or GFF; conversion of all sorts of IDs (but there are some good tools for that)... and maybe some more...
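
The simplest of these, fastq to fasta, can be sketched in awk. This assumes strict four-line FASTQ records (no wrapped sequence or quality lines) and drops the qualities entirely; the function name `fastq_to_fasta` is made up for illustration.

```shell
# Convert four-line FASTQ records to FASTA, discarding qualities.
# Reads the files given as arguments, or stdin if none are given.
# fastq_to_fasta is a hypothetical helper name.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # @name line becomes >name
         NR % 4 == 2 { print }                     # sequence line' "$@"
}

printf '@r1\nACGT\n+\nIIII\n' | fastq_to_fasta
```

Because it counts lines globally, it should be fed one file (or one stream) at a time.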

ADD REPLY
2
13.1 years ago
Neilfws 49k

I like this question, so at the risk of sounding trite: my cheat sheet = a Google search. I store very little information these days; it's as quick and easy to search for it as and when required.

ADD COMMENT
3

that has an uneven success rate. which google query will lead you directly to the answer of, for example, what is the percentage of the human genome contained in transcription units?

ADD REPLY
2

"At present, about one-third of the human genome appears to be transcribed" http://bit.ly/ga2YFU Just the amount of surfing I had to do, and still not find that number, is evidence enough that a genomics cheat sheet would be a handy thing

ADD REPLY
1

Did you try this one? "human genome percentage transcribed" It gives this as the first hit: http://bionumbers.hms.harvard.edu/bionumber.aspx?s=y&id=103746&ver=2 That is a nice bonus, but the second hit: http://www.genome.gov/25521554 tells you what you asked for (1.5-2%)

ADD REPLY
2
13.1 years ago
Paige ▴ 40

Great ideas! I'd add a Blosum62 substitution matrix to the list.

ADD COMMENT
2
13.1 years ago
Samuel Lampa ★ 1.3k

List of most used file formats (.pdb, .bam, .fastq, etc etc), what information they contain, and what they can be used for? (and possibly the most well-known software(s) that reads them)

ADD COMMENT
2
13.1 years ago
Michi ▴ 990

great idea!

a bit of biology:

the citric acid cycle! http://student.ccbcmd.edu/~gkaiser/biotutorials/cellresp/images/u4fg35.jpg

or here you can also find it alongside other must-knows

for R & Regex I already have separate cheatsheets on my desk. One thing I am missing, though, is a cheatsheet for Regex that covers which characters have to be escaped in which environment, and how back-references are written (\ or $)

ADD COMMENT
0

i really like this website for regex http://www.sarand.com/td/ref_perl_pattern.html

ADD REPLY
2
13.0 years ago
Goldbear ▴ 130

Biology by the numbers

http://www.rpgroup.caltech.edu/publications/SnapShot2010.pdf

ADD COMMENT
1
13.1 years ago
hadasa ★ 1.0k
  • No. seqs(fasta):

    grep \> file_name | wc -l
    
ADD COMMENT
2

there's even a shorter solution : grep -c ">" filename :-)

ADD REPLY
1

@Pierre, I always liked the piped version more. It's only 3 or 5 symbols longer, but it is easy to swap wc with less or another grep à la LEGO.

ADD REPLY
0

grep > file_name will just truncate your file. needs quotes around the ">" as per @Pierre.

ADD REPLY
0

the editor seems to escape my > so had to write \>

ADD REPLY
0
grep -c "^>" your_fasta

The > sign has to be the first on the line

ADD REPLY
0
10.0 years ago
Prakki Rama ★ 2.7k
## Tabular BLAST header
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

## go to the end of the file in the vi editor
G (shift + g)

## substitute in the vi editor

:%s/Soxgene/Foxgene/g

## remove exact duplicate sequences from a fasta file (assumes no '#' or '@' in the file; GNU sed)

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t "$(printf '\t')" -f -k 2,2 | sed '/^$/d' | sed -e 's/^/>/' -e 's/\t/\n/'

## remove blank lines
sed '/^$/d' file.fasta > Noblanks_file.fasta.out
ADD COMMENT
