Question: Bioinformatics "Cheat Sheet"
64
6.5 years ago by
Chris Miller18k
Washington University in St. Louis, MO
Chris Miller18k wrote:

Inspired by Keith Robison's post on 'cheat sheets', what would you put on a cheat sheet for bioinformatics? This might include one-line scripts, conversion factors, handy rules of thumb, etc.

Some of Keith's suggestions, which have a biology slant:

  • IUPAC ambiguity codes for nucleotides.
  • Amino acid single letter codes.
  • SI prefixes in order.
  • Powers of 2.
  • Tm calculation estimation using G+C and A+T counts.
  • 1 human genome ~= 7 pg of DNA
  • 1 bp = 660 daltons
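
The Tm item above is usually read as the Wallace rule, Tm ≈ 2 °C × (A+T) + 4 °C × (G+C), which is only a rough estimate for short oligos (roughly 14 to 20 nt). A minimal sketch, assuming that rule is the one meant; the function name tm_wallace is made up for illustration:

```shell
# Rough melting temperature of a short oligo (Wallace rule).
# Assumption: Tm = 2*(A+T) + 4*(G+C); not suitable for long sequences.
tm_wallace() {
    printf '%s\n' "$1" | awk '{
        s  = toupper($0)
        at = gsub(/[AT]/, "", s)   # gsub returns the number of matches removed
        gc = gsub(/[GC]/, "", s)
        print 2 * at + 4 * gc
    }'
}

tm_wallace ACGTACGTACGTACGT   # prints 48
```

Counting with gsub on a copy of the sequence avoids looping over characters and ignores ambiguity codes such as N.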
ADD COMMENTlink modified 11 months ago by jokipokemon00020 • written 6.5 years ago by Chris Miller18k
2

Could you please collect the answers and put them on a cheat sheet blog somewhere?

ADD REPLYlink written 6.5 years ago by Chris Evelo9.9k
1

Instead of a blog post, maybe a GitHub repo with Markdown/LaTeX would be better?

ADD REPLYlink written 6.5 years ago by Piotr Byzia10

And/or incorporating answers here would be nice.

ADD REPLYlink written 6.5 years ago by Michael Schubert6.7k

Hey! Brilliant idea to have a cheat sheet. But this list will go on endlessly unless you give it subcategories, like a cheat sheet for researchers working in bioalgorithm development, genomics, data analysis, etc. This will make it more organised.

ADD REPLYlink written 6.5 years ago by Dataminer2.5k
35
6.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum98k wrote:
  • 5' : left
  • 3' : right

;-)

ADD COMMENTlink modified 5.0 years ago by Istvan Albert ♦♦ 73k • written 6.5 years ago by Pierre Lindenbaum98k
8

-1 my apologies to Pierre as my objection is rather pedantic; if you are looking at coordinates relative to the forward strand (e.g. Refgene), then a gene on the reverse strand would be 5' right and 3' left.

ADD REPLYlink written 6.5 years ago by Ian4.9k
6

+1 for the smile while reading ;)

ADD REPLYlink written 6.5 years ago by Michael Schubert6.7k
1

I actually have a post it note on my cubicle wall with a little picture of genes on each strand and 5' and 3' with little arrows.

ADD REPLYlink written 6.4 years ago by Madelaine Gogol4.8k

@Ian fair enough :-)

ADD REPLYlink written 6.5 years ago by Pierre Lindenbaum98k
26
6.4 years ago by
Madelaine Gogol4.8k
Kansas City
Madelaine Gogol4.8k wrote:

Not completely bioinformatics oriented, but some things I've found handy.

#subtract a small file from a bigger file
grep -vf filesmall filebig

#use awk to rearrange columns
awk '{print $2 " " $1}' file.txt

#sort a bed file by chrom, position
sort -k1,1 -k2,2n file.bed > file.sort.bed

#strip header
tail -n +2 file > file.nh

#find and replace over multiple files
perl -pi -w -e 's/255,165,0/255,69,0/g' *.wig

#print line 83 from a file
sed -n '83p' file.txt

#insert a header line
sed -i -e '1itrack name=test type=bedGraph' file.bed

#sum column one from a file
awk '{s+=$1} END {print s}' mydatafile
ADD COMMENTlink modified 5.9 years ago • written 6.4 years ago by Madelaine Gogol4.8k
2

one of my awk aliases is the mean and sd of column 1:

awk '{s+=$1; s2+=($1*$1)} END {print s/NR, sqrt((NR*s2-s*s)/(NR*(NR-1)))}'

ADD REPLYlink written 5.9 years ago by Chris Penkett470
1

Oh, that could be useful too, thanks. awk is still a dark interesting rabbit hole to me.

ADD REPLYlink written 5.9 years ago by Madelaine Gogol4.8k
18
6.5 years ago by
Aaronquinlan9.9k
United States
Aaronquinlan9.9k wrote:

I have a vision of this cheat sheet being an extensive, very convenient set of environment variables and man pages. It should be versioned and should be on something like GitHub.

For example:

####################
# HG19
####################
CHR1_SIZE=249250621
CHR2_SIZE=243199373
...

####################
# Shortcuts
####################
SUMCOL='awk '\''{ SUM += $1} END { print SUM}'\'

Other informational stats should be rolled into "man" entries. For example,

man dna
man iupac
man 2_powers
man log_examples

This may be utterly harebrained, but it seems useful to me. A community-based, focused wikipedia and shortcut library on the command line.

ADD COMMENTlink written 6.5 years ago by Aaronquinlan9.9k
2

+1 for very clever idea. I like this a lot.

ADD REPLYlink written 6.5 years ago by Casey Bergman17k
2

Hi Aaron, I started this today :-) https://github.com/lindenb/bioman

ADD REPLYlink written 4.4 years ago by Pierre Lindenbaum98k

Dotfiles can be intensely personal things. That said, I'd love to have a big central repository of useful stuff to pick and choose from.

ADD REPLYlink written 6.5 years ago by Chris Miller18k

Fair point, yeah a repo that is organized by type would be more useful.

ADD REPLYlink written 6.5 years ago by Aaronquinlan9.9k

Eh, shortcuts isn't a cheat sheet, it's a .bashrc file. So:

sumcol(){ awk '{SUM += $1} END { print SUM }'; }

But +1 for the manual pages suggestion, one of the man pages I constantly return to is 'man ascii'. Can we create a 'b' section?

ADD REPLYlink written 6.5 years ago by Ketil3.8k
8
6.5 years ago by
Alastair Kerr5.2k
The University of Edinburgh, UK
Alastair Kerr5.2k wrote:

missing from the list

  • 1 nucleosome = 147bp
  • Crude amino acid to kilodalton conversion: number of amino acids × 0.11 ≈ mass in kDa
  • Perl one liners for text conversion
  • s/\015\012/\012/ # Windows -> Unix
  • s/\012/\015\012/ # Unix -> Windows
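
The crude protein mass rule in this list (about 0.11 kDa per residue) is easy to wrap in a helper. A minimal sketch; the function name aa_to_kda is hypothetical, and the factor is only the rough average quoted above:

```shell
# Estimate protein mass in kDa from the number of residues.
# Assumption: ~0.11 kDa (110 Da) per amino acid, a crude average.
aa_to_kda() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 0.11 }'
}

aa_to_kda 300   # prints 33.0
```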
ADD COMMENTlink written 6.5 years ago by Alastair Kerr5.2k
1

or perl -pi -e 's/\r\n/\n/g' input.file

ADD REPLYlink written 6.1 years ago by Ying W3.6k
7
6.5 years ago by
Fred Fleche4.2k
Paris, France
Fred Fleche4.2k wrote:

Correspondence between genome version nomenclatures: hg19 (UCSC) = GRCh37 (NCBI)

ADD COMMENTlink modified 5.0 years ago by Istvan Albert ♦♦ 73k • written 6.5 years ago by Fred Fleche4.2k

The UCSC Assembly Releases and Versions FAQ does a great job of summarizing a lot of these.  Each genome build in the table lists: species, UCSC version, release date, release name/id, and status.

ADD REPLYlink written 3.3 years ago by Malachi Griffith15k

...except for chrM/MT where UCSC have a different sequence than the accepted correct one. 

ADD REPLYlink written 2.5 years ago by Danielk530

Warning: there is a small difference between hg19 and GRCh37 that can significantly affect downstream analysis:

in GRCh37, the chromosome names are 1, 2, 3, 4, 5, 6, 7, 8, 9, ..., X, Y

in hg19, the chromosome names are chr1, chr2, chr3, chr4, ..., chrX, chrY

So results mapped to hg19 cannot be used directly with GRCh37.

Hope others can avoid the trap I fell into.
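
If the only obstacle is the naming, the first column of a tab-delimited file (BED, for example) can be rewritten on the fly. A minimal sketch, with a hypothetical function name ucsc_to_grc; it deliberately renames only chr1 to chr22, chrX and chrY, and leaves chrM and all other contigs untouched because, as noted in this thread, the mitochondrial sequence differs between the two builds:

```shell
# Rewrite UCSC-style names (chr1) to GRC-style names (1) in column 1.
# Assumption: tab-delimited input with the chromosome in the first column.
ucsc_to_grc() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }                           # keep comment lines as-is
         $1 ~ /^chr([0-9]+|X|Y)$/ { sub(/^chr/, "", $1) }
         { print }' "$@"
}

# usage: ucsc_to_grc file.bed > file.grc.bed
```

This only changes labels; it does not make coordinates on differing sequences comparable.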

ADD REPLYlink written 4 days ago by Chen560

and some degenerate bases have been replaced by 'N' for chr3 and chrY. see: http://plindenbaum.blogspot.fr/2013/07/g1kv37-vs-hg19.html

ADD REPLYlink written 4 days ago by Pierre Lindenbaum98k
7
3.4 years ago by
Germany
amolkolte198970 wrote:

I have a collection of handpicked reference cards. It helps every now and then.

I prefer to call it a Bioinformatician's Pocket Reference!!

ADD COMMENTlink written 3.4 years ago by amolkolte198970

Useful. Tweeted :)

ADD REPLYlink written 3.4 years ago by Ian4.9k
6
6.5 years ago by
Thaman3.2k
Finland
Thaman3.2k wrote:
  • AUG = Initiation

  • UAA, UGA, UAG= Termination

  • AT = 2 hydrogen bonds, GC = 3 hydrogen bonds, and adjacent bases are separated by 3.4 Å

  • Purine= Adenine & Guanine AND Pyrimidines= Cytosine, Uracil & Thymine

  • DNA replication is semi-conservative

More coming.... :D

ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by Thaman3.2k
5
6.5 years ago by
Mary11k
Boston MA area
Mary11k wrote:

This reminds me a little bit of BioNumbers: http://bionumbers.hms.harvard.edu

ADD COMMENTlink written 6.5 years ago by Mary11k
5
6.5 years ago by
Cambridge, UK
Michael Schubert6.7k wrote:
  • Amino acid weights, IEPs
  • some FASTA statistics one-liners
  • quick overview of possible cli BLAST inputs/outputs (reading -help takes so long as they are all over the place)
  • BLAST tabular output column names
  • Karlin-Altschul formula
  • definition of PAM and BLOSUM
  • order of AAs in a substitution matrix/PSSM
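
As one example of the "FASTA statistics one-liners" item, here is a minimal sketch that reports sequence count, total bases and GC content. The function name fasta_stats and the output layout are made up for illustration; it assumes plain nucleotide FASTA with Unix line endings:

```shell
# Summarise a FASTA file: number of records, total bases, GC percentage.
# Reads the files given as arguments, or standard input if there are none.
fasta_stats() {
    awk '/^>/ { n++; next }                       # header line: count the record
         { s = toupper($0)
           total += length(s)
           gc    += gsub(/[GC]/, "", s) }         # gsub returns the match count
         END { printf "seqs=%d bases=%d gc=%.1f%%\n",
                      n, total, (total ? 100 * gc / total : 0) }' "$@"
}

# usage: fasta_stats genome.fa
```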
ADD COMMENTlink written 6.5 years ago by Michael Schubert6.7k
5
6.5 years ago by
United States
Aleksandr Levchuk3.0k wrote:

The cheat sheet for programming in R would be what you are looking for.

Here are good manuals that my advisor, Thomas Girke, wrote:

The HT Sequence Analysis manual was recommended in http://biostar.stackexchange.com/questions/539/recommend-your-favorite-introductory-r-in-bioinformatics-books-and-resources

ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by Aleksandr Levchuk3.0k
4
6.5 years ago by
Chris Miller18k
Washington University in St. Louis, MO
Chris Miller18k wrote:

I'll start off with a few of my own:

  • an alpha-helix has 3.6 residues per turn

  • A haploid human genome has a little over 3 billion bases and contains around 20,000 genes

  • A handy alias for summing up a column of numbers from the command line:

    alias sumcol='awk '\''{ SUM += $1} END { print SUM}'\'

ADD COMMENTlink modified 6.5 years ago by Tim320 • written 6.5 years ago by Chris Miller18k
4
6.5 years ago by
Philadelphia, PA
Jeremy Leipzig17k wrote:

I would like a cheat sheet of arguments for common bioinformatics executables (e.g. blast, clustal, bowtie, bwa, fastx-toolkit), the popular bioperl scripts (like bp_seqfeature_load.pl), as well as the most common bioinformatics things in bash (e.g. mass renaming: for f in *fasta; do mv "$f" "$(echo "$f" | sed -e 's/\.fasta$/.fa/')"; done)

ADD COMMENTlink written 6.5 years ago by Jeremy Leipzig17k
4

Check out the 'rename' program that comes with perl; it is so much better. In this case

rename 's/fasta$/fa/' *fasta

(I assume it comes with perl as it was written by Larry Wall- it is standard on all the latest ubuntu systems)

ADD REPLYlink written 6.5 years ago by Alastair Kerr5.2k
1

With bash, you can use pattern substitution: for f in *.fasta ; do mv "$f" "${f/fasta/fa}" ; done. It is more than twice as fast as calling sed (on a set of 1000 files).

ADD REPLYlink written 5.7 years ago by Frédéric Mahé2.6k

I find running the executable without arguments usually reminds me what they are ;-)

ADD REPLYlink written 6.5 years ago by Neilfws47k

Mass renaming is fun until you have to do it on someone else's directory and the file names are full of spaces and accents...
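
Quoting every expansion takes most of the pain out of file names with spaces. A minimal sketch of a rename loop along those lines; the function name rename_fasta_to_fa is hypothetical, and it uses suffix removal (${f%.fasta}) rather than the pattern substitution shown earlier so that only the extension is touched:

```shell
# Rename every *.fasta file in a directory (default: current) to *.fa.
# Quoted expansions keep names containing spaces intact.
rename_fasta_to_fa() {
    dir=${1:-.}
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue          # no match: the glob stays literal, skip it
        mv -- "$f" "${f%.fasta}.fa"
    done
}

# usage: rename_fasta_to_fa "/path/to/someone else's data"
```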

ADD REPLYlink written 6.5 years ago by Eric Normandeau9.5k
4
6.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum98k wrote:

My cheat sheet would contain the length of the human chromosomes.

ADD COMMENTlink written 6.5 years ago by Pierre Lindenbaum98k
8

For which assembly? :-P

ADD REPLYlink written 6.5 years ago by Chris Miller18k
4
6.5 years ago by
Kevin540
Kevin540 wrote:

building up my list here.. a blog post would be a good record for myself when i change computers or move office where i usually lose my printed copies.

http://kevin-gattaca.blogspot.com/2011/03/cheat-sheets-galore-bioinformatics.html

ADD COMMENTlink modified 5.0 years ago by Istvan Albert ♦♦ 73k • written 6.5 years ago by Kevin540
3
6.5 years ago by
David W4.6k
New Zealand
David W4.6k wrote:

Very cool question, here's mine, which probably isn't all that relevant to most biostar members but is popular in our lab:

A table of nucleotide substitution models, and how to set them in the most commonly used programs

Still working on it (you can implement the exotic models in most of the software, but not easily)

ADD COMMENTlink written 6.5 years ago by David W4.6k
3
6.5 years ago by
Pals1.3k
Finland
Pals1.3k wrote:

My cheat sheet would be

Amino acid structures with their properties

And I would also consider Biostar because it is in fact more than google.

ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by Pals1.3k
3
6.1 years ago by
ALchEmiXt1.8k
The Netherlands
ALchEmiXt1.8k wrote:

A useful addition would be a landscape or flowchart how to get from one file (format) into the next..... bioinformatics is about parsing it right....... :)

ADD COMMENTlink written 6.1 years ago by ALchEmiXt1.8k

Interesting idea, but off the top of my head I can only think of fastq (1) -> bam (2) -> vcf (3) -> annotated SNP list, where the path taken depends on the software used to (1) map/align, (2) call SNPs, etc. Are there file formats that you are thinking about?

ADD REPLYlink written 5.6 years ago by Kevin540

Some simple examples might be the interconversion of fastq and fasta+qual; fastq (Solexa quality) to fastq (Sanger quality) and so forth; conversion of annotation files like EMBL and GBK into each other and/or GFF; conversion of all sorts of IDs (but there are some good tools for that)... and maybe some more.
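
The simplest of these conversions, fastq to fasta (dropping the qualities), fits in one awk call. A minimal sketch, assuming strict four-line FASTQ records with no wrapped sequence lines; the function name fastq_to_fasta is made up for illustration:

```shell
# Convert four-line FASTQ records to FASTA, discarding quality strings.
# Line 1 of each record is the @header, line 2 the sequence.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # replace leading @ with >
         NR % 4 == 2 { print }' "$@"
}

# usage: fastq_to_fasta reads.fastq > reads.fasta
```

Keying on the line number rather than on a leading @ matters because @ is also a valid quality character.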

ADD REPLYlink written 5.6 years ago by ALchEmiXt1.8k
2
6.5 years ago by
Neilfws47k
Sydney, Australia
Neilfws47k wrote:

I like this question, so at the risk of sounding trite: my cheat sheet = a Google search. I store very little information these days; it's as quick and easy to search for it as and when required.

ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by Neilfws47k
3

that has an uneven success rate. which google query will lead you directly to the answer of, for example, what is the percentage of the human genome contained in transcription units?

ADD REPLYlink written 6.5 years ago by Jeremy Leipzig17k
2

"At present, about one-third of the human genome appears to be transcribed" http://bit.ly/ga2YFU just the amount of surfing I had to do and still not find that number is evidence enough that a genomics cheat sheet would be handy thing

ADD REPLYlink written 6.5 years ago by Jeremy Leipzig17k
1

Did you try this one? "human genome percentage transcribed" It gives this as the first hit: http://bionumbers.hms.harvard.edu/bionumber.aspx?s=y&id=103746&ver=2 That is a nice bonus, but the second hit: http://www.genome.gov/25521554 tells you what you asked for (1.5-2%)

ADD REPLYlink written 6.5 years ago by Chris Evelo9.9k

2
6.5 years ago by
Paige40
Paige40 wrote:

Great ideas! I'd add a Blosum62 substitution matrix to the list.

ADD COMMENTlink written 6.5 years ago by Paige40
2
6.5 years ago by
Samuel Lampa1.1k
Stockholm
Samuel Lampa1.1k wrote:

List of most used file formats (.pdb, .bam, .fastq, etc etc), what information they contain, and what they can be used for? (and possibly the most well-known software(s) that reads them)

ADD COMMENTlink written 6.5 years ago by Samuel Lampa1.1k
2
6.5 years ago by
Michi880
Barcelona
Michi880 wrote:

great idea!

a bit of biology:

the citric acid cycle! http://student.ccbcmd.edu/~gkaiser/biotutorials/cellresp/images/u4fg35.jpg

or here you can find it also along other must-knows http://www.dummies.com/how-to/content/molecular-cell-biology-for-dummies-cheat-sheet.html

for R & Regex I already have separate cheatsheets on my desk. One thing I am missing, though, is a cheatsheet for Regex that covers which characters have to be escaped in which environment, and how backreferences are written (\ or $)

ADD COMMENTlink written 6.5 years ago by Michi880

i really like this website for regex http://www.sarand.com/td/ref_perl_pattern.html

ADD REPLYlink written 6.1 years ago by Ying W3.6k
2
6.4 years ago by
Goldbear130
Goldbear130 wrote:

Biology by the numbers

http://www.rpgroup.caltech.edu/publications/SnapShot2010.pdf

ADD COMMENTlink modified 5.0 years ago by Istvan Albert ♦♦ 73k • written 6.4 years ago by Goldbear130
1
6.5 years ago by
hadasa1.0k
hadasa1.0k wrote:
  • No. seqs(fasta):

    grep \> file_name | wc -l

ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by hadasa1.0k
2

there's even a shorter solution : grep -c ">" filename :-)

ADD REPLYlink written 6.5 years ago by Pierre Lindenbaum98k
1

@Pierre, I always liked the piped version more. It's only 3 or 5 symbols longer, but it is easy to swap wc with less or another grep à la LEGO.

ADD REPLYlink written 6.5 years ago by Aleksandr Levchuk3.0k

"grep > file_name" will just truncate your file. needs quotes around the ">" as per @Pierre.

ADD REPLYlink written 6.5 years ago by brentp22k

the editor seems to escape my > so had to write \>

ADD REPLYlink written 6.5 years ago by hadasa1.0k

grep -c "^>" your_fasta The ">" sign has to be the first on the line

ADD REPLYlink written 6.5 years ago by Darked894.1k
0
3.4 years ago by
Prakki Rama2.0k
Singapore
Prakki Rama2.0k wrote:
##Tabulated BLAST header
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
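
With the column order above, hits can be filtered by position in awk: column 3 is pident and column 11 is evalue. A minimal sketch; the function name filter_blast and the default thresholds (90 % identity, e-value 1e-5) are arbitrary choices for illustration, and the input is assumed to be the default 12-column tabular output:

```shell
# Keep BLAST tabular hits with pident >= min_id and evalue <= max_e.
# Arguments: min_id (default 90), max_e (default 1e-5); reads standard input.
filter_blast() {
    awk -F '\t' -v min_id="${1:-90}" -v max_e="${2:-1e-5}" \
        '($3 + 0) >= (min_id + 0) && ($11 + 0) <= (max_e + 0)'
}

# usage: filter_blast 95 1e-10 < hits.tsv > hits.filtered.tsv
```

Adding 0 to each value forces a numeric comparison, so e-values written in exponent notation are compared as numbers rather than strings.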

##go to the end of file in Vi editor
G (shift + g ) 

##substitute in Vi editor

:%s/Soxgene/Foxgene/g

##remove exact duplicate sequences from fasta file

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t $'\t' -f -k 2,2 | sed '/^$/d' | sed -e 's/^/>/' -e 's/\t/\n/'

##remove blank lines

sed '/^$/d' file.fasta >Noblanks_file.fasta.out

 

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Prakki Rama2.0k
0
2.5 years ago by
klemen160
Slovenia
klemen160 wrote:

Dear all!

I really like the discussion here and all these hints that make our lives easier. We have prepared a free version of our NGS Data Analysis Cheat Sheet. It includes some procedures for mapping of reads, variant calling, differential expression analysis and more. Here is a quick preview: http://genial.is/eJHsj

In exchange we are kindly asking you to answer a short survey about problems in bioinformatics. It will only take you a few minutes to go through. At the end of the survey you will find the link to the Cheat Sheet.
Thank you!

Get NGS Data Analysis Cheat Sheet now >> http://genial.is/zL0aQ

ADD COMMENTlink written 2.5 years ago by klemen160