Question: What Are Numbers Every Bioinformatician Should Know?
21
6.2 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

There is a well-known set of numbers by Jeff Dean that every computer scientist should know. See http://www.quora.com/What-are-the-numbers-that-every-computer-engineer-should-know-according-to-Jeff-Dean

What are the sequencing related numbers that a bioinformatician should know? For example,

  • Reads from Illumina Hi-Seq 2000 lane: 350 million
  • Number of reads to cover 3.0 Gb of sequence at 30X: 900 million (for 2x100bp)
  • Cost per Hi-Seq 2000 lane: $2,500
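The 900 million figure above follows from simple arithmetic: total bases needed (genome size times coverage) divided by read length. A minimal sketch, assuming every read is exactly 100 bp and every base counts toward coverage; the function names are made up for illustration, and the per-lane yield and price are just the numbers quoted in the question:

```python
import math


def reads_for_coverage(genome_size_bp, coverage, read_length_bp):
    """Reads needed so that total sequenced bases = coverage * genome size."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


def lanes_needed(total_reads, reads_per_lane):
    """Whole lanes required to obtain at least total_reads reads."""
    return math.ceil(total_reads / reads_per_lane)


# Numbers from the question: 3.0 Gb genome, 30X, 100 bp reads,
# 350 million reads per lane, $2,500 per lane.
reads = reads_for_coverage(3_000_000_000, 30, 100)
lanes = lanes_needed(reads, 350_000_000)
print(reads)         # 900000000
print(lanes)         # 3
print(lanes * 2500)  # 7500
```

Real runs lose reads to duplicates, adapters and unmapped sequence, so treat the result as a lower bound.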

If you put your answer in that format, I'll aggregate the best ones.

In addition to whatever you deem important, I'm interested in recommended read-counts for RNA-Seq, ChIP-Seq, file sizes for BAMs and fastq.gz's per number of reads, processing times for aligning to a 3.0 Gb reference for common aligners, etc.

bioinformatics • 6.1k views
ADD COMMENTlink modified 6.2 years ago by Giovanni M Dall'Olio26k • written 6.2 years ago by brentp23k
8

"3": Never write code after 3 beers.

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Aaronquinlan11k
3

Correct. Two is optimal. Relevant xkcd: http://xkcd.com/323/

ADD REPLYlink written 6.2 years ago by Daniel3.7k
1

At this point, there's always a relevant xkcd :)

ADD REPLYlink written 6.2 years ago by Chris Miller21k
2

Might be some relevant numbers here: Bioinformatics "Cheat Sheet"

ADD REPLYlink written 6.2 years ago by Chris Miller21k
1

Might want to use the "cost" price of sequencing technologies, because prices vary significantly according to country/continent/provider etc (cost price for a HiSeq lane is quite a bit lower than $2,500)

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by gammyknee200
1

we should ask this question every 1-2 years, and see how these numbers change

ADD REPLYlink written 6.2 years ago by Giovanni M Dall'Olio26k
1

Is 350 Mreads realistic for HiSeq? My lab usually counts on 200 Mreads, but we also do lots of different kinds of samples, so we can't count on every library behaving completely consistently, which maybe some groups can.

ADD REPLYlink written 6.2 years ago by swbarnes26.9k

I wonder if it refers to single-end reads - I thought Illumina's reference value was a minimum of somewhere around 140M paired end reads?! (even though you sometimes get much more, of course)

ADD REPLYlink written 6.2 years ago by Mikael Huss4.7k

I'll look around and then update. I'm not confident on those numbers. I would call 140M read-pairs 280M reads.

ADD REPLYlink written 6.2 years ago by brentp23k

42, for geeky jokes at parties that nobody finds funny

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Michael Dondrup46k
11
6.2 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

Percentage of everything that's crap: 90.

ADD COMMENTlink written 6.2 years ago by Neilfws48k

Funny. Sturgeon's Law is very quotable, was written 60+ years ago, and about Science Fiction, a genre I always liked, yet I never heard it before :)

ADD REPLYlink written 6.2 years ago by Eric Normandeau10k
7
6.2 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

There is a nice review, published just today, summarizing some numbers about genetic variability in the human genome:

Since there has been a discussion on this matter in this thread, I'll summarize some of these numbers here. The paper contains many references, which I am too lazy to copy here... So, if you want to know more, just read the paper.

It is important to note that there is only slightly more diversity between individuals of two major continental groups than between individuals of the same population. Craig Venter and Jim Watson (both "Caucasian") share fewer SNVs with each other than either of them shares with Seong-Jin Kim (a Korean scientist).

General Chimp/Human differences:

  • number of single-nucleotide differences between chimp and human: ~35 million (+5 million insertion/deletion events)
  • percentage of single-nucleotide changes between chimp and human: 1.23%
  • percentage of single-nucleotide changes that are fixed in chimp or in human: 1.06% over 1.23%

Variability between individuals:

  • number of SNVs in humans: ~65 million
  • on average, a pair of humans is expected to differ at 1 base in every 1,000

Fst, pre-1000 Genomes (Fst is a measure of genetic differentiation; 0 -> no differentiation between individuals; 1 -> highest genetic differentiation):

  • average Fst among major continental groups ranges from 0.05 to 0.13
  • genetic diversity in humans is far lower than in other primates. Genetic variance in humans is only 5-13% of the variance in other primates.

Fst, after 1000 Genomes

  • Fst between African and Europeans: 0.071
  • Fst between African and Asians: 0.083
  • Fst between Asians and Europeans: 0.052
  • Fst in Gorilla populations: 0.38
  • Fst in Chimp: 0.32
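To make the 0-to-1 scale concrete, here is a minimal sketch of the textbook single-locus definition, Fst = (H_T - H_S) / H_T, for a biallelic site. It assumes equally weighted subpopulations and Hardy-Weinberg heterozygosity 2p(1-p); the function names are illustrative only, and the estimators used in the review and in the 1000 Genomes analyses are more involved than this.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic locus with allele frequency p."""
    return 2 * p * (1 - p)


def fst(allele_freqs):
    """Fst = (H_T - H_S) / H_T from one allele's frequency in each subpopulation.

    Assumes all subpopulations carry equal weight (illustrative simplification).
    """
    n = len(allele_freqs)
    p_bar = sum(allele_freqs) / n
    h_total = expected_heterozygosity(p_bar)
    if h_total == 0:
        # The locus is monomorphic overall, so there is nothing to differentiate.
        return 0.0
    h_within = sum(expected_heterozygosity(p) for p in allele_freqs) / n
    return (h_total - h_within) / h_total


print(fst([0.5, 0.5]))  # identical frequencies: 0.0
print(fst([0.0, 1.0]))  # fixed for different alleles: 1.0
print(fst([0.4, 0.6]))  # modest difference: about 0.04
```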

Allele sharing across continents

  • 81.2% of SNVs are present in all continental groups (12.4% if we consider haplotypes instead of SNVs)
  • less than 1% SNVs are specific to a continent (11% if we consider haplotypes instead of SNVs)
  • only 0.06% of SNVs are specific to Eurasia

Haplotype Sharing across continents

  • 2% haplotype blocks restricted to Asia
  • 2% haplotype blocks restricted to Europe
  • 25% haplotype blocks restricted to Africa

Major Genetic Groups

  • According to Rosenberg et al (the Structure software), all human individuals can be classified into 5 continental groups
  • Li et al confirmed the same, on the HGDP panel
  • however, recent attempts failed to confirm this classification, claiming that it may be due to confounding effects.
  • races according to the US census system: 15 plus "other races". I recommend reading this book if you are interested in the question of race in scientific use
ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Giovanni M Dall'Olio26k

Very nice, this could be inserted into the wikipedia article as a reference. Somebody willing to add it?

ADD REPLYlink written 6.2 years ago by Michael Dondrup46k

This is a good write-up, but if you think this means there cannot be substantial and systematic differences due to biology between different human groups you are guilty of Lewontin's fallacy. See http://infoproc.blogspot.no/2008/01/no-scientific-basis-for-race.html

Furthermore, it does not make much sense to compare the difference between groups and within groups. See http://evoandproud.blogspot.no/2011/11/apples-oranges-and-genes.html

That "genetic diversity in humans is far lower than in other primates" might be true, but the story is different in canines. I quote "there is less mtDNA difference between dogs, wolves and coyotes than there is between the various ethnic groups of human beings, which are recognized as belonging to a single species" - Evolution of working dogs J. Serpell (Ed.), The Domestic Dog: Its Evolution, Behaviour and Interactions with People, Cambridge University Press

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Click downvote670

This may be useful for SNV vs SNP reference: SNP, DIP, SNV notation

ADD REPLYlink written 6.2 years ago by Daniel3.7k
6
6.2 years ago by
Philipp Bayer6.5k
Australia/Perth/UWA
Philipp Bayer6.5k wrote:

The PHRED-offsets for all different FASTQ-standards:

  • Sanger has 0-93, using ASCII 33 to 126
  • Solexa/Illumina 1.0 has -5 to 62, using ASCII 59 to 126
  • Solexa/Illumina 1.3 has 0 to 62 using ASCII 64 to 126
  • Solexa/Illumina 1.8 has 0-93, using ASCII 33 to 126 (same as Sanger)

This leads to the question: What's a good/reliable quality score to use as a cut-off? I usually use either 30 or 40 in all formats, but I'm open to suggestions. Of course, how high your cutoff is depends on what you want to do with your data. SNP-scoring needs more reliable bases than genome assembly.
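For anyone weighing a cutoff of 30 versus 40, it helps to see what the scores mean. A minimal sketch, assuming standard Phred scaling (error probability = 10^(-Q/10)) and the ASCII offsets listed above (33 for Sanger/Illumina 1.8, 64 for Illumina 1.3). The function names are made up for illustration. Note that the old Solexa/Illumina 1.0 scores (the ones that go down to -5) use a different score-to-probability relation, so the second function does not apply to them.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to integer scores using the given ASCII offset."""
    return [ord(ch) - offset for ch in quality_string]


def phred_to_error_probability(q):
    """Probability that a base call is wrong for Phred score q."""
    return 10 ** (-q / 10)


print(decode_qualities("!I"))             # Sanger / Illumina 1.8: [0, 40]
print(decode_qualities("@h", offset=64))  # Illumina 1.3: [0, 40]

for q in (20, 30, 40):
    # Q20 is 1 error in 100, Q30 is 1 in 1,000, Q40 is 1 in 10,000
    print(q, phred_to_error_probability(q))
```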

Source

ADD COMMENTlink written 6.2 years ago by Philipp Bayer6.5k
5
6.2 years ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith18k wrote:

Fundamental statistics about the genome/transcriptome of the species being studied. For example, for human:

  • Contig length total 3.2 Gb.
  • Chromosome length total 3.1 Gb.
  • Base Pairs: 3,323,950,079
  • Coding genes: 20,774
  • Non coding genes: 22,493
  • Pseudogenes: 14,145
  • Gene transcripts: 194,846
  • Short Variants (SNPs, indels, somatic mutations): 54,965,377
  • Structural variants: 10,266,123

Ensembl makes these stats available in nice regularly updated reports for many species such as: human, mouse, rat, etc...

ADD COMMENTlink written 6.2 years ago by Malachi Griffith18k
4
6.2 years ago by
United States
Zev.Kronenberg11k wrote:

Just some ramblings:

log10(x) != log(x)
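A quick illustration in Python, where `math.log` is the natural logarithm unless a base is passed (the same trap exists in R and many other languages). The helper `log_in_base` is a made-up name for illustration:

```python
import math


def log_in_base(x, base):
    """Change of base: log_base(x) = ln(x) / ln(base)."""
    return math.log(x) / math.log(base)


x = 100
print(math.log(x))         # natural log, about 4.605
print(math.log10(x))       # base 10: 2.0
print(log_in_base(x, 10))  # agrees with log10 up to floating-point rounding
```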

Know your model organism:

23 human chromosomes

20 mouse chromosomes

~3 million non-reference SNVs per human genome

transition to transversion ratio for human exomes ~ 2

SAM flags

4 - unaligned

12 - both unaligned (mate pair).
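The two flags above are just sums of bits: 4 is "read unmapped" and 12 = 4 + 8 adds "mate unmapped". A minimal sketch that decodes any flag, using the bit values defined in the SAM specification; the dictionary and function names are made up for illustration:

```python
# Bit values as defined in the SAM format specification.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def explain_flag(flag):
    """Return the descriptions of all bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(explain_flag(4))   # ['read unmapped']
print(explain_flag(12))  # ['read unmapped', 'mate unmapped']
```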

ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Zev.Kronenberg11k
1

I guess you're counting haploid number of autosomes + one of X/Y for chromosomes. Not to be confused with "all possible chromosomes": for humans 1-22, X, Y + M = 25.

ADD REPLYlink written 6.2 years ago by Neilfws48k

and you are talking about a diploid organism, for human: (1-22) * 2 + (XX or XY) + M = 47

ADD REPLYlink written 6.2 years ago by JC9.1k
1

Handy tool for explaining all SAM flags and their combinations: http://picard.sourceforge.net/explain-flags.html

ADD REPLYlink written 6.2 years ago by Malachi Griffith18k

Use it almost every day.

ADD REPLYlink written 6.2 years ago by Zev.Kronenberg11k
4
6.2 years ago by
Woa2.7k
United States
Woa2.7k wrote:

Check also the database of useful biological numbers: http://bionumbers.hms.harvard.edu/

ADD COMMENTlink written 6.2 years ago by Woa2.7k
4
6.2 years ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith18k wrote:

Reference genome build/version numbers according to UCSC and NCBI (or sequencing consortium for the species of interest).

For example, hg19 = GRCh37.

The UCSC releases FAQ has a lot of these.

ADD COMMENTlink written 6.2 years ago by Malachi Griffith18k
3
6.2 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

Some ideas:

  1. In the Sanger encoding, runs of quality characters that look like censored swearing in a comic (e.g. $#&%!) correspond to very low qualities. Runs of quality characters composed of readable letters correspond to very high qualities.
  2. One always has to be aware of the genome size of the organism that is being sequenced. That is usually the first question I ask when someone starts talking about a genome I haven't heard of before.
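The swearing rule of thumb in point 1 is easy to check, because in the Sanger encoding the score is simply the character's ASCII code minus 33. A small sketch (the function name is made up for illustration):

```python
def sanger_scores(quality_string):
    """Phred scores for a Sanger-encoded quality string (ASCII code minus 33)."""
    return [ord(ch) - 33 for ch in quality_string]


print(sanger_scores("$#&%!"))  # [3, 2, 5, 4, 0]
print(sanger_scores("FGHIJ"))  # [37, 38, 39, 40, 41]
```

Punctuation sits at the low end of the printable ASCII range and uppercase letters start at 65, which is why letters mean scores of 32 and up.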
ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Istvan Albert ♦♦ 81k
1

Addendum to the genome size: Number of chromosomes is also important

ADD REPLYlink written 6.2 years ago by Philipp Bayer6.5k
3
6.2 years ago by
koukougogo50
koukougogo50 wrote:

Taxonomy ID for human: 9606, for mouse: 10090

ADD COMMENTlink written 6.2 years ago by koukougogo50

E. coli K12 511145

ADD REPLYlink written 6.2 years ago by Asaf6.4k
3
6.2 years ago by
Adam990
United States
Adam990 wrote:

I always found this book very useful for various 'should know' facts and figures: A Short Guide to the Human Genome

ADD COMMENTlink written 6.2 years ago by Adam990
1
6.2 years ago by
Ryan Thompson3.4k
TSRI, La Jolla, CA
Ryan Thompson3.4k wrote:

Roughly 15% of all human genetic variation is attributable to population, and the other 85% exists between people of the same population.

ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Ryan Thompson3.4k
1

http://en.wikipedia.org/wiki/Human_Genetic_Diversity:_Lewontin's_Fallacy http://westhunt.wordpress.com/2012/01/26/lewontins-argument/ See Sesardic's "Race: a social destruction of a biological concept". You'll find a pdf by googling.

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Click downvote670

The opposite is true: "race" is a purely social or political construction, coming from a time when nothing was known about population genetics. Race is a concept used to promote segregation, the exercise of power to rule and exploit, oppression and genocide, based on mostly irrelevant phenotypes, coarse geographic borders, and other arbitrary differences. It is comparable to phrenology. Its deconstruction does nothing but good for the scientific community and society in general.

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Michael Dondrup46k

reading up on it, i might agree that there are good reasons to not use the term about humans, even though it seems hard from a biological perspective to contend that race is any less real in humans than eg dogs. gnxp has a good write-up that i agree with: http://blogs.discovermagazine.com/gnxp/2011/11/on-structure-variation-and-race/#.UhZEr2SshJU

Main points: 1) Human populations can be easily separated into plausible clusters using a random set of genetic markers

2) The differences between human populations are not trivial

ADD REPLYlink written 6.2 years ago by Click downvote670
  1. Dog races are a product of artificial selection. They were produced by dog breeders, who gradually selected individuals over many generations. Since this is not the case for humans, I don't think the concept of race applies to humans in the same way it applies to dogs.
  2. Into how many clusters can human individuals be grouped? What is the logic for choosing the correct number of clusters?
  3. I think that social and environmental differences are far more important than the genetics. In any case, it has been shown that the differences between individuals of the same population are only slightly higher in number than the differences between individuals of two continental groups. See the case of Jim Watson, Craig Venter and Seong-Jin Kim: the latter is more similar to each of the other two than they are to each other.
ADD REPLYlink written 6.2 years ago by Giovanni M Dall'Olio26k

I'm not going to continue this boring discussion, but some clever people have argued that many human groups are the products of artificial selection too: see the http://en.wikipedia.org/wiki/The_10,000_Year_Explosion (which got a favorable review in Scientific American)

And comparing in-group and between-group differences is like comparing apples and oranges and seems to be a deceitful way of using statistics: http://evoandproud.blogspot.no/2011/11/apples-oranges-and-genes.html

Henry Harpending's comment is worth noting too: "Notice also that there is a subtle sleight-of-mind in the formulations of Lewontin's 'finding' (It was published previously by Luca Cavalli). Given that 15% of the diversity is among groups and 85% within, we are diploids so of the 85% half is between individuals and half is between alleles within people."

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Click downvote670

This was an interesting discussion, thank you for participating in it.

ADD REPLYlink written 6.2 years ago by Giovanni M Dall'Olio26k

Where did you get these numbers? It is dangerous to play with genetics and races, without references.

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Giovanni M Dall'Olio26k
3

I think it's a misinterpretation of the wikipedia article on "Race" (http://en.wikipedia.org/wiki/Race_(human_classification)#Genetically_differentiated_populations):

Population geneticist Sewall Wright developed one way of measuring genetic differences between populations known as the Fixation index, which is often abbreviated to FST. This statistic is often used in taxonomy to compare differences between any two given populations by measuring the genetic differences among and between populations for individual genes, or for many genes simultaneously.[79] It is often stated that the fixation index for humans is about 0.15. This translates to an estimated 85% of the variation measured in the overall human population is found within individuals of the same population, and about 15% of the variation occurs between populations. These estimates imply that any two individuals from different populations are almost as likely to be more similar to each other than either is to a member of their own group.[6][62] Richard Lewontin, who affirmed these ratios, thus concluded neither "race" nor "subspecies" were appropriate or useful ways to describe human populations.[80] However, others have noticed that group variation was relatively similar to the variation observed in other mammalian species.[81][82][83]

Note how it says "population" not "race". Imho the notion "race" should be avoided in a scientific context because I agree with:

In an ongoing debate, some geneticists[who?] argue that race is neither a meaningful concept nor a useful heuristic device,[84] and even that genetic differences among groups are biologically meaningless,[85] because more genetic variation exists within such races than among them, and that racial traits overlap without discrete boundaries.[86]

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Michael Dondrup46k

Thank you very much for your answer, Michael. If you are interested in a discussion of the concept of race in scientific terms, I can recommend the book "Fatal Invention" by Dorothy Roberts.

ADD REPLYlink written 6.2 years ago by Giovanni M Dall'Olio26k

"fatal" does not sound like a scientific term to me. i read about the book on amazon and it seems like Reductio ad Hitlerum for a few hundred pages to me. "Those who subscribe to the opinion that there are no human races are obviously ignorant of modern biology." Ernst Mayr.

ADD REPLYlink written 6.2 years ago by Click downvote670

I've read the book and it is not bad. It is true that it doesn't put scientists in a good position, but it does so in an objective way. We are talking about population genetics, a field which 100 years ago was called eugenics, and whose fathers studied how to improve the white race by means of applying laws. I liked the book because it exposes the history of population genetics in an objective (not scientific) way.

ADD REPLYlink written 6.2 years ago by Giovanni M Dall'Olio26k

Well, that's why I originally put "race" in quotes, but on further consideration, I realize that even that was not good enough. I've edited my post.

ADD REPLYlink written 6.2 years ago by Ryan Thompson3.4k