Question

Genomics/Computational Biology Jargon

7.3 years ago

biochemist87 • 0

Hello,

I will be starting my third rotation next semester, transitioning from experimental biology to more computational work, in a lab where the investigator studies evolutionary genetics in the context of adaptation to environmental stress. I was assigned some reading for her lab, and one of the papers already has a bunch of jargon I don't know (thanks to my limited computational background).

They mention that there are "35,468 transcripts from 29,143 unique loci", which to me sounds like there are alternative splice products, etc. The other thing is the mention of an N50: "When limiting the analyses to the longest transcript from each locus, the transcriptome size was 71,518,404 bp, with an N50 of 3,694 bp and a genome size of approximately 860 Mbp based on C-value estimates. The longest transcript was 66,752 bp in length, stemming from the gene coding for the largest known protein. This indicated that our analysis effectively captured even long transcripts present in the transcriptome."

There is also a mention of WGCNA (weighted gene correlation network analysis): "Weighted gene correlation network analysis (WGCNA) of the top 10,000 expressed genes revealed 15 modules of coexpressed genes (fig. 3A). Ten of the 15 modules were significantly correlated with habitat type (presence or absence of H2S), with modules 5 and 10 exhibiting correlation coefficients >0.9".

Any help would be great. I'm hoping that I will still want to pursue computational biology after this rotation.

transcripts unique loci N50 WGCNA • 1.3k views

updated 7.3 years ago by datascientist28 ▴ 560 • written 7.3 years ago by biochemist87 • 0

score 2 · Answer 1 · 2016-12-22

Google is always your friend:

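N50 - A way to summarize the contiguity of an assembly. Given a set of contigs, each with its own length, sort them from longest to shortest and add up the lengths until the running total reaches 50% of the total assembly length; the N50 is the length of the contig at which that happens. Put differently, half of all assembled bases sit in contigs at least as long as the N50. This is not the median contig length: if your assembly has 4000 contigs, the N50 contig is usually reached well before the 2000th longest one, because the long contigs hold most of the bases. In your paper the same statistic is applied to transcripts, so half of the 71,518,404 bp transcriptome is in transcripts of 3,694 bp or longer. There is a correlation between N50 and assembly quality (although it is NOT absolute): https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics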
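WGCNA - An R package for weighted correlation network analysis from Steve Horvath's lab at UCLA (https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/). It groups genes whose expression is correlated across samples into modules, which is where the "15 modules of coexpressed genes" in your quote come from. Horvath is also the author of the famous epigenetic clock paper.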
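Transcripts vs. loci - This statement needs more context to interpret fully, but having more transcripts (35,468) than loci (29,143) does mean that some loci produced more than one transcript, so reading it as alternative splice products is plausible.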
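
If it helps to see the N50 calculation spelled out, here is a minimal Python sketch. The function name `n50` and the toy lengths are made up for illustration and are not from the paper; real assemblers and QC tools may handle edge cases slightly differently.

```python
import statistics


def n50(lengths):
    """Return the N50 of a collection of contig (or transcript) lengths."""
    # Sort from longest to shortest.
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one length")

    # Walk down the list until the running total covers half of all bases.
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy data, not real transcript lengths.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]

print("N50:", n50(toy_lengths))
print("median:", statistics.median(toy_lengths))
```

On this toy set the N50 comes out larger than the median, because the N50 weights each sequence by how many bases it contributes rather than counting every sequence equally.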