Concepts in phased assembly: Contig: a contiguous sequence in an assembly. A contig does not contain long stretches of unknown sequence (aka assembly gaps). Scaffold: a sequence consisting of one or more contigs connected by assembly gaps of typically inexact sizes. A scaffold is also called a supercontig, though this terminology is rarely used nowadays. Haplotig: a contig that comes from a single haplotype. In an unphased assembly, a contig may join alleles from different parental haplotypes in a diploid or polyploid genome. Switch error: a change from one parental allele to another parental allele on a contig (see the figure below). This terminology has been used for measuring reference-based phasing accuracy for two decades. A haplotig is supposed to have no switch errors. Yak hamming error: an allele not on the most supported haplotype of a contig (see the figure below). Its main purpose is to test how close a contig is to a haplotig. This definition is tricky. The terminology was perhaps first used by Porubsky et al (2017) in the context of reference-based phasing. However, adapting it for contigs is not straightforward. The yak definition is not widely accepted. The hamming error rate is arguably less important … go to blog
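To make the distinction between the two error types concrete, here is a toy illustration in R (my own sketch, not yak's implementation): label each phased position on a contig with its parental origin and count the two error types.

    # Toy illustration (not yak's implementation). Each phased position on
    # a contig is labelled with its parental origin: "P" (paternal) or
    # "M" (maternal).
    origins <- c("P", "P", "M", "M", "P")

    # Switch errors: adjacent positions whose parental origin changes.
    switch_errors <- sum(head(origins, -1) != tail(origins, -1))  # 2 here

    # Yak-style hamming errors: positions not on the most supported
    # (majority) haplotype of the contig.
    majority <- names(which.max(table(origins)))                  # "P"
    hamming_errors <- sum(origins != majority)                    # 2 here

Note how the two counts can diverge: a single mid-contig switch on a long contig produces one switch error but many hamming errors.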
If you haven’t heard about Clubhouse yet… well, it’s the latest Silicon Valley unicorn, and the popular new chat hole for thought leaders. I heard about it for the first time a few months ago, and was kindly offered an invitation (Clubhouse is invitation-only!) so I could explore what it is all about. […] go to blog
Our new editorial on equity, diversity, and inclusion in data science is out in BioData Mining. go to blog
It seems like this discussion comes up a lot. Choosing a differential expression (DE) tool can change the results and conclusions of studies; how much depends on the strength of the data. Indeed, this has been covered by others already (here, here). In this post I will compare these tools for a particular dataset that highlights the different ways these algorithms perform. So you can try it out at home, I've uploaded the code for this to GitHub here. To get it to work, clone the repository and open the Rmarkdown (.Rmd) file in RStudio. You will need to install the Bioconductor packages edgeR and DESeq2 and the CRAN package eulerr (for the nifty Venn diagrams). Once that's done you can start working with the Rmd file. For an introduction to Rmd, read on here. Suffice to say that it's the most convenient way to generate a data report that includes code, results and descriptive text for background information, interpretation, references and the rest. To create the report in RStudio, click the "knit" button and it will execute the code and you'll get an HTML report. If you're working in the R console, for example if you are connecting to a server via … go to blog
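For reference, the setup described above boils down to a few commands; a minimal sketch (the .Rmd filename here is a stand-in, since the repository's actual filename isn't shown in this excerpt):

    # Install the required packages (run once)
    install.packages(c("BiocManager", "eulerr", "rmarkdown"))
    BiocManager::install(c("edgeR", "DESeq2"))

    # Render the report from the R console; this is what the "knit"
    # button in RStudio does under the hood
    rmarkdown::render("de_comparison.Rmd", output_format = "html_document")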
Theophilus Painter (1889–1969) joined the faculty at the University of Texas in 1916 and, except for a military stint during World War I, stayed there his whole career. He was, in succession, instructor, associate professor, professor, distinguished professor, acting president (1944–1946), and president (1946–1952) of the University of Texas. These days, he is remembered for three things: (1) erroneously claiming that humans possess 48 chromosomes—an error that plagued human cytogenetics for 33 years, (2) his willingness to curry favor with corrupt politicians in Texas at the expense of academic freedom, and (3) his support of racist practices. The number of human chromosomes: Probably the first effort to determine the chromosome number of humans was that of Hansemann (1891), who counted 18, 24 and more than 40 chromosomes in three cells. Between 1891 and 1932, many investigators published papers reporting the chromosome number in humans. With two exceptions, the counts were low, ranging from 8 to 24 as the diploid number. However, for a long time the most influential paper in human cytogenetics was that of van Winiwarter* (1912), who reported a chromosome number of 47 in males and 48 in females. He concluded that humans, like locusts, had an XX/X0 male … go to blog
COVID-19 was and remains a major crisis in many countries, disrupting general life as well as scientific research. But how has it impacted scientific output in genomics? To evaluate this I investigated the number of papers published in PubMed Central (PMC) in the period from 2016 through 2020. I used the total number of papers as well as those matching the genomics search term, with the approach below: (genom*[Abstract]) AND ("2020"[Publication Date] : "2020"[Publication Date]) Here are the numbers of total papers and genomics papers published annually over this period. What you can see is that genomics experienced a major fall in the number of papers appearing in PMC in 2020, while the total number of papers did not. Indeed, 2020 was the only year since 2000 in which the number of published genomics papers actually went down compared to the previous year. There are a lot of ways to estimate how much output the genomics field lost in 2020. Based on the mean year-on-year growth over the past five years (7.5%), we might have expected ~30,550 papers; instead, output was 3.6% lower than the year before. This would suggest that the pandemic (and maybe other … go to blog
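Counts like these can be reproduced programmatically. A minimal sketch using the rentrez package (an assumption on my part; the post only shows the query itself, not how it was run):

    # Count PMC papers matching the genomics query for one year
    # (using rentrez is my assumption; the post doesn't name a tool)
    library(rentrez)
    query <- '(genom*[Abstract]) AND ("2020"[Publication Date] : "2020"[Publication Date])'
    res <- entrez_search(db = "pmc", term = query, retmax = 0)
    res$count  # number of matching papers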
Romano JD, Moore JH. Ten simple rules for writing a paper about scientific software. PLoS Comput Biol. 2020 Nov 12;16(11):e1008390. doi: 10.1371/journal.pcbi.1008390. PMID: 33180774; PMCID: PMC7660560. [PubMed] [PLoS Comp Bio] Abstract: Papers describing software are an important part of computational fields of scientific research. These "software papers" are unique in a number of ways, and they require special consideration to improve their impact on the scientific community and their efficacy at conveying important information. Here, we discuss 10 specific rules for writing software papers, covering some of the different scenarios and publication types that might be encountered, and important questions that all computational researchers would benefit from asking along the way. go to blog
I get asked a lot about the best ways to store sequence data, because the files are massive and researchers have varying levels of knowledge of the hardware and software. Here I'll run through some best practices for genomics research data management based on my 10 years of experience in the space. 1. Always work on servers, not local machines or laptops. On-prem machines and cloud servers are preferred because you can log into them from anywhere using SSH or another protocol. These machines are better suited to heavy loads and are less likely to break down, thanks to institutional tech support and maintenance. Institutional data transfer speeds will be far superior to your home network. Never do computational work on a laptop. Avoid storing data on your own portable hard drives or flash drives. If you don't have a server, ask for access at your institution or research cloud provider (we use Nectar in Australia). 2. Download the data to the place where you will be working on it the most. The raw sequencing data should be downloaded into a project subfolder called "fastq" or similar. I recommend using a command-line tool because these are better suited to really large files. … go to blog
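To illustrate the layout in rule 2, a sketch only (the URL and project name are hypothetical, and a command-line downloader such as wget or curl would serve equally well, as the post recommends):

    # Create the project layout and fetch raw data into the "fastq" subfolder
    dir.create("my_project/fastq", recursive = TRUE)
    download.file(
      url      = "https://example.org/run123_R1.fastq.gz",  # hypothetical URL
      destfile = "my_project/fastq/run123_R1.fastq.gz",
      mode     = "wb"  # binary mode matters for gzipped files
    )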
Ian Holmes has a Twitter poll right now on the use of “SNP” (single-nucleotide polymorphism) versus “SNV” (single-nucleotide variant). I have been bugged by the two terminologies for years, so I decided to write a blog post on it. Personally, I use “SNP” for germline events and “SNV” for somatic events, but I understand others think differently. Here are my thoughts. The wiki page for SNP defines a SNP as a nucleotide change “that is present in a sufficiently large fraction of the population (e.g. 1% or more)”. However, such a frequency-based definition is not actionable in practice. Allele frequency varies a lot across populations. Due to genetic drift and selection, an allele at 5% frequency in Africa may be absent from the rest of the world. Is this a SNP or not? Furthermore, the observed allele frequency fluctuates with sampling and sample size. An allele at 2% frequency in the 1000 Genomes Project (1KG) may become 0.5% in gnomAD. Is this a SNP or not? If it is impractical to set a frequency threshold, the definition of SNP shouldn’t require a frequency threshold. Historically, we have been using “SNP” without a frequency threshold for decades. If you search … go to blog
It’s been a while. I hope you are all well. Shall we make some charts? About this time last year, one of my life-long dreams came true when I was told that I could work from home indefinitely. One effect of this – I won’t say downside – is that I don’t get through as … Continue reading Florence Nightingale’s “rose charts” (and others) in ggplot2 go to blog
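The excerpt cuts off before the code, but the idea of a Nightingale rose chart in ggplot2 is simple: a bar chart wrapped onto polar coordinates. A minimal sketch with made-up numbers (not the author's data or code):

    library(ggplot2)

    # Made-up monthly counts, just to show the shape of a rose chart
    df <- data.frame(
      month  = factor(month.abb, levels = month.abb),
      deaths = c(32, 41, 29, 57, 63, 78, 85, 91, 70, 52, 44, 38)
    )

    # A bar chart becomes a "rose" once wrapped onto polar coordinates
    ggplot(df, aes(x = month, y = deaths)) +
      geom_col(fill = "steelblue", width = 1) +
      coord_polar() +
      labs(title = "A Nightingale-style rose chart")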
This software company did something that I didn't expect... putting the variant calling on the phone itself: "The variant calling is run directly on the phone, extracting the data from the file on your phone, processed on the phone, and only at the end the VCF file could be shared in the cloud for annotation and reporting by an accredited physician," Ascari said. The physician is necessary to assure that the result is of "diagnostic quality," he added. I honestly expected CRAM and cloud-based calling with encrypted exchange of BAM reads (made feasible by 5G mobile networks), plus app-store-like customised reports for what you want to know about your genetics. Other than security concerns, I don't see why I would want variant calling on the phone, though. Previous posts: IonCRAM: a reference-based compression tool for Ion Torrent sequence files; Future of Genomics: 10 bold predictions; What 5G mobile networks portends for the future of personal genomics. go to blog
More spatial profiling news coming in from AGBT -- Harvard spin-out VizGen is launching an instrument implementing MERFISH technology in the U.S. This sub-$300K instrument will initially enable panels of up to 500 genes to be profiled, with plans to expand that capacity to 1,000. Users either pick from a menu of pre-designed panels or select genes using a Gene Panel Design Tool, and VizGen then manufactures the panel in around two weeks. VizGen CEO Terry Lo and Senior Director of Marketing Brittany Auclair were kind enough to give me a preview last Friday. Read more » go to blog
Rant is ON! I've been having an utterly miserable experience with the LabRoots conference software that AGBT is using for their virtual meeting. This year has exposed many of us to a wide variety of teleconference and virtual meeting software, and many of the glitches are small and hard to pin down, or matters of personal preference (though if you don't share mine, you are simply wrong!). But now, on two major platforms, I've come across serious issues with LabRoots. Read more » go to blog
My prediction that spatial would be a hot topic at AGBT was easy to make knowing I was sitting on embargoed news in the spatial space. This morning Rebus Biosystems announced the launch of the Rebus Esper system for wide-field spatial profiling of gene panels with subcellular resolution. Rebus is promising that this instrument will offer true walkaway automation from fluidics through imaging and data processing, requiring only one hour of hands-on time. Read more » go to blog
Getting some miscellanea out before AGBT21 starts later this morning. Read more » go to blog
This year marks the 20th anniversary of the publication of the human genome reference sequence. As I enjoy recounting to people outside of the genomics field, the investment required to complete that initial assembly is staggering: ten years, dozens of laboratories, hundreds of sequencing instruments, and a billion dollars. Today, using the latest next-generation sequencing, […] The post Genome Reference: Moving to Build 38 appeared first on KidsGenomics. go to blog
I'll call it now -- the big buzz at this year's AGBT will be around spatial profiling. Trust me, it's not just a hunch. The two current players in the field -- NanoString and 10X Genomics -- both have significant presence in the virtual conference. Don't be surprised to see more players on the field -- just sayin'. Read more » go to blog
Pacific Biosciences continued its roll of successful business development, snagging $900M from Japan's SoftBank two weeks ago. Combined with a recent secondary stock offering and a major deal with Invitae, PacBio has gone from their self-proclaimed near-derelict status during the Illumina acquisition attempt saga to rolling in cash.Read more » go to blog
10X Genomics had an online event Wednesday called Xperience (as far as I could tell, no Jimi Hendrix music was used -- a missed opportunity!) to lay out their development roadmap. This largely paralleled the presentation given at J.P. Morgan, but there were a few new bits and of course much more technical detail to whet the appetites of scientists -- and judging from a number of very positive tweets I saw today, they were successful in that goal. Some of the 10X management was kind enough to walk me through the deck earlier this week and gave me permission to borrow images from it, so this summary is based on that as well as watching the presentation. While their name is 10X, the company emphasized progress on three axes: scale, resolution and access, and that progress spans their three different platforms. Read more » go to blog
Presented 18th February 2021. Abstract: Gene expression is governed by numerous chromatin modifications. Understanding these dynamics is critical to understanding human health and disease, but there are few software options for researchers looking to integrate multi-omics data at the level of pathways. To address this, we developed mitch, an R package for multi-contrast gene set enrichment analysis. It uses a rank-MANOVA statistical approach to identify sets of genes that exhibit joint enrichment across multiple contrasts. In this talk I will demonstrate using mitch and showcase its advanced visualisation features to explore the regulation of signaling and biochemical pathways at the chromatin level. go to blog
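For the curious, a typical mitch workflow looks roughly like this, a sketch based on the package's documented interface (argument names may differ between versions; de1/de2 stand for differential expression result tables and "pathways.gmt" is a placeholder gene set file):

    # Sketch of a multi-contrast enrichment run with mitch
    library(mitch)
    genesets <- gmt_import("pathways.gmt")                 # placeholder GMT file
    m <- mitch_import(list(contrast1 = de1,                # de1/de2: DE tables
                           contrast2 = de2), DEtype = "edger")
    res <- mitch_calc(m, genesets, priority = "effect")    # rank-MANOVA enrichment
    mitch_report(res, "enrichment_report.html")            # HTML report with plots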
There's a question that others pop my way pretty much every year around J.P. Morgan: would I ever attend myself? I'll confess it never occurred to me before I was asked, but that isn't necessarily a deal breaker. I foolishly didn't attend AGBT until 2013 when Alexis Borisy (then CEO of Warp Drive) suggested I go -- I think it was mostly because he thought it was a good investment and probably only secondarily to keep me off the ski slopes for a week -- I shattered my knee just after AGBT 2012 ended. It's an interesting but complex question which I will answer one way here, but freely admit that over coffee I could be nudged one way or the other.Read more » go to blog
I claimed in my Miscellanea piece that I was one post away from being done with J.P. Morgan -- oops, forgot I had drafted a minor screed on data display which I'll push out before the last piece -- particularly since I hinted I would be taking Genapsys to task on this subject. Unexpectedly good timing too: maybe new Genapsys CEO Jason Myers' first big initiative can be to fix this plot! Read more » go to blog
Before J.P. Morgan is truly a month ago I should clean up some loose ends as a penultimate post driven by this year's virtual conference (the last post isn't exactly time sensitive). In contrast to the single company focused items that preceded it, this is a grab bag of minor observations and notes.Read more » go to blog
Manduchi E, Fu W, Romano JD, Ruberto S, Moore JH. Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses. BMC Bioinformatics. 2020 Oct 1;21(1):430. doi: 10.1186/s12859-020-03755-4. PMID: 32998684; PMCID: PMC7528347. [PubMed] [BMC Bioinformatics] Abstract: Background: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis. Results: We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids 'leakage' during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj. Conclusions: In this work, … go to blog
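The leakage-free adjustment the abstract describes can be illustrated in a few lines. A conceptual sketch of my own, in R rather than the paper's Python/TPOT code: the regression used to remove the covariate is fitted on the training fold only, then applied to held-out data.

    # Conceptual sketch of leakage-free covariate adjustment (not the
    # paper's TPOT extension). Made-up data: a feature confounded by a
    # covariate.
    set.seed(1)
    dat <- data.frame(covariate = rnorm(100))
    dat$feature <- 2 * dat$covariate + rnorm(100)
    train <- dat[1:70, ]
    test  <- dat[71:100, ]

    # Fit the adjustment on the training fold ONLY...
    fit <- lm(feature ~ covariate, data = train)
    train$feature_adj <- resid(fit)

    # ...then apply those same parameters to the test fold, so no
    # information from the held-out data leaks into the adjustment.
    test$feature_adj <- test$feature - predict(fit, newdata = test)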
Moore JH. Ten important roles for academic leaders in data science. BioData Min. 2020 Oct 26;13:18. doi: 10.1186/s13040-020-00228-5. PMID: 33117434; PMCID: PMC7586691. [PubMed] [BioData Mining] Abstract: Data science has emerged as an important discipline in the era of big data and biological and biomedical data mining. As such, we have seen a rapid increase in the number of data science departments, research centers, and schools. We review here ten important leadership roles for a successful academic data science chair, director, or dean. These roles include the visionary, executive, cheerleader, manager, enforcer, subordinate, educator, entrepreneur, mentor, and communicator. Examples specific to leadership in data science are given for each role. go to blog
La Cava W, Williams H, Fu W, Vitale S, Srivatsan D, Moore JH. Evaluating recommender systems for AI-driven biomedical informatics. Bioinformatics. 2020 Aug 7:btaa698. doi: 10.1093/bioinformatics/btaa698. Epub ahead of print. PMID: 32766825. [PubMed] [Bioinformatics] Abstract: Motivation: Many researchers with domain expertise are unable to easily apply machine learning to their bioinformatics data due to a lack of machine learning and/or coding expertise. Methods that have been proposed thus far to automate machine learning mostly require programming experience as well as expert knowledge to tune and apply the algorithms correctly. Here, we study a method of automating biomedical data science using a web-based platform that uses AI to recommend model choices and conduct experiments. We have two goals in mind: first, to make it easy to construct sophisticated models of biomedical processes; and second, to provide a fully automated AI agent that can choose and conduct promising experiments for the user, based on the user's experiments as well as prior knowledge. To validate this framework, we experiment with hundreds of classification problems, comparing to state-of-the-art, automated approaches. Finally, we use this tool to develop predictive models of septic shock in critical care patients. Results: We find that matrix factorization-based recommendation systems outperform meta-learning methods … go to blog
Li R, Chen Y, Ritchie MD, Moore JH. Electronic health records and polygenic risk scores for predicting disease risk. Nat Rev Genet. 2020 Aug;21(8):493-502. doi: 10.1038/s41576-020-0224-1. Epub 2020 Mar 31. PMID: 32235907. [PubMed] [Nature Reviews] Abstract: Accurate prediction of disease risk based on the genetic make-up of an individual is essential for effective prevention and personalized treatment. Nevertheless, to date, individual genetic variants from genome-wide association studies have achieved only moderate prediction of disease risk. The aggregation of genetic variants under a polygenic model shows promising improvements in prediction accuracies. Increasingly, electronic health records (EHRs) are being linked to patient genetic data in biobanks, which provides new opportunities for developing and applying polygenic risk scores in the clinic, to systematically examine and evaluate patient susceptibilities to disease. However, the heterogeneous nature of EHR data brings forth many practical challenges along every step of designing and implementing risk prediction strategies. In this Review, we present the unique considerations for using genotype and phenotype data from biobank-linked EHRs for polygenic risk prediction. go to blog
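For readers new to the topic, the "aggregation of genetic variants under a polygenic model" mentioned in the abstract is, at its simplest, a weighted sum of risk-allele dosages. A toy sketch with made-up numbers (not from the Review):

    # Toy polygenic risk score: PRS_i = sum_j beta_j * dosage_ij.
    # All numbers are made up for illustration.
    set.seed(42)
    dosages <- matrix(rbinom(3 * 5, size = 2, prob = 0.3),
                      nrow = 3)                        # 3 people x 5 SNPs
    beta <- c(0.12, -0.05, 0.30, 0.08, -0.20)          # hypothetical GWAS effects
    prs <- as.vector(dosages %*% beta)                 # one score per person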
A 12-minute overview of my artificial intelligence and machine learning research program [YouTube] go to blog