The Biostar Herald publishes user-submitted links of bioinformatics relevance. It aims to provide a summary of interesting and relevant information you may have missed. You too can submit links here.
This edition of the Herald was brought to you by contributions from Jeremy Leipzig and Istvan Albert, and was edited by Istvan Albert.
Commonly used software tools produce conflicting and overly-optimistic AUPRC values | Genome Biology | Full Text (genomebiology.biomedcentral.com)
The precision-recall curve (PRC) and the area under the precision-recall curve (AUPRC) are useful for quantifying classification performance. They are commonly used in situations with imbalanced classes, such as cancer diagnosis and cell type annotation. We evaluate 10 popular tools for plotting PRC and computing AUPRC, which were collectively used in more than 3000 published studies. We find the AUPRC values computed by the tools rank classifiers differently and some tools produce overly-optimistic results.
submitted by: Istvan Albert
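One plausible source of the disagreement described above is that there is more than one way to turn the same precision-recall points into an area. The sketch below is not taken from the paper or from any of the evaluated tools; it uses made-up labels and scores, and the function names are hypothetical. It compares a step-wise sum (often called average precision) with trapezoidal linear interpolation.

```python
def pr_points(labels, scores):
    """Return (recall, precision) pairs, one per distinct score threshold."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    points = []
    tp = fp = 0
    i = 0
    while i < len(order):
        # Examples tied at the same score are consumed together.
        threshold = scores[order[i]]
        while i < len(order) and scores[order[i]] == threshold:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def auprc_step(points):
    """Step-wise area: each recall increment weighted by the precision reached."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def auprc_trapezoid(points):
    """Trapezoidal area: linear interpolation between consecutive points.

    Assumption: the curve starts at recall 0 with the first observed precision.
    """
    area = 0.0
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area


# Toy data (hypothetical): 3 positives among 8 examples, already sorted by score.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

points = pr_points(labels, scores)
print("step-wise  :", round(auprc_step(points), 4))
print("trapezoidal:", round(auprc_trapezoid(points), 4))
```

On this toy input the two conventions give different areas for the same classifier output, which is enough to change a ranking when two classifiers score closely.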
x.com (twitter.com)
submitted by: Istvan Albert
Flawed machine-learning confounds coding sequence annotation | bioRxiv (www.biorxiv.org)
Detecting protein coding genes in genomic sequences is a significant challenge for understanding genome functionality, yet the reliability of bioinformatic tools for this task remains largely unverified. This is despite some of these tools having been available for several decades, and being widely used for genome and transcriptome annotation. We perform an assessment of nucleotide sequence and alignment-based de novo protein-coding detection tools. The controls we use exclude any previous training dataset and include coding exons as a positive set and length-matched intergenic and shuffled sequences as negative sets. Our work demonstrates that several widely used tools are neither accurate nor computationally efficient for the protein-coding sequence detection problem.
submitted by: Istvan Albert
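To make the idea of a shuffled, length-matched negative set concrete, here is a minimal sketch. The abstract does not say how the authors shuffled their sequences, so the simple mononucleotide shuffle below is an assumption; the function name and the example sequence are hypothetical.

```python
import random


def shuffled_negative(seq, seed=0):
    """Return a shuffled copy of seq.

    The result has the same length and base composition as the input,
    but the order of the bases (and therefore any codon structure) is
    randomized. A fixed seed keeps the control reproducible.
    """
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


# Made-up sequence standing in for a coding exon (positive example).
exon = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
negative = shuffled_negative(exon)

print(len(exon), len(negative))
print(sorted(exon) == sorted(negative))
```

A shuffle like this keeps length and composition fixed, so a tool that still calls the shuffled sequence "coding" is likely responding to composition rather than to coding signal.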
TileDB newsletter - May 2024 (tiledb.com)
TileDB has several May 2024 life sciences announcements:
Population genomics (TileDB-VCF)
- one-click ingestion of VCF datasets released
- QC sample stats computed
- SARS-CoV-2 (5.2M samples) and dog genomes (1161 dogs) public TileDB-VCF datasets released. Scientists in the virology and companion animals domains are encouraged to get a free account and collaborate with us to publish analyses on these datasets.
Single-cell (TileDB-SOMA)
- TileDB-SOMA now supports block-processing of data
Biomedical imaging (TileDB-Bioimaging)
- TileDB-Bioimaging now supports ingestion of NDPI images
submitted by: Jeremy Leipzig
Empowering bioinformatics communities with Nextflow and nf-core | bioRxiv (www.biorxiv.org)
The recent development of Nextflow Domain-Specific Language 2 (DSL2) allows pipeline components to be shared and combined across projects. The nf-core community has harnessed this with a library of modules and subworkflows that can be integrated into any Nextflow pipeline, enabling research communities to progressively transition to nf-core best practices.
submitted by: Istvan Albert
Highly accurate metagenome-assembled genomes from human gut microbiota using long-read assembly, binning, and consolidation methods | bioRxiv (www.biorxiv.org)
We performed a deep-sequencing experiment using PacBio HiFi reads to obtain metagenome-assembled genomes (MAGs) from a pooled human gut microbiome. [...] Based on strict similarity scores, we found 125 MAGs were unequivocally shared across the assembly methods at the strain level, representing ∼22% of the total MAGs recovered per method. Finally, we detected more total viral sequences in the metaMDBG assembly versus the hifiasm-meta assembly (∼6,700 vs. ∼4,500). Overall, we find the use of HiFi sequencing, improved metagenome assembly methods, and complementary binning strategies is highly effective for rapidly cataloging microbial genomes in complex microbiomes.
submitted by: Istvan Albert
Want to get the Biostar Herald in your email? Who wouldn't? Sign up righ'ere: toggle subscription