The Biostar Herald publishes user-submitted links of bioinformatics relevance. It aims to provide a summary of interesting and relevant information you may have missed. You too can submit links here.
This edition of the Herald was brought to you by contributions from Istvan Albert and Wayne, and was edited by GenoMax and Istvan Albert.
DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer | Nature Biotechnology (www.nature.com)
Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10–25 kilobases), accurate ‘HiFi’ reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer–encoder for sequence correction.
submitted by: Istvan Albert
I've developed a #python package for querying UniProt's new REST API! Maybe the first to fully support the new format. Check it out at https://t.co/tUtK0XI7vv.
In particular I've tried hard to integrate with Python tooling, giving you great code completion:#bioinformatics https://t.co/eVHEKgV4F1 pic.twitter.com/HMVaKEjPvR
— Michael Milton (@multimeric) August 3, 2022
UniProt has been redesigned, and much of the code and many of the approaches used to access it need updating. This package, Unipressed by Michael Milton, provides a way to query the new UniProt API programmatically from Python.
submitted by: Wayne
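For readers who want a feel for what querying the new REST API involves, here is a minimal sketch that only builds a search URL using the Python standard library. It does not use Unipressed and is not how that package works internally. The helper name `build_search_url`, the example query, and the chosen return fields are illustrative assumptions; the endpoint `https://rest.uniprot.org/uniprotkb/search` and its `query`, `fields`, `format`, and `size` parameters are taken to match the public UniProt REST API, so check the UniProt documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL of the UniProtKB search endpoint in the new REST API.
UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, fields=("accession", "gene_names"), size=5):
    """Return a UniProtKB search URL (hypothetical helper, no network access)."""
    params = {
        "query": query,              # UniProt query string
        "fields": ",".join(fields),  # columns to return
        "format": "json",            # response format
        "size": size,                # results per page
    }
    # urlencode percent-encodes reserved characters and turns spaces into '+'
    return f"{UNIPROT_SEARCH_URL}?{urlencode(params)}"


# Example: human BRCA1 entries (the query itself is just an illustration)
url = build_search_url("gene:BRCA1 AND organism_id:9606")
print(url)
```

The resulting URL can then be fetched with any HTTP client. A dedicated package such as Unipressed adds typed queries and code completion on top of this kind of request.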
https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-021-02936-w
We compared TPM, FPKM, normalized counts using DESeq2 and TMM approaches, and we examined the impact of using variance stabilizing Z-score normalization on TPM-level data as well. We found that for our datasets, both DESeq2 normalized count data (i.e., median of ratios method) and TMM normalized count data generally performed better than the other quantification measures.
submitted by: Istvan Albert
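To make two of the quantification measures in the excerpt above concrete, here is a small pure-Python sketch of TPM and of median-of-ratios size factors. It is an illustration under simplifying assumptions, not the paper's pipeline and not the DESeq2 implementation: the function names and the toy counts are made up, and genes with a zero count in any sample are simply skipped when computing size factors.

```python
import math
from statistics import median


def tpm(counts, lengths_bp):
    """Transcripts per million for a single sample."""
    # Reads per kilobase of transcript for each gene
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    # Rescale so the values in the sample sum to one million
    return [r / total * 1e6 for r in rates]


def median_of_ratios_size_factors(count_matrix):
    """Size factors per sample; count_matrix[g][s] is gene g in sample s."""
    n_samples = len(count_matrix[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in count_matrix:
        if min(row) <= 0:
            continue  # geometric mean is undefined with zero counts
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for s, c in enumerate(row):
            log_ratios[s].append(math.log(c) - log_geo_mean)
    # Median ratio to the per-gene geometric mean, per sample
    # (raises StatisticsError if no gene is nonzero in every sample)
    return [math.exp(median(r)) for r in log_ratios]


# Toy data: sample 2 was sequenced twice as deeply as sample 1
toy_counts = [[10, 20], [100, 200], [50, 100], [0, 3]]
size_factors = median_of_ratios_size_factors(toy_counts)
normalized = [[c / f for c, f in zip(row, size_factors)] for row in toy_counts]

tpm_values = tpm([10, 20, 30], [1000, 2000, 3000])
```

In the toy data the second size factor comes out twice the first, so dividing by the size factors puts both samples on a common scale. TPM, by contrast, corrects for transcript length and rescales within each sample.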
When I typed up this R script, god and I were the only ones who knew how to interpret this jumbled mess. Now that I haven’t looked at it in 3 weeks, that information belongs to god and god alone.
— Megan Teig (@MeganTeig) August 30, 2022
submitted by: Istvan Albert
GitHub - eelhaik/PCA_critique (github.com)
Github repository for the paper: Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated
submitted by: Istvan Albert
Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated | Scientific Reports (www.nature.com)
Principal Component Analysis (PCA) is a multivariate analysis that reduces the complexity of datasets while preserving data covariance. The outcome can be visualized on colorful scatterplots, ideally with only a minimal loss of information. [...] We analyzed twelve common test cases using an intuitive color-based model alongside human population data. We demonstrate that PCA results can be artifacts of the data and can be easily manipulated to generate desired outcomes.
submitted by: Istvan Albert
New Creative Commons licenses to reflect the needs of academics (a thread):
CC-NO: the data are open to make the funding agency happy but if you re-use them you're a freeloader
CC-AU: the data are open, and if you want to re-use them please kindly add 12 co-authors
— Timothée Poisot, Ph.D. (@tpoi) August 28, 2022
submitted by: Istvan Albert
Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres | Genome Biology | Full Text (genomebiology.biomedcentral.com)
Nanopore long-read sequencing is an emerging approach for studying genomes, including long repetitive elements like telomeres. Here, we report extensive basecalling induced errors at telomere repeats across nanopore datasets, sequencing platforms, basecallers, and basecalling models. We find that telomeres in many organisms are frequently miscalled. We demonstrate that tuning of nanopore basecalling models leads to improved recovery and analysis of telomeric regions, with minimal negative impact on other genomic regions.
submitted by: Istvan Albert
Want to get the Biostar Herald in your email? Who wouldn't? Sign up righ'ere: toggle subscription