Hey everyone!
I launched a new bioinformatics-focused chatbot today!
It's trained on bioinformatics content and should help you debug and ideate much faster than vanilla ChatGPT. Check it out here: https://ai.tinybio.cloud/chat
Thanks!
It's very eager to provide code examples with Biopython, but the examples veer into the nonsensical at times:
It also shows a weird obsession with bringing up something called the "150 Tomato Genome ReSequencing project" without prompting, to the extent that it hallucinates it into URLs and citations. But, maybe that's just an artifact of my prodding it with obscure edge cases. (I'd be curious if anybody else had that come up!)
On that note it is evidently capable of providing references that actually exist (looking at you, vanilla ChatGPT) but gets cagey when you point out that the references don't actually support the claim:
In the interest of full disclosure: I'm very skeptical of getting trustworthy output from generative large language models no matter the approach to training, though I'm trying to stay open-minded for potential use cases.
This is awesome prodding. We'll definitely try to address this use case, especially making it more explicit about citations back to papers. Also, it's very obvious that we need to add copy-and-paste functionality so that nobody has to rely on screenshots. Thanks for trying it out.
Thanks for sharing! I asked about score calculation, and the answer is quite good, though the last example does not seem fully correct.
Maybe you could consider giving a talk about this work for the "sciBerloga" community. See previous talks: https://www.youtube.com/@SciBerloga , announcements: https://t.me/sberlogabig , chat: https://t.me/sberlogabio
Looks like the URL got garbled in the link somehow (unless it's just me?) -- should be: https://www.nature.com/articles/d41586-023-01833-0
Update (June 28th, 2023): We've added sources and chat history to our chat service!
Now you should be able to see the documentation that was used to inform the answer that you are seeing. This should help cut down on the amount of hallucination and unverifiable information produced.
Hey everyone - we're in the process of adding personalization features to the tool. Here is what we're thinking about right now:
General feedback and ideas are appreciated as well. We want to build the things that folks would actually find useful. We started a Featurebase board (https://tinybio.featurebase.app/) where people can submit ideas; replying to this answer works too.
I asked tinyBio and ChatGPT the same questions, and ChatGPT gave me better answers. This could be because ChatGPT has been out for a while and has learned more. This is a highly domain-specific query, by the way:
Question-1, tinyBio:
Same question, ChatGPT:
What tools can I use to preprocess raw reads that come from multiple species?
When dealing with raw reads from multiple species, there are several tools and approaches you can use for preprocessing. Here are some commonly used tools in bioinformatics:
Trimmomatic: Trimmomatic is a popular tool for trimming adapter sequences, removing low-quality bases, and filtering reads based on quality scores. It supports multiple sequencing platforms and can handle reads from different species.
Cutadapt: Cutadapt is another tool for adapter trimming and removing low-quality bases. It provides a flexible and customizable approach for handling reads from multiple species.
Fastp: Fastp is a fast and efficient tool for preprocessing next-generation sequencing data. It performs adapter trimming, quality filtering, and provides options for handling data from different species.
Seqtk: Seqtk is a toolkit for processing FASTQ files. It can perform various operations such as subsampling, trimming, and filtering reads based on quality scores. Seqtk is versatile and can handle reads from multiple species.
BBTools: BBTools is a suite of bioinformatics tools that includes several utilities for read preprocessing. It provides options for adapter trimming, read filtering, and quality control. BBTools is compatible with reads from different species.
Trimmomatic, Cutadapt, Fastp, Seqtk, and BBTools are command-line tools commonly used in bioinformatics. If you prefer a graphical user interface (GUI), you can consider tools like FastQC, which provides a comprehensive quality control analysis of sequencing data. FastQC is species-agnostic and can be used for analyzing reads from different organisms.
Remember that the choice of preprocessing tools may also depend on the specific requirements of your data, sequencing platform, and downstream analysis. It's always a good idea to explore documentation, tutorials, and user forums specific to the tools you choose to get the best results for your multi-species data.
BBTools is a relevant answer to my question.
Then, I got more specific:
Question-2, tinyBio:
Same question to ChatGPT:
What tool can I use to preprocess reads from xenograft samples?
When working with xenograft samples, which involve grafting tissues or cells from one species into another, you may encounter some challenges during read preprocessing. One common issue is the presence of reads from both the host species and the grafted species. To address this, you can utilize tools that can differentiate between the two species and selectively preprocess the reads. Here are a few tools commonly used for preprocessing xenograft samples:
Xenome: Xenome is a specialized tool designed specifically for processing xenograft sequencing data. It uses a two-step approach to classify reads into host and graft species based on alignment against reference genomes. Xenome allows you to separate the two species and perform subsequent preprocessing steps independently.
HISAT2 + StringTie: HISAT2 is a fast and accurate aligner that can align reads to both host and graft genomes simultaneously. By using HISAT2 followed by StringTie, you can identify and quantify gene expression from both species separately. This approach can be helpful when analyzing xenograft RNA-seq data.
Bowtie2 + Samtools: Bowtie2 is a popular aligner that can handle reads from multiple species. By aligning the reads to a combined reference genome that includes both host and graft sequences, you can separate the reads based on their alignment positions. Samtools can then be used to extract reads mapped to specific regions, allowing you to preprocess reads from the desired species.
Subread: Subread is a versatile read aligner that can handle reads from multiple species. It supports unique alignment modes and allows you to specify the desired reference genome for each species. Subread can be used to preprocess reads from xenograft samples by aligning them separately to the host and graft genomes.
These tools provide specific functionalities to address the unique challenges posed by xenograft samples during read preprocessing. Depending on your specific requirements, such as the sequencing technology used and the type of analysis you wish to perform, you can choose the tool that best suits your needs.
Xenome is the exact tool I use.
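For anyone curious what the "combined reference" idea in the ChatGPT answer looks like in practice, here is a minimal sketch. To be clear, this is not how Xenome works and it is not a substitute for it. It only assumes that the reads were already aligned to a combined host + graft reference whose contig names carry a species prefix. The `human_` and `mouse_` prefixes, the function names, and the toy SAM records are all made up for illustration.

```python
from collections import Counter

# Hypothetical contig-name prefixes in a combined reference (an assumption,
# not a convention of any particular tool); adjust to your own naming.
GRAFT_PREFIX = "human_"
HOST_PREFIX = "mouse_"


def classify_read(rname, flag):
    """Assign one SAM record to 'graft', 'host', 'unmapped', or 'other'."""
    if flag & 0x4 or rname == "*":  # 0x4 = read unmapped (SAM FLAG bit)
        return "unmapped"
    if rname.startswith(GRAFT_PREFIX):
        return "graft"
    if rname.startswith(HOST_PREFIX):
        return "host"
    return "other"


def classify_sam_lines(lines):
    """Count primary alignments per class from an iterable of SAM text lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM has 11 mandatory fields
            continue
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:  # skip secondary / supplementary
            continue
        counts[classify_read(fields[2], flag)] += 1
    return counts


# Toy records, invented for this example.
toy_sam = [
    "@SQ\tSN:human_chr1\tLN:1000",
    "@SQ\tSN:mouse_chr1\tLN:1000",
    "r1\t0\thuman_chr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tmouse_chr1\t20\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t256\tmouse_chr1\t30\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
counts = classify_sam_lines(toy_sam)
print(dict(counts))
```

The sketch ignores mapping quality, so reads that align equally well to both species are simply counted wherever the aligner placed them. Handling that ambiguity properly is exactly the kind of thing a dedicated tool is for.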
I tested it again today, on a bit of a meta level. I asked it for its opinion of Biostars. It assumed I had supplied some random context, went off on a detour about that context, and only then came back with a summary statement about Biostars.
What do you think of biostars
Biostars is a great platform for bioinformaticians and researchers to ask questions, share knowledge, and discuss various topics related to bioinformatics. It is a valuable resource for both beginners and experienced professionals in the field.
Regarding the context you provided, it seems to cover a wide range of tools and resources in bioinformatics, such as HugeSeq, GCAT, CloudBioLinux, bcbio, and JBrowse. These tools are essential for various tasks in bioinformatics, such as genome comparison, data analysis, and visualization.
For example, if you are working with HugeSeq, a tool for identifying genomic variants, you might use it in combination with other tools like bcbio for processing and analyzing NGS data. Here's a simple example of how you might use HugeSeq:
Download and install HugeSeq
git clone https://github.com/StanfordBioinformatics/HugeSeq.git
cd HugeSeq
Run HugeSeq on your input data
./hugeSeq.sh -i input_data.bam -o output_directory
Similarly, you can use JBrowse for genome visualization and exploration. Here's a basic example of setting up JBrowse:
Download and install JBrowse
git clone https://github.com/GMOD/jbrowse.git
cd jbrowse
./setup.sh
Prepare your data for JBrowse
bin/prepare-refseqs.pl --fasta your_genome.fasta
Add a track for your annotations
bin/flatfile-to-json.pl --gff your_annotations.gff --trackLabel "My Annotations"
Open JBrowse in your web browser
firefox index.html
In conclusion, Biostars is a helpful platform for discussing and learning about various bioinformatics tools and resources. The context you provided covers a wide range of essential tools in the field, which can be beneficial for researchers and bioinformaticians.
I opened the URL and asked the question. I have no idea where it assumed some random context from.
Thank you for this work!
I tested it with a question about somatic variant detection, and the proposed tools are not very relevant (GATK SomaticIndelDetector is deprecated). ChatGPT still has the upper hand for this one.
How is it trained on PubMed/papers? One major drawback of ChatGPT is that it confidently makes up papers that do not even exist.
Hoping to add functionality around this. Stay tuned. Big problem for our space that needs to be resolved.
That's an excellent point. A significant number of students are engaged in academic misconduct due to the use of fabricated references and extensive AI assistance in their work. I'm looking into the data with my colleagues at the moment to determine how big of a problem it really is.
Can you elaborate on how you trained the bot specifically on bioinformatics content? Where is the content coming from?
Hi GenoMax - it's primed on GitHub repos, some personality tuning, and some prompt assumptions based on the context of the site being bioinformatics/research focused. Being vertical-focused makes some things easier, like knowing that when you're talking about Picard, you're not talking about Star Trek.
Since you have kept the bot openly accessible (at least for now) this is great. People can try it out and hopefully provide feedback. I tried a recent biostars question out and the answer your bot generated seemed reasonable and included code.
What's your business model?
Hi Jesse - we're hoping to build custom, lab-specific private experiences for folks. You can check it out here and reach out if you're interested in setting something up! https://www.tinybio.cloud/paid-offerings/bioinformatics-trained-large-language-models
Great initiative :D