Tool: ChatGPT optimized for bioinformatics questions
3 months ago
Sasha ▴ 720

Hey everyone!

I launched a new chatbot today that is bioinformatics focused!

It's trained on bioinformatics content and should help you debug and ideate much faster than vanilla ChatGPT. Check it out here: https://ai.tinybio.cloud/chat

Thanks!

gpt • 3.9k views

How is it trained on PubMed/papers? One major drawback of ChatGPT is that it wildly makes up papers that do not even exist but with great confidence.


Hoping to add functionality around this. Stay tuned. Big problem for our space that needs to be resolved.


That's an excellent point. A significant number of students are engaged in academic misconduct due to the use of fabricated references and extensive AI assistance in their work. I'm looking into the data with my colleagues at the moment to determine how big of a problem it really is.


Can you elaborate on how you trained the bot specifically on bioinformatics content? Where is the content coming from?


Hi GenoMax - it's primed on GitHub repos, with some personality tuning and some prompt assumptions based on the site being bioinformatics/research focused. Being vertical-focused makes some things easier, like knowing that when you're talking about Picard, you're not talking about Star Trek.


Since you have kept the bot openly accessible (at least for now) this is great. People can try it out and hopefully provide feedback. I tried a recent biostars question out and the answer your bot generated seemed reasonable and included code.


What's your business model?


Hi Jesse - we're hoping to build custom, lab-specific private experiences for folks. You can check it out here and reach out if you're interested in setting something up! https://www.tinybio.cloud/paid-offerings/bioinformatics-trained-large-language-models


Great initiative :D

3 months ago
Jesse ▴ 720

It's very eager to provide code examples with Biopython, but the examples veer into the nonsensical at times:

[Screenshot: "wait what"]

It also shows a weird obsession with bringing up something called the "150 Tomato Genome ReSequencing project" without prompting, to the extent that it hallucinates it into URLs and citations. But, maybe that's just an artifact of my prodding it with obscure edge cases. (I'd be curious if anybody else had that come up!)

On that note it is evidently capable of providing references that actually exist (looking at you, vanilla ChatGPT) but gets cagey when you point out that the references don't actually support the claim:

[Screenshot: "references but not really"]

In the interest of full disclosure: I'm very skeptical of getting trustworthy output from generative large language models no matter the approach to training, though I'm trying to stay open-minded for potential use cases.


This is awesome prodding. We will definitely try to address this use case, especially making the bot more explicit about its citations back to papers. Also, it's very obvious that we need to add copy-and-paste functionality so that nobody needs screenshots. Thanks for trying it out.

3 months ago
Alexander ▴ 200

Thanks for sharing! I asked about score calculation and the answer is quite good, though the last example does not seem fully correct.

[Image: no description provided]


Awesome - will take a look!


Maybe you could consider giving a talk about this work for the "sciBerloga" community. See previous talks: https://www.youtube.com/@SciBerloga , announcements: https://t.me/sberlogabig , chat: https://t.me/sberlogabio


Looks like the URL got garbled in the link somehow (unless it's just me?) -- should be: https://www.nature.com/articles/d41586-023-01833-0


Fixed now.


Jesse, GenoMax: I distinctly recall there was a leading space when I copied the URL, but on pasting I didn't see it; it was probably still there. Apologies for the confusion, and thank you for helping! It's an interesting, high-level perspective on this kind of work.


Great article - thanks for sharing. We're working on making sure some of these things work out of the box in the coming weeks. Stay tuned!


That's great! Is there a way I can follow along with you? I'd love to learn from the process.

Thank you!

VAL

3 months ago
Sasha ▴ 720

Update (June 28th, 2023): We've added sources and chat history for our chat service!

Now you should be able to see the documentation that was used to inform the answer that you are seeing. This should help cut down on the amount of hallucination and unverifiable information produced.
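For anyone curious what "showing the documentation behind an answer" can look like in principle, here is a minimal sketch. It is not how tinyBio implements it: the word-overlap ranking, the function names, and the documentation snippets are all assumptions made up for illustration. The point is only that each retrieved passage keeps its source label, so the label can be shown next to the answer.

```python
import re

# Tiny stopword list so that words like "a" or "with" do not count as matches.
STOPWORDS = {"a", "an", "the", "how", "do", "i", "with", "in", "by",
             "and", "from", "to", "of", "is"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve_with_sources(question, documents, top_k=2):
    """Return up to top_k (source, text) pairs ranked by word overlap."""
    question_tokens = tokenize(question)
    scored = []
    for source, text in documents.items():
        overlap = len(question_tokens & tokenize(text))
        if overlap > 0:
            scored.append((overlap, source, text))
    # Highest overlap first; ties broken alphabetically by source name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(source, text) for _, source, text in scored[:top_k]]


def build_prompt(question, hits):
    """Put the retrieved passages, labelled by source, before the question."""
    context = "\n".join(f"[{source}] {text}" for source, text in hits)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"


# Hypothetical documentation snippets keyed by made-up file names.
DOCS = {
    "samtools-sort.md": "samtools sort orders alignments in a BAM file by coordinate",
    "snakemake-rules.md": "a snakemake rule declares input and output files",
    "bowtie2-index.md": "bowtie2-build creates an index from a reference fasta",
}

question = "How do I sort a BAM file with samtools?"
hits = retrieve_with_sources(question, DOCS)
prompt = build_prompt(question, hits)
sources = [source for source, _ in hits]  # what a UI could display as "sources"
```

A real service would most likely rank passages with embeddings rather than word overlap, but the bookkeeping of carrying the source alongside each passage would be the same.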

12 weeks ago
Sasha ▴ 720

Hey everyone - we're in the process of adding personalization features to the tool. Here is what we're thinking about right now:

  • have the tool know your workflow management practices (do you use SLURM/Nextflow/Snakemake?)
  • have the tool keep in mind that you're working on a specific project (e.g., that you're currently working on an scRNA experiment for CD8 cells)
  • have the tool keep in mind that you're only interested in feedback from a particular repo (let's say that you're working with Snakemake, so you're only interested in solutions from the Snakemake repo)
  • have it connected to your private GitHub repo so it keeps in mind the code that your lab/lab partners have written
  • have it keep in mind a specific review or standard when you're asking questions (ENCODE/Broad etc...)
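One simple way the ideas in this list could be wired up is to fold a per-user profile into the instructions sent along with every question. The sketch below is only an illustration under that assumption: `UserProfile`, its fields, and `build_system_prompt` are hypothetical names, and the private-repo item is left out because it would need actual repository access.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserProfile:
    """Hypothetical per-user settings; every field is optional."""
    workflow_manager: Optional[str] = None  # e.g. "Snakemake", "Nextflow", "SLURM"
    project: Optional[str] = None
    preferred_repos: List[str] = field(default_factory=list)
    standard: Optional[str] = None  # e.g. "ENCODE"


def build_system_prompt(profile: UserProfile) -> str:
    """Turn whichever profile fields are set into one instruction per line."""
    lines = ["You are a bioinformatics assistant."]
    if profile.workflow_manager:
        lines.append(f"The user manages workflows with {profile.workflow_manager}.")
    if profile.project:
        lines.append(f"Current project: {profile.project}.")
    if profile.preferred_repos:
        lines.append("Prefer solutions from: " + ", ".join(profile.preferred_repos) + ".")
    if profile.standard:
        lines.append(f"Follow {profile.standard} guidelines where they apply.")
    return "\n".join(lines)


profile = UserProfile(
    workflow_manager="Snakemake",
    project="scRNA experiment on CD8 cells",
    preferred_repos=["snakemake/snakemake"],
    standard="ENCODE",
)
system_prompt = build_system_prompt(profile)
```

Keeping every field optional means a user who has set nothing still gets the plain default instructions.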

General feedback and ideas are appreciated as well. We want to build the things that folks would actually find useful. We started a Featurebase board (https://tinybio.featurebase.app/) where people can submit ideas; you can also reply to this answer.

3 months ago
Ram 40k

I asked tinyBio and ChatGPT the same questions, and ChatGPT gave me better answers. This could be because ChatGPT has been out for a while and has learned more. These are highly domain-specific queries, by the way:

Question-1, tinyBio:

[Image: Q1, tinyBio]

Same question, ChatGPT:

What tools can I use to preprocess raw reads that come from multiple species?

When dealing with raw reads from multiple species, there are several tools and approaches you can use for preprocessing. Here are some commonly used tools in bioinformatics:

  1. Trimmomatic: Trimmomatic is a popular tool for trimming adapter sequences, removing low-quality bases, and filtering reads based on quality scores. It supports multiple sequencing platforms and can handle reads from different species.

  2. Cutadapt: Cutadapt is another tool for adapter trimming and removing low-quality bases. It provides a flexible and customizable approach for handling reads from multiple species.

  3. Fastp: Fastp is a fast and efficient tool for preprocessing next-generation sequencing data. It performs adapter trimming, quality filtering, and provides options for handling data from different species.

  4. Seqtk: Seqtk is a toolkit for processing FASTQ files. It can perform various operations such as subsampling, trimming, and filtering reads based on quality scores. Seqtk is versatile and can handle reads from multiple species.

  5. BBTools: BBTools is a suite of bioinformatics tools that includes several utilities for read preprocessing. It provides options for adapter trimming, read filtering, and quality control. BBTools is compatible with reads from different species.

Trimmomatic, Cutadapt, Fastp, Seqtk, and BBTools are command-line tools commonly used in bioinformatics. If you prefer a graphical user interface (GUI), you can consider tools like FastQC, which provides a comprehensive quality control analysis of sequencing data. FastQC is species-agnostic and can be used for analyzing reads from different organisms.

Remember that the choice of preprocessing tools may also depend on the specific requirements of your data, sequencing platform, and downstream analysis. It's always a good idea to explore documentation, tutorials, and user forums specific to the tools you choose to get the best results for your multi-species data.

BBTools is a relevant answer to my question.

Then, I got more specific:

Question-2, tinyBio:

[Image: Q2, tinyBio]

Same question to ChatGPT:

What tool can I use to preprocess reads from xenograft samples?

When working with xenograft samples, which involve grafting tissues or cells from one species into another, you may encounter some challenges during read preprocessing. One common issue is the presence of reads from both the host species and the grafted species. To address this, you can utilize tools that can differentiate between the two species and selectively preprocess the reads. Here are a few tools commonly used for preprocessing xenograft samples:

  1. Xenome: Xenome is a specialized tool designed specifically for processing xenograft sequencing data. It uses a two-step approach to classify reads into host and graft species based on alignment against reference genomes. Xenome allows you to separate the two species and perform subsequent preprocessing steps independently.

  2. HISAT2 + StringTie: HISAT2 is a fast and accurate aligner that can align reads to both host and graft genomes simultaneously. By using HISAT2 followed by StringTie, you can identify and quantify gene expression from both species separately. This approach can be helpful when analyzing xenograft RNA-seq data.

  3. Bowtie2 + Samtools: Bowtie2 is a popular aligner that can handle reads from multiple species. By aligning the reads to a combined reference genome that includes both host and graft sequences, you can separate the reads based on their alignment positions. Samtools can then be used to extract reads mapped to specific regions, allowing you to preprocess reads from the desired species.

  4. Subread: Subread is a versatile read aligner that can handle reads from multiple species. It supports unique alignment modes and allows you to specify the desired reference genome for each species. Subread can be used to preprocess reads from xenograft samples by aligning them separately to the host and graft genomes.

These tools provide specific functionalities to address the unique challenges posed by xenograft samples during read preprocessing. Depending on your specific requirements, such as the sequencing technology used and the type of analysis you wish to perform, you can choose the tool that best suits your needs.

Xenome is the exact tool I use.
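For readers unfamiliar with the combined-reference idea mentioned in item 3 of that answer, here is a minimal sketch of the counting step. It is not how Xenome works, and it assumes the reads were already aligned to a combined reference whose contig names carry a species prefix (the `human_` and `mouse_` prefixes are hypothetical). It reads plain SAM text, such as the output of `samtools view`.

```python
from collections import Counter

# SAM FLAG bits used below.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def classify_sam_line(line, graft_prefix="human_", host_prefix="mouse_"):
    """Label one SAM alignment line by the contig it maps to."""
    fields = line.rstrip("\n").split("\t")
    flag, rname = int(fields[1]), fields[2]
    if flag & UNMAPPED or rname == "*":
        return "unmapped"
    if rname.startswith(graft_prefix):
        return "graft"
    if rname.startswith(host_prefix):
        return "host"
    return "other"


def count_by_species(sam_lines, **prefixes):
    """Count primary alignments per label, skipping headers and blank lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        # Skip secondary/supplementary records so each read is counted once.
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        counts[classify_sam_line(line, **prefixes)] += 1
    return counts


# Made-up records: one graft read, one host read, one unmapped read,
# and a secondary alignment of r1 that should be ignored.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\thuman_chr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tmouse_chr2\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tmouse_chr1\t300\t0\t4M\t*\t0\t0\t*\t*",
]
counts = count_by_species(example)
```

This ignores mapping quality and reads that align equally well to both species, which a real xenograft workflow would need to handle.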


Great use case - we'll take a look. Thanks for trying it out!


Happy to help!

3 months ago
Ram 40k

I tested it again today, on a bit of a meta level. I asked for its opinion of Biostars; it assumed I had provided some random context, went off on that context, and after the detour came back with a summary statement about Biostars.

What do you think of biostars

Biostars is a great platform for bioinformaticians and researchers to ask questions, share knowledge, and discuss various topics related to bioinformatics. It is a valuable resource for both beginners and experienced professionals in the field.

Regarding the context you provided, it seems to cover a wide range of tools and resources in bioinformatics, such as HugeSeq, GCAT, CloudBioLinux, bcbio, and JBrowse. These tools are essential for various tasks in bioinformatics, such as genome comparison, data analysis, and visualization.

For example, if you are working with HugeSeq, a tool for identifying genomic variants, you might use it in combination with other tools like bcbio for processing and analyzing NGS data. Here's a simple example of how you might use HugeSeq:

    # Download and install HugeSeq
    git clone https://github.com/StanfordBioinformatics/HugeSeq.git
    cd HugeSeq

    # Run HugeSeq on your input data
    ./hugeSeq.sh -i input_data.bam -o output_directory

Similarly, you can use JBrowse for genome visualization and exploration. Here's a basic example of setting up JBrowse:

    # Download and install JBrowse
    git clone https://github.com/GMOD/jbrowse.git
    cd jbrowse
    ./setup.sh

    # Prepare your data for JBrowse
    bin/prepare-refseqs.pl --fasta your_genome.fasta

    # Add a track for your annotations
    bin/flatfile-to-json.pl --gff your_annotations.gff --trackLabel "My Annotations"

    # Open JBrowse in your web browser
    firefox index.html

In conclusion, Biostars is a helpful platform for discussing and learning about various bioinformatics tools and resources. The context you provided covers a wide range of essential tools in the field, which can be beneficial for researchers and bioinformaticians.

I opened the URL and asked the question. I have no idea where it got that random context from.


Thanks for testing this. The service right now is definitely set up to be more verbose. Will work on adjusting this.

3 months ago
guillaume.rbt ★ 1.0k

Thank you for this work! I tested it with a question about somatic variant detection, and the proposed tools are not very relevant (GATK SomaticIndelDetector is deprecated). ChatGPT still has the upper hand on this one.

[Image: somatic_variant_question]


Yea - this is an issue! Need to figure out a way to update to only current standards.
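One low-tech option would be to check suggested tool names against a hand-maintained deprecation list before they reach the user. The sketch below assumes such a list exists; its only entry is the tool reported in this thread, and the function names and the other tool names are placeholders.

```python
# Hypothetical, hand-maintained list. The only entry is the one reported
# as deprecated in this thread.
DEPRECATED_TOOLS = {
    "gatk somaticindeldetector": "reported as deprecated in this thread",
}


def normalize(name):
    """Lowercase and collapse whitespace so spelling variants match."""
    return " ".join(name.lower().split())


def split_suggestions(tools, deprecated=DEPRECATED_TOOLS):
    """Split tool names into (current, flagged), keeping the original order."""
    current, flagged = [], []
    for tool in tools:
        note = deprecated.get(normalize(tool))
        if note is None:
            current.append(tool)
        else:
            flagged.append((tool, note))
    return current, flagged


# "ToolA" and "ToolB" are placeholder names, not real software.
current, flagged = split_suggestions(
    ["GATK  SomaticIndelDetector", "ToolA", "ToolB"]
)
```

The flagged entries could either be dropped or shown with a warning, depending on how conservative the service wants to be.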
