Question

Need help with creating a Variant Caller for 10x data

0

7.7 years ago

richardsw • 0

Hi folks,

First of all, I'm an undergrad summer intern and this is my first time working in bioinformatics, so I have no idea what I am doing, and I am sure I will misuse many words in this post. I have been instructed to construct a variant caller that detects SNPs in a genome relative to a single reference genome. I am using BAM files produced by 10x Genomics sequencing (http://www.10xgenomics.com/technology). These BAM files assign barcodes to reads. Using this barcode information, I have been able to assign short reads to larger molecules (up to 700 kbp).
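The barcode-to-molecule assignment described above could be sketched roughly as follows. This is only an illustration and not the code used in the project: it assumes the barcode is stored in a `BX:Z:` tag on each SAM record (as in 10x Long Ranger output), and the function names and the `max_gap` threshold are hypothetical.

```python
from collections import defaultdict

def barcode_of(sam_fields):
    """Return the BX:Z: barcode from a split SAM line, or None if absent."""
    for tag in sam_fields[11:]:          # optional tags start at column 12
        if tag.startswith("BX:Z:"):
            return tag[5:]
    return None

def group_into_molecules(sam_lines, max_gap=50_000):
    """Cluster reads that share a barcode and chromosome into molecules.

    max_gap is a hypothetical threshold: a read that starts more than
    max_gap bases after the previous read with the same barcode is
    treated as the start of a new molecule.
    """
    by_key = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):         # header line
            continue
        f = line.rstrip("\n").split("\t")
        flag, chrom, pos = int(f[1]), f[2], int(f[3])
        bc = barcode_of(f)
        if bc is None or flag & 0x4:     # skip unbarcoded or unmapped reads
            continue
        by_key[(bc, chrom)].append((pos, f[0]))

    molecules = []
    for (bc, chrom), reads in by_key.items():
        reads.sort()
        current = [reads[0]]
        for read in reads[1:]:
            if read[0] - current[-1][0] > max_gap:
                molecules.append((bc, chrom, current))
                current = [read]
            else:
                current.append(read)
        molecules.append((bc, chrom, current))
    return molecules
```

Each returned tuple holds the barcode, the chromosome, and the sorted `(position, read_name)` pairs assigned to one molecule.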

I understand the general Bayesian methods used to detect variants, such as those in the freeBayes variant caller (http://arxiv.org/abs/1207.3907). So basically, I know that a bunch of reads belong to the same molecule, and thus to the same chromosome. How can I use this information to help me detect SNP variants? I am happy to answer any questions about what I have written or about 10x sequencing technology.

Any ideas or insight would be extremely helpful. If I could speak with or message somebody about this project, I would be very grateful. I am very lost and in over my head with this project.

Thanks, Will

sequencing SNP variant 10x • 3.2k views

ADD COMMENT • link 7.7 years ago by richardsw • 0

0

Current title makes it sound like you are announcing a new variant caller in this post when you are actually looking for one for 10x genomics data. You should amend the title to reflect that need.

ADD REPLY • link 7.7 years ago by GenoMax 141k

0

Good call, thank you!

ADD REPLY • link 7.7 years ago by richardsw • 0

0

Welcome to Biostars. I assume you don't have to construct a new variant caller and are free to use an existing one. Comparing sequenced reads to the reference genome is indeed the common method for detecting variants.

Specifically for this type of data this seems the most appropriate: http://www.10xgenomics.com/software/, but this employs GATK and/or FreeBayes.

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

0

Thanks! Unfortunately, I do have to write my own variant caller. I'm sure I can leverage existing callers such as freeBayes, but I somehow need to incorporate the barcode information I have generated so far. I'm sure it doesn't have to be state of the art or super efficient, but I was assigned to create my own variant caller for 10x data.

ADD REPLY • link 7.7 years ago by richardsw • 0

3

It is my understanding that barcodes are only for compartmentalizing initial data (which must have been done by 10x software already). I am not sure what kind of barcode information you are trying to incorporate in the variant calling.

There is an existing software suite that does most of what you have been tasked to do. This software is free (though not open source). Loupe visualization software requires a license.

ADD REPLY • link 7.7 years ago by GenoMax 141k

2

This is an ambitious project; you're going to want to take a step back and break it into stages. I'm assuming you know something about programming, so here are some steps that might help:

1. Work with SAM files instead of BAM; you can pipe the output of samtools view into your program's stdin, or use samtools to convert your BAM to SAM.
2. Familiarize yourself with the SAM file specification until you are comfortable writing a SAM parsing script that can keep track of which read aligns where, in what orientation, and how reads are related to each other with respect to the reference genome.
3. Once you have your parser written, call ALL VARIANTS with respect to the reference to begin with; you can apply probabilistic models later, but you need to at least be set up to call everything and then add code for exclusion/filtering.
4. Learn about the probability models currently used for variant calling, or do something simple and not state-of-the-art, such as a caller based on coverage and minor allele frequency.
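As a rough illustration of the steps above, here is a minimal Python sketch that parses SAM text lines, walks each CIGAR string to build per-position base counts, and reports non-reference bases using simple coverage and allele-frequency cutoffs. The function names, the cutoff defaults, and the `reference` dictionary (chromosome name to sequence string, assumed to be loaded elsewhere) are all hypothetical; it ignores base qualities and indels.

```python
import re
from collections import Counter, defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
SKIP_FLAGS = 0x4 | 0x100 | 0x200 | 0x400   # unmapped, secondary, QC fail, duplicate

def aligned_bases(pos, cigar, seq):
    """Yield (reference_position, read_base) for each aligned base of a read."""
    ref, i = pos, 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":                     # consumes read and reference
            for k in range(n):
                yield ref + k, seq[i + k]
            ref += n
            i += n
        elif op in "IS":                    # consumes read only
            i += n
        elif op in "DN":                    # consumes reference only
            ref += n
        # H and P consume neither

def pileup(sam_lines):
    """Count the bases observed at every covered reference position."""
    counts = defaultdict(Counter)
    for line in sam_lines:
        if line.startswith("@"):            # header line
            continue
        f = line.rstrip("\n").split("\t")
        flag, chrom, pos, cigar, seq = int(f[1]), f[2], int(f[3]), f[5], f[9]
        if flag & SKIP_FLAGS or cigar == "*" or seq == "*":
            continue
        for ref_pos, base in aligned_bases(pos, cigar, seq):
            counts[(chrom, ref_pos)][base.upper()] += 1
    return counts

def call_snps(counts, reference, min_depth=8, min_alt_frac=0.2):
    """Report sites where a non-reference base passes the (hypothetical) cutoffs."""
    calls = []
    for (chrom, pos), bases in sorted(counts.items()):
        depth = sum(bases.values())
        ref_base = reference[chrom][pos - 1].upper()   # SAM POS is 1-based
        for alt, n in bases.items():
            if alt in (ref_base, "N"):
                continue
            if depth >= min_depth and n / depth >= min_alt_frac:
                calls.append((chrom, pos, ref_base, alt, n, depth))
    return calls
```

Setting `min_depth=1` and `min_alt_frac=0` corresponds to step 3 (report every mismatch), and raising the cutoffs is the simple filter from step 4. Input could come from something like `samtools view sample.bam | python caller.py`, with the script passing `sys.stdin` to `pileup`.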

You will be lost following the freebayes source code unless you have lots of experience in C++. Erik Garrison is a knowledgeable and practiced developer, and it's not easy to follow well-developed code with no prior experience of the language standards or methods employed in the field.

ADD REPLY • link 7.7 years ago by Steven Lakin ★ 1.8k

0

Thank you very much Steven. I am experienced in programming and C++, and have been working on this project for a while. It is the genomics/bioinformatics part that I have no experience in. I have already done the first two steps you mentioned, and will take your advice for steps 3 and 4. I appreciate your advice!

ADD REPLY • link 7.7 years ago by richardsw • 0

1

Good stuff. Since you're farther along, some other papers to check out are the Samtools statistics paper, and this Nature paper (statistical section). Those, along with freebayes, are a good overview of current methods.
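To give a feel for what the statistical approaches compute, here is a heavily simplified Python sketch of diploid genotype likelihoods at a single site. It is not the exact model from freebayes or from either paper: it assumes independent reads, a uniform substitution error derived from the Phred base quality, and no priors. The function name and arguments are hypothetical.

```python
import math

def genotype_log_likelihoods(bases, quals, ref, alt):
    """Log10 likelihood of the observed bases under 0/0, 0/1 and 1/1.

    bases: base calls observed at one site; quals: matching Phred qualities.
    Each read is assumed to come from either allele with probability 0.5.
    """
    def p_base(b, allele, err):
        # Probability of observing b when the true allele is `allele`.
        return 1.0 - err if b == allele else err / 3.0

    genotypes = {"0/0": (ref, ref), "0/1": (ref, alt), "1/1": (alt, alt)}
    result = {}
    for name, (a1, a2) in genotypes.items():
        total = 0.0
        for b, q in zip(bases, quals):
            err = min(10 ** (-q / 10.0), 0.75)   # cap so probabilities stay > 0
            total += math.log10(0.5 * p_base(b, a1, err) + 0.5 * p_base(b, a2, err))
        result[name] = total
    return result

# The most likely genotype is the key with the largest log-likelihood.
likelihoods = genotype_log_likelihoods("AAAAGGGG", [30] * 8, ref="A", alt="G")
best = max(likelihoods, key=likelihoods.get)
```

A real caller would combine these likelihoods with genotype priors and model mapping errors as well; this only shows the basic per-site calculation.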

ADD REPLY • link 7.7 years ago by Steven Lakin ★ 1.8k