Question: Need help with creating a Variant Caller for 10x data
2.7 years ago, richardsw0 wrote:

Hi folks,

First of all, I'm an undergrad summer intern and this is my first time ever working in the field of bioinformatics, so I have no idea what I am doing. I am sure I am going to misuse many words in this post. I have been instructed to construct a variant caller to detect SNPs in a genome by comparison to a single reference genome. I am using BAM files from 10x Genomics sequencing (http://www.10xgenomics.com/technology). These BAM files assign barcodes to reads. Using this barcode information, I have been able to assign short reads to larger molecules (up to 700 kbp).
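For readers unfamiliar with the barcode step, here is a minimal Python sketch of how reads might be grouped into molecules. It assumes the barcode is stored in the BX:Z tag of each SAM record, and that reads sharing a barcode on the same chromosome belong to one molecule unless they are separated by more than max_gap bases. The function names and the 50 kb default are hypothetical and not taken from any 10x tool.

```python
from collections import defaultdict


def read_key(sam_line):
    """Return (barcode, chrom, pos) for one SAM record, or None if it has no BX tag."""
    fields = sam_line.rstrip("\n").split("\t")
    chrom, pos = fields[2], int(fields[3])  # RNAME and 1-based POS
    for tag in fields[11:]:                 # optional fields start at column 12
        if tag.startswith("BX:Z:"):         # assumption: barcode lives in BX
            return tag[5:], chrom, pos
    return None


def group_into_molecules(reads, max_gap=50_000):
    """Cluster (barcode, chrom, pos) tuples into inferred molecules.

    Returns a list of (barcode, chrom, start, end, n_reads) tuples.
    max_gap is a hypothetical threshold; tune it for your data.
    """
    by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        by_key[(barcode, chrom)].append(pos)

    molecules = []
    for (barcode, chrom), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current molecule and start a new one.
                molecules.append((barcode, chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        molecules.append((barcode, chrom, start, prev, count))
    return molecules
```

Each resulting tuple describes one inferred molecule, so every read inside it can be treated as coming from the same original DNA fragment.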

I understand the general Bayesian methods used to detect variants, such as those in the freeBayes variant caller (http://arxiv.org/abs/1207.3907). So basically, I have information that a bunch of reads belong to the same molecule, and thus the same chromosome. How can I use this information to help me detect SNP variants? I am happy to answer any questions about what I have written, or about 10x sequencing technology.
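As a point of reference, the per-site Bayesian idea can be sketched in a few lines of Python. This is a simplified diploid model of the kind used by standard callers, not the freeBayes implementation: each observed base is assumed independent, a sequencing error is assumed to produce each of the three other bases with equal probability, and the flat prior is a placeholder. The function names are hypothetical, and the sketch does not use barcode or molecule information at all.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def genotype_log_likelihoods(bases, error_rates, ref, alt):
    """Return log10 P(bases | genotype) for 0, 1 and 2 copies of the alt allele.

    error_rates must all be strictly greater than zero.
    """
    result = []
    for alt_copies in (0, 1, 2):
        total = 0.0
        for base, err in zip(bases, error_rates):
            # Probability of seeing this base if the read came from each allele.
            p_ref = 1 - err if base == ref else err / 3
            p_alt = 1 - err if base == alt else err / 3
            # A read samples the alt allele with probability alt_copies / 2.
            p = (alt_copies / 2) * p_alt + (1 - alt_copies / 2) * p_ref
            total += math.log10(p)
        result.append(total)
    return result


def most_likely_genotype(bases, error_rates, ref, alt, priors=(1 / 3, 1 / 3, 1 / 3)):
    """Return the alt-allele count (0, 1 or 2) with the highest posterior."""
    likes = genotype_log_likelihoods(bases, error_rates, ref, alt)
    posts = [lk + math.log10(pr) for lk, pr in zip(likes, priors)]
    return max(range(3), key=lambda g: posts[g])
```

One place where molecule-level information could plausibly enter such a model is the independence assumption, since reads from the same molecule are not independent draws from the two haplotypes.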

Any ideas or insight would be extremely helpful. If I could speak with or message somebody about this project, I would be very grateful. I am very lost and in over my head with this project.

Thanks, Will

sequencing snp variant 10x • 1.1k views
modified 2.7 years ago • written 2.7 years ago by richardsw0

Current title makes it sound like you are announcing a new variant caller in this post when you are actually looking for one for 10x genomics data. You should amend the title to reflect that need.

written 2.7 years ago by genomax65k

Good call, thank you!

written 2.7 years ago by richardsw0

Welcome to biostars. I assume you don't have to construct a new variant caller and are free to use an existing one. Comparing sequenced reads to the reference genome is indeed the common method to detect variants.

For this type of data specifically, the 10x software seems the most appropriate: http://www.10xgenomics.com/software/, though it employs GATK and/or FreeBayes.

written 2.7 years ago by WouterDeCoster38k

Thanks! Unfortunately, I do have to write my own variant caller. I'm sure I can leverage existing callers such as freeBayes, but I somehow need to incorporate the barcode information I have generated so far. I'm sure it doesn't have to be state of the art or super efficient, but I was assigned to create my own variant caller for 10x data.

written 2.7 years ago by richardsw0

It is my understanding that barcodes are only for compartmentalizing initial data (which must have been done by 10x software already). I am not sure what kind of barcode information you are trying to incorporate in the variant calling.

There is an existing software suite that does most of what you have been tasked to do. This software is free (though not open source). Loupe visualization software requires a license.

modified 2.7 years ago • written 2.7 years ago by genomax65k

This is an ambitious project; you're going to want to take a step back and do it in steps. I'm assuming you know something about programming, and here are some steps that might help:

  1. Work with SAM files instead of BAM; you can pipe the output of samtools view to your program's stdin, or use samtools to convert your BAM to SAM.
  2. Familiarize yourself with the SAM file specifications such that you are comfortable writing a SAM parsing script that can keep track of which read aligns where, in what orientation, and how they are related to other reads with respect to the reference genome.
  3. Once you have your parser written, call ALL VARIANTS with respect to reference to begin with; you can apply probabilistic models later, but you need to at least be set up to call everything and then add code for exclusion/filtering.
  4. Learn about probability models currently being used for variant calling, or do something simple and not state-of-the-art, such as something based on coverage and minor allele frequency.
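The four steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a production caller: the reference is assumed to be a dict mapping chromosome name to sequence string, base qualities and mapping qualities are ignored, and the function names and default thresholds are hypothetical.

```python
import re
from collections import Counter, defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Skip unmapped, secondary, duplicate and supplementary alignments.
SKIP_FLAGS = 0x4 | 0x100 | 0x400 | 0x800


def aligned_bases(pos, cigar, seq):
    """Yield (ref_pos, base) for every read base aligned to the reference."""
    ref, query = pos, 0
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":          # consumes both reference and read
            for i in range(length):
                yield ref + i, seq[query + i]
            ref += length
            query += length
        elif op in "IS":         # consumes read only
            query += length
        elif op in "DN":         # consumes reference only
            ref += length
        # H and P consume neither


def pileup(sam_lines):
    """Count observed bases at each (chrom, 1-based position)."""
    counts = defaultdict(Counter)
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        f = line.rstrip("\n").split("\t")
        flag, chrom, pos, cigar, seq = int(f[1]), f[2], int(f[3]), f[5], f[9]
        if flag & SKIP_FLAGS or cigar == "*" or seq == "*":
            continue
        for ref_pos, base in aligned_bases(pos, cigar, seq):
            counts[(chrom, ref_pos)][base] += 1
    return counts


def call_snps(counts, reference, min_depth=8, min_alt_frac=0.2):
    """Report non-reference bases that pass simple depth and frequency filters."""
    calls = []
    for (chrom, pos), bases in sorted(counts.items()):
        ref_base = reference[chrom][pos - 1]
        depth = sum(bases.values())
        for base, n in bases.items():
            if base != ref_base and depth >= min_depth and n / depth >= min_alt_frac:
                calls.append((chrom, pos, ref_base, base, n, depth))
    return calls
```

Passing sys.stdin to pileup lets the script read records piped from samtools view. Setting min_depth=0 and min_alt_frac=0.0 reports every mismatch (step 3), and raising the thresholds gives the simple coverage and allele-frequency filter from step 4.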

You will be lost following the freebayes source code unless you have lots of experience in C++. Erik Garrison is a knowledgeable and practiced developer, and it's not easy to follow well-developed code with no prior experience of the language standards or methods employed in the field.

written 2.7 years ago by Steven Lakin1.4k

Thank you very much Steven. I am experienced in programming and C++, and have been working on this project for a while. It is the genomics/bioinformatics part that I have no experience in. I have already done the first two steps you mentioned, and will take your advice for steps 3 and 4. I appreciate your advice!

written 2.7 years ago by richardsw0

Good stuff. Since you're farther along, some other papers to check out are the Samtools statistics paper, and this Nature paper (statistical section). Those, along with freebayes, are a good overview of current methods.

modified 2.7 years ago • written 2.7 years ago by Steven Lakin1.4k