Question: Protocol for calling SNPs from a large number of BAM files
6.1 years ago, devenvyas (Stony Brook) wrote:

Fair warning: I am a bit of a noob at dealing with nuclear NGS data.

Background: I am a molecular anthropology grad student. A little over a year ago, I got back Illumina HiSeq 2000 data from 90 mitochondrial-enriched libraries. Another grad student in the lab got back the same kind of data from 92 NRY-enriched libraries, with some sample overlap. I now have 171 BAM files that I have aligned to hg19.

I am now trying to see what I can do with the "junk" data (i.e., the autosomal + X data). I want to see if there is enough good data to do some variant calling. I have SNP data from 64 samples at ~330,000 rsIDs (the data is from an old set of HumanCNV370-Quads from 2008; I don't have the genomic coordinates). There is some overlap between the genotyped individuals and the sequenced ones. I was wondering if anyone can give me some help, advice, or suggestions.

I need to convert the BAM files to VCFs, get rsIDs into the VCF files, filter the VCF files based on quality and type (I am only interested in SNPs, not indels, microsat variation, etc.), and then see how much overlap there is with the actual SNP data.

I have an idea of how to convert the BAM files to VCF files, but beyond that I am lost.

Thanks!

-Deven
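
The last step in the question, checking how many called sites overlap with the array data, could be sketched roughly as below. This is only an illustration in plain Python, assuming the VCF already carries rsIDs in its ID column and the array rsIDs are available as a set. The function names (`vcf_rsids`, `overlap_summary`) and the example records are made up for this sketch and do not come from any of the tools mentioned in the thread.

```python
def vcf_rsids(vcf_lines):
    """Collect rsIDs from the ID column (third column) of VCF data lines."""
    ids = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a well-formed data line
        # The ID column may hold several identifiers separated by ';'
        for ident in fields[2].split(";"):
            if ident.startswith("rs"):
                ids.add(ident)
    return ids


def overlap_summary(array_ids, called_ids):
    """Count rsIDs shared between the array set and the called set."""
    array_ids = set(array_ids)
    called_ids = set(called_ids)
    shared = array_ids & called_ids
    return {
        "array_sites": len(array_ids),
        "called_sites": len(called_ids),
        "shared": len(shared),
        "fraction_of_array": len(shared) / len(array_ids) if array_ids else 0.0,
    }


# Made-up records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\trs111\tA\tG\t50\t.\t.",
    "chr1\t200\t.\tC\tT\t60\t.\t.",
    "chr2\t300\trs333\tG\tA\t45\t.\t.",
]
example_array_ids = {"rs111", "rs222", "rs333", "rs444"}

summary = overlap_summary(example_array_ids, vcf_rsids(example_vcf))
print(summary)
```

This only counts shared sites. Comparing the actual genotypes at those sites would need the sample columns as well, which the sketch ignores.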

 

snp bam noob vcf • 2.0k views
6.1 years ago, Zev.Kronenberg (United States) wrote:

Check out GKNO. It was built for dealing with large datasets:

http://gkno.me

6.1 years ago, devenvyas (Stony Brook) wrote:

I have asked UF's High Performance Computing folks to get it installed on the cluster (which could take a few days). Is there any way I can try out another method on one or two of the BAM files in the meantime? I've got 171 jobs running right now converting the BAMs to raw, unfiltered, un-rsID-ed VCF files.


Yes. GKNO is a wrapper for the variant calling programs. You can check out the different variant callers while you wait to get the pipeline set up. Some of the more popular tools are samtools and GATK. You can also have a look at this paper, which will help you get acquainted with the methods:

http://bib.oxfordjournals.org/content/early/2013/01/21/bib.bbs086.full

Reply written 6.1 years ago by Zev.Kronenberg

So I've read through the article and supplemental, and I am still sufficiently confused. I know (or have an idea of) what tools are out there; what I am trying to figure out is how to actually use them.

I have already used samtools, bcftools, and vcftools to get my BAM files into VCF files following this guide (http://ged.msu.edu/angus/tutorials-2012/snp_tutorial.html), and I was able to figure out how to filter out the non-autosomal sites (i.e., X, Y, & mt).

My concerns now are: how do I filter out all non-SNPs, how do I filter out low-quality sites, and how do I get the rsIDs for whatever's left?

Reply modified 6.1 years ago • written 6.1 years ago by devenvyas
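
To make the three open questions concrete (drop non-SNPs, drop low-quality sites, attach rsIDs), here is a minimal sketch in plain Python that works on uncompressed VCF text. It is not how any particular tool does it, and dedicated tools such as bcftools and vcftools provide their own filters. The assumptions are: the `min_qual=30.0` cutoff is an arbitrary placeholder, only single-base REF/ALT pairs count as SNPs (so multi-allelic sites are dropped too), and `rsid_lookup` is a hypothetical dictionary from `(chromosome, position)` to rsID that would have to be built from a dbSNP release on the same reference build and chromosome naming as the BAMs.

```python
BASES = {"A", "C", "G", "T"}


def is_biallelic_snp(ref, alt):
    """True when REF and ALT are each a single base (no indels, no ALT lists)."""
    return ref.upper() in BASES and alt.upper() in BASES


def filter_and_annotate(vcf_lines, rsid_lookup, min_qual=30.0):
    """Yield header lines unchanged and only the SNP records passing min_qual.

    rsid_lookup is a hypothetical dict mapping (chrom, pos) to an rsID;
    records without a match keep whatever ID they already had.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line  # keep the header as-is
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a well-formed VCF data line
        chrom, pos, _old_id, ref, alt, qual = fields[:6]
        if not is_biallelic_snp(ref, alt):
            continue  # indel, multi-allelic, or other non-SNP record
        if qual == "." or float(qual) < min_qual:
            continue  # missing or low site quality
        fields[2] = rsid_lookup.get((chrom, int(pos)), fields[2])
        yield "\t".join(fields)


# Made-up records and lookup, for illustration only.
example_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\tDP=12",   # SNP, passes, has an rsID
    "chr1\t150\t.\tAT\tA\t80\t.\tDP=9",   # deletion, dropped
    "chr1\t200\t.\tC\tT\t10\t.\tDP=3",    # SNP but low quality, dropped
    "chr2\t300\t.\tG\tA\t45\t.\tDP=7",    # SNP, passes, no rsID known
]
example_lookup = {("chr1", 100): "rs111"}

kept = [
    rec for rec in filter_and_annotate(example_lines, example_lookup)
    if not rec.startswith("#")
]
for rec in kept:
    print(rec)
```

Filtering on QUAL alone is a simplification; depth and mapping quality in the INFO column are also commonly considered, and the right thresholds depend on the coverage of the off-target reads.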