Question

Protocol for calling SNPs from a large number of BAM files

2

9.7 years ago

devenvyas ▴ 740

Fair warning: I am fairly new to dealing with nuclear NGS data.

Background: I am a molecular anthropology grad student. A little over a year ago, I got back Illumina HiSeq 2000 data from 90 mitochondrial-enriched libraries. Another grad student in the lab got back the same kind of data from 92 NRY-enriched libraries, with some sample overlap. I now have 171 BAM files that I have aligned to hg19.

I am now trying to see what I can do with the "junk" data (i.e., the autosomal + X data). I want to see if there is enough good data to do some variant calling. I have SNP data from 64 samples at ~330,000 rsIDs (the data are from an old set of HumanCNV370-Quads from 2008; I don't have the genomic coordinates). There is some overlap between the genotyped individuals and the sequenced individuals. I was wondering if anyone can give me some help, advice, or suggestions.

I need to convert the BAM files to VCFs, get rsIDs into the VCF files, filter the VCF files based on quality and type (I am only interested in SNPs, not indels, microsatellite variation, etc.), and then see how much overlap there is with the actual SNP data.
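As a rough illustration of the last step (checking overlap with the array data), here is a minimal Python sketch. It assumes the VCF already carries rsIDs in its ID column and that the array rsIDs are available as a plain set; the function names and the example rsIDs are made up for illustration and do not come from any particular tool.

```python
def vcf_rsids(vcf_lines):
    """Collect rsIDs from the ID column (third field) of VCF data lines."""
    ids = set()
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # The ID column may hold several semicolon-separated IDs, or "." if unknown.
        for vid in fields[2].split(";"):
            if vid.startswith("rs"):
                ids.add(vid)
    return ids


def overlap_summary(called_ids, array_ids):
    """Count how many rsIDs are shared between the calls and the array."""
    shared = called_ids & array_ids
    return {
        "called": len(called_ids),
        "array": len(array_ids),
        "shared": len(shared),
    }


# Made-up example data (hypothetical rsIDs, not real sites).
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\trs111\tA\tG\t50\t.\tDP=10",
    "chr1\t200\t.\tC\tT\t40\t.\tDP=7",
    "chr2\t300\trs222;rs333\tG\tA\t35\t.\tDP=9",
]
example_array_ids = {"rs111", "rs333", "rs999"}

summary = overlap_summary(vcf_rsids(example_vcf), example_array_ids)
print(summary)
```

This only counts shared sites; comparing the actual genotypes at those sites would be a separate step.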

I have an idea of how to convert the BAM files to VCF files, but beyond that I am lost.

Thanks!

-Deven

BAM VCF SNP • 2.9k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by devenvyas ▴ 740

Ram · Answer 1 · 2014-08-22

1

9.7 years ago

Zev.Kronenberg 12k

Check out GKNO. It was built for dealing with large datasets.

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by Zev.Kronenberg 12k

Ram · Answer 2 · 2014-08-22

0

9.7 years ago

devenvyas ▴ 740

I have asked UF's High Performance Computing folks to get it installed on the cluster (which could take a few days to get done). Is there any way I can try out another method on one or two of the BAM files in the meantime? I've got 171 jobs running right now converting the BAMs to raw, unfiltered VCF files without rsIDs.

ADD COMMENT • link 9.7 years ago by devenvyas ▴ 740

0

Yes. GKNO is a wrapper for the variant calling programs. You can check out the different variant callers while you wait to get the pipeline set up. Some of the more popular tools are samtools and GATK. You can also have a look at this paper, which will help you get acquainted with the methods.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by Zev.Kronenberg 12k

0

So I've read through the article and supplemental, and I am still quite confused. I know (or have an idea of) what tools are out there; what I am trying to figure out is how to actually use them.

I have already used samtools, bcftools, and vcftools to get my BAM files into VCF files following this guide (http://ged.msu.edu/angus/tutorials-2012/snp_tutorial.html), and I was able to figure out how to filter out the non-autosomal sites (i.e., X, Y, and mt).

My concerns now are: how do I filter out all non-SNPs, how do I filter out low-quality sites, and how do I get the rsIDs for whatever is left?
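To make those three steps concrete, here is a minimal pure-Python sketch of what each filter does to the VCF records. It is only an illustration under stated assumptions: the `min_qual` threshold of 20 is an arbitrary placeholder, the `rsid_by_site` lookup is a hypothetical dictionary (in practice it would have to be built from a dbSNP file on the same reference build), and the function names are made up. Dedicated tools such as bcftools or vcftools would normally do this job on real data.

```python
BASES = {"A", "C", "G", "T"}


def is_snp(ref, alt):
    """True if REF and every ALT allele are single nucleotides."""
    alleles = [ref] + alt.split(",")
    return all(a in BASES for a in alleles)


def filter_and_annotate(vcf_lines, rsid_by_site, min_qual=20.0):
    """Keep SNPs with QUAL >= min_qual and fill in missing IDs.

    rsid_by_site is a hypothetical lookup: (chrom, pos) -> rsID.
    Matching on position alone ignores alleles, so it is only a rough guide.
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line.rstrip("\n"))  # pass header lines through
            continue
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, vid, ref, alt, qual = fields[:6]
        # Step 1: drop indels and other non-SNP records.
        if not is_snp(ref.upper(), alt.upper()):
            continue
        # Step 2: drop records with missing or low QUAL.
        if qual == "." or float(qual) < min_qual:
            continue
        # Step 3: fill in an rsID where the ID column is empty.
        if vid == ".":
            fields[2] = rsid_by_site.get((chrom, int(pos)), ".")
        kept.append("\t".join(fields))
    return kept


# Made-up example records and a made-up rsID.
example_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\tDP=10",
    "chr1\t200\t.\tAT\tA\t60\t.\tINDEL;DP=8",
    "chr1\t300\t.\tC\tT\t5\t.\tDP=2",
    "chr2\t400\t.\tG\tA,T\t30\t.\tDP=6",
]
example_lookup = {("chr1", 100): "rs111"}

filtered = filter_and_annotate(example_lines, example_lookup)
for record in filtered:
    print(record)
```

In this toy input the indel at chr1:200 and the low-quality site at chr1:300 are dropped, and chr2:400 keeps "." as its ID because the lookup has no entry for it.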

ADD REPLY • link 9.7 years ago by devenvyas ▴ 740