Question: How can I deal with 50 very big .bam files without RGs for SNP calling?
Chris wrote, 7.3 years ago:

Hi everyone. As described, I have 50 large mapped .bam files (human exome, 50 individuals, 3 GB on average) that have no read groups (RGs), so I want to use Picard AddOrReplaceReadGroups to add them. My questions:

1: For each file I plan to set, for example, RGID=(1,2,3..50), RGLB=(Lb.1,2,3..50), RGPL=ILLUMINA, RGSM=(Tibet1,2,3..50). Is it right to add a different RG to each .bam file this way?

2: After getting the 50 new bams with RGs added, I will use GATK for base quality score recalibration (BQSR) and local realignment. Should I do this 50 times, once per bam? Or can I merge the 50 bams into a single one for this step and for downstream analysis such as SNP calling? If so, how do I merge them, and what should I watch out for?

Also, can somebody tell me how the 1000 Genomes Project does this, since that project has a large amount of data?

3: If not, it means I must produce another 100 new bams, 400-500 GB in total: 50 from GATK -T TableRecalibration in BQSR and 50 from IndelRealigner in local realignment. The computational cost is extremely high. Is there another way?

4: Steps like BQSR, local realignment and VQSR use known VCFs, which are available for human. But if the data come from another species that has no known VCF, how should I handle these parameters (for example, -knownsite)?

I appreciate your timely reply. Thanks!

Tags: gatk, snp

Arun (Germany) wrote, 7.3 years ago:

1) Yes, that seems right. The GATK FAQ should provide some information on this: http://www.broadinstitute.org/gsa/wiki/index.php/Frequently_Asked_Questions#What.27s_the_meaning_of_the_standard_read_group_fields
2) You would want to merge the BAM files. That's one of the purposes of having unique read group IDs. You can use the picard-tools MergeSamFiles to merge all the files together; it works on both SAM and BAM files. Then use BuildBamIndex to create the BAM index file (.bai).
3) Sorry, I don't understand.
4) We wrote a script to convert our files to VCF format. If you look at the VCF format, it requires 8 mandatory columns. It's very straightforward; it just takes time to write the conversion script.
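
As a rough illustration of points 1) and 2), here is a minimal Python sketch that only builds the Picard command lines for tagging each BAM with its own read group and then merging them. The file names (`sample1.bam`, `merged.bam`), the `picard.jar` path and the `RGPU` values are placeholders I made up, not anything from the question; `RGPU` is included because Picard's AddOrReplaceReadGroups asks for a platform unit. The `I=`/`O=` argument style is the legacy Picard syntax, so check it against the Picard version you have installed.

```python
from typing import List

N_SAMPLES = 50  # number of BAM files mentioned in the question


def add_rg_command(i: int, picard_jar: str = "picard.jar") -> List[str]:
    """Build an AddOrReplaceReadGroups command for sample number i.

    Input/output file names and the RGPU value are hypothetical placeholders.
    """
    return [
        "java", "-jar", picard_jar, "AddOrReplaceReadGroups",
        f"I=sample{i}.bam",
        f"O=sample{i}.rg.bam",
        f"RGID={i}",          # unique read group ID per file
        f"RGLB=Lb.{i}",       # library name as proposed in the question
        "RGPL=ILLUMINA",      # sequencing platform
        f"RGPU=unit{i}",      # placeholder platform unit (assumption)
        f"RGSM=Tibet{i}",     # sample name as proposed in the question
    ]


def merge_command(n: int, picard_jar: str = "picard.jar",
                  output: str = "merged.bam") -> List[str]:
    """Build a MergeSamFiles command over the n read-group-tagged BAMs."""
    cmd = ["java", "-jar", picard_jar, "MergeSamFiles"]
    cmd += [f"I=sample{i}.rg.bam" for i in range(1, n + 1)]
    cmd.append(f"O={output}")
    return cmd


def index_command(bam: str = "merged.bam",
                  picard_jar: str = "picard.jar") -> List[str]:
    """Build a BuildBamIndex command for the merged BAM."""
    return ["java", "-jar", picard_jar, "BuildBamIndex", f"I={bam}"]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected first.
    for i in range(1, N_SAMPLES + 1):
        print(" ".join(add_rg_command(i)))
    print(" ".join(merge_command(N_SAMPLES)))
    print(" ".join(index_command()))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; each list could be passed to `subprocess.run` once the paths are real.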


Chris replied, 7.3 years ago:

Sincere thanks! So you mean I can merge them into one single big bam for the downstream analysis, which takes care of the 3rd question. As for the 4th, I was asking not about the format of the VCF but about its usage. For example, with -T IndelRealigner [--known /path/to/indels.vcf], what if I have no known VCF file for this species but still want to use this parameter?

Arun replied, 7.3 years ago:

Yes. You can use samtools to obtain SNPs that you could treat as preliminary calls and pass them as input to GATK. See this: http://samtools.sourceforge.net/mpileup.shtml
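
If the preliminary calls are not already in VCF, a conversion script along the lines Arun mentions could look roughly like the sketch below. It writes only the 8 mandatory VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and fills unknown fields with `.`. The function name, the `(chrom, pos, ref, alt)` input tuples and the `VCFv4.1` header value are my own assumptions for illustration; the input is assumed to be already sorted in the reference's contig order, and GATK may want additional header lines.

```python
import io
from typing import Iterable, TextIO, Tuple

# The 8 mandatory fixed columns of a VCF record.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

# Hypothetical input shape: (chromosome, 1-based position, ref allele, alt allele).
Call = Tuple[str, int, str, str]


def write_minimal_vcf(calls: Iterable[Call], out: TextIO) -> None:
    """Write calls as a bare-bones VCF with only the mandatory columns.

    Records are written in the order given; no sorting is performed.
    """
    out.write("##fileformat=VCFv4.1\n")
    out.write("#" + "\t".join(VCF_COLUMNS) + "\n")
    for chrom, pos, ref, alt in calls:
        # ID, QUAL, FILTER and INFO are unknown here, so use the missing value ".".
        fields = [chrom, str(pos), ".", ref, alt, ".", ".", "."]
        out.write("\t".join(fields) + "\n")


if __name__ == "__main__":
    # Made-up example calls, purely to show the output layout.
    example = [("chr1", 100, "A", "G"), ("chr1", 250, "T", "C")]
    buf = io.StringIO()
    write_minimal_vcf(example, buf)
    print(buf.getvalue(), end="")
```

Writing to any text stream rather than a fixed path makes the same function usable with a real file opened via `open("known.vcf", "w")`.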