How to create vcf file with SNPs bed file and sorted bam file?
1
1
Entering edit mode
3.4 years ago
O.rka ▴ 710

I found this post but I'm still not sure how to do this: vcf from sorted bam

Here's my genome file and index:

../../db/genomes/human/GRCh38.p13/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

../../db/genomes/human/GRCh38.p13/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai

Here's my bam file:

201858321_S1.trimmomatic.GCA_000001405.15_GRCh38_no_alt_analysis_set.sorted.bam

Here's my SNPs file:

hg38.snp147.bed

How can I generate a VCF file that uses annotations from the hg38.snp147.bed and my bam file?

RNA-Seq genome variant • 1.6k views
ADD COMMENT
0
Entering edit mode

Are you familiar with the process of variant calling?

ADD REPLY
0
Entering edit mode

A bit. I've only really done it for prokaryotic organisms with snippy. Right now I'm trying the following:

source activate vcftools_env && pv 201858321_S1.trimmomatic.GCA_000001405.15_GRCh38_no_alt_analysis_set.sorted.bam | samtools view -h | bcftools mpileup -f ../../db/genomes/human/GRCh38.p13/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --threads ${N_JOBS} - | gzip -c > 201858321_S1.trimmomatic.GCA_000001405.15_GRCh38_no_alt_analysis_set.sorted.vcf.gz

I thought that pv would give me a progress bar (it's not) or else I wouldn't have used stdin for bcftools.

I understand the mpileup will generate a vcf file but I'm wondering if there is a way to include annotations for SNPs so I can get rs identifier flags? If not my plan was to convert to bed file and then compare bed files.

I don't really understand the difference between the different VCF files on slide 2 here: https://www.ebi.ac.uk/sites/ebi.ac.uk/files/content.ebi.ac.uk/materials/2014/140217_AgriOmics/dan_bolser_snp_calling.pdf

ADD REPLY
0
Entering edit mode
3.4 years ago
Qiongyi ▴ 180

Hi O.rka, I think it is better that you learn how to run GATK (i.e. the process of variant calling) to get this task done. You need to follow the GATK best practices.

To include annotations from your SNP file, you can either write a customised script to add it after the variant calling step or add your SNP file in the "--dbsnp" option. I currently only have an example using GATK3. Just for your reference, I pasted it below. The important thing is that you need to follow the GATK best practices to properly call the variants from your bam file.

java -jar $ev{"GATK"}/GenomeAnalysisTK.jar -l INFO \
                    -T HaplotypeCaller \
                    -I $input_BAM \
                    -R $ev{"REF"} \
                    -nct 2 \
                    --dbsnp $ev{"dbsnp_VCF"} \
                    --out $out_raw/raw.vcf \
                    -stand_call_conf 30.0 \
                    -stand_emit_conf 10.0 \
                    -L $bed);
ADD COMMENT
0
Entering edit mode

I've always heard of GATK but I haven't had the pleasure to try it out yet. I just installed 2 conda environments one with v3.8-1-0-gf15c1c3ef and v4.1.9.0. I didn't see the --dbsnp option in either of them. What is that command? This is very helpful. Do you think you could help me figure out how to do this with v4? If not, maybe v3?

ADD REPLY
0
Entering edit mode

In my previous reply, I pasted the GATKv3 HaplotypeCaller command for you reference with the "--dbsnp" option, but you still couldn't find the "--dbsnp" option. I really don't know what to say next... As I mentioned before, I highly recommend you read the GATK document.

Since you have BAM file, you may refer to the below general guidance:

# Mark duplicates using Picard MarkDuplicates;
# GATK Indel realignment (optional);
# GATK BQSR;
# Call variants (or generate gvcf file) using GATK HaplotypeCaller;
# If you generate gvcf file for each sample, then you may need GATK CombineGVCFs & GATK GenotypeGVCFs to perform multiple sample variants calling;
# Filter variants either with VQSR or by hard-filtering.
ADD REPLY
0
Entering edit mode

I will try the commend above. The reason I asked about the --dbsnp argument is because I don't know what to use for this. Is this an input vcf file (https://gatk.broadinstitute.org/hc/en-us/community/posts/360062537671-GATK-4-1-7-0-does-not-annotate-ID-using-dbSNP-build-153-VCF)? Or is this the .txt.gz file from http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/${VERSION}.txt.gz. If the latter, then what should be used for the bed file? The documentation https://gatk.broadinstitute.org/hc/en-us/articles/360036509212-HaplotypeCaller is not very clear in terms of formatting.

ADD REPLY
0
Entering edit mode

The reason I asked about the --dbsnp argument is because I don't know what to use for this. Is this an input vcf file (https://gatk.broadinstitute.org/hc/en-us/community/posts/360062537671-GATK-4-1-7-0-does-not-annotate-ID-using-dbSNP-build-153-VCF)?

Yes - it is a VCF file. You can download the official dbsnp file from GATK resource bundle (https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle) and have it a check. Then you can reformat you SNP file to VCF and give it to HaplotypeCaller.

ADD REPLY

Login before adding your answer.

Traffic: 2013 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6