Question: Pulling out the reads associated with a vcf line entry.
gravatar for TonyCN
2.2 years ago by
UK/Oxford/University of Oxford
TonyCN30 wrote:

I would appreciate some help. I'm a bit of a newbie in terms of bioinformatics and I am working remotely from a team, so turning to Biostars has been my last resort after trying exhaustively to figure out the problem on my own.

I have been given .sam, .bam., bai and all the necessary human genome reference files, along with a .vcf file. Opening up the VCF file I can see how they are mapping to the genome, so I kind of understand the format of the file.

I wrote some code that would systematically go through the VCF file, using each line in turn to build a new single-data-line .vcf (whilst maintaining meta and header lines) file then run that through GATK (3.8.0) to retrieve the reads associated with the single VCF line entry. It would then move on to the next VCF data line, then the next etc... This had been working. Unfortunately, remotely I've received a new set of data files (bam, sam, vcf etc.), and GATK is now throwing the error:

##### ERROR MESSAGE: File associated with name ../mut1_test/OUTPUT_4_S4/temp.vcf is malformed: Problem reading the interval file caused by Badly formed genome location: Contig NC_000001.11 given as location, but this contig isn't present in the Fasta sequence dictionary
##### ERROR ------------------------------------------------------------------------------------------

The command I am doing to pull out the reads per VCF data line is:

java -jar /home/mint/GenomeAnalysisTK-3.8-0/GenomeAnalysisTK.jar -T PrintReads -L ../mut01_test/4_S4.raw_snps.vcf -o ../mut01_test/OUTPUT_4_S4/tempReadsFromSNP.bam -I ../mut01_test/4_S4.realigned_reads.bam -R ../hg38_GRCh38.p12_allChr.fa

Going over the last data delivery I notice some very suitable difference in the VCF files.

This works (old delivery):

ref|NT_032977.10|       43494   .       T       C       62.74   .       AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.37;SOR=0.693 GT:AD:DP:GQ:PL  1/1:0,2:2:6:90,6,0

This does not work (latest deliver)

NC_000001.11    88723   .       A       T       859.77  .       AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=46.82;QD=28.66;SOR=5.421        GT:AD:DP:GQ:PL  1/1:0,30:30:90:888,90,0

I've looked for hidden characters etc., and purely based on software engineering experience (hence your much needed and appreciated help) I can't see why the new VCF files are "malformed".

Many thanks Anthony

snp • 798 views
ADD COMMENTlink modified 24 months ago by Vitis2.4k • written 2.2 years ago by TonyCN30
gravatar for Pierre Lindenbaum
2.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

NC_000001.11 given as location, but this contig isn't present in the Fasta sequence dictionary

it's a common problem. the variant is mapped to a contig that doesn't exist in the reference file. To fix this, you need to either obtain the reference that was originally used to align the data,

Pulling out the reads associated with a vcf line entry.

why ? what do you want to do.

ADD COMMENTlink written 2.2 years ago by Pierre Lindenbaum131k

Pierre, thank you for this information. I am guessing I need to go back to the person who generated the VCF file and ensure we are using the same reference file (i.e., hg38_GRCh38.p12_allChr.fa)?

ADD REPLYlink written 2.2 years ago by TonyCN30

That would indeed be helpful to work on the exact same reference.

ADD REPLYlink written 2.2 years ago by cschu1812.5k
gravatar for Vitis
24 months ago by
New York
Vitis2.4k wrote:

For your original purpose, using GATK is perfectly fine, might even be a bit overkill. Pysam has a "pileup engine", which would fetch all reads on top of a specific position and allow querying all information of the reads. In this way, you go through the VCF and fetch the reads covering the variant sites one by one, and avoid generating individual one-variant VCFs.

ADD COMMENTlink written 24 months ago by Vitis2.4k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1967 users visited in the last hour