Question: Pulling out the reads associated with a vcf line entry.
gravatar for anthony.nash
11 weeks ago by
anthony.nash0 wrote:

I would appreciate some help. I'm a bit of a newbie in terms of bioinformatics and I am working remotely from a team, so turning to Biostars has been my last resort after trying exhaustively to figure out the problem on my own.

I have been given .sam, .bam., bai and all the necessary human genome reference files, along with a .vcf file. Opening up the VCF file I can see how they are mapping to the genome, so I kind of understand the format of the file.

I wrote some code that would systematically go through the VCF file, using each line in turn to build a new single-data-line .vcf (whilst maintaining meta and header lines) file then run that through GATK (3.8.0) to retrieve the reads associated with the single VCF line entry. It would then move on to the next VCF data line, then the next etc... This had been working. Unfortunately, remotely I've received a new set of data files (bam, sam, vcf etc.), and GATK is now throwing the error:

##### ERROR MESSAGE: File associated with name ../mut1_test/OUTPUT_4_S4/temp.vcf is malformed: Problem reading the interval file caused by Badly formed genome location: Contig NC_000001.11 given as location, but this contig isn't present in the Fasta sequence dictionary
##### ERROR ------------------------------------------------------------------------------------------

The command I am doing to pull out the reads per VCF data line is:

java -jar /home/mint/GenomeAnalysisTK-3.8-0/GenomeAnalysisTK.jar -T PrintReads -L ../mut01_test/4_S4.raw_snps.vcf -o ../mut01_test/OUTPUT_4_S4/tempReadsFromSNP.bam -I ../mut01_test/4_S4.realigned_reads.bam -R ../hg38_GRCh38.p12_allChr.fa

Going over the last data delivery I notice some very suitable difference in the VCF files.

This works (old delivery):

ref|NT_032977.10|       43494   .       T       C       62.74   .       AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.37;SOR=0.693 GT:AD:DP:GQ:PL  1/1:0,2:2:6:90,6,0

This does not work (latest deliver)

NC_000001.11    88723   .       A       T       859.77  .       AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=46.82;QD=28.66;SOR=5.421        GT:AD:DP:GQ:PL  1/1:0,30:30:90:888,90,0

I've looked for hidden characters etc., and purely based on software engineering experience (hence your much needed and appreciated help) I can't see why the new VCF files are "malformed".

Many thanks Anthony

snp • 272 views
ADD COMMENTlink modified 25 days ago by Vitis1.6k • written 11 weeks ago by anthony.nash0
gravatar for Pierre Lindenbaum
11 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum114k wrote:

NC_000001.11 given as location, but this contig isn't present in the Fasta sequence dictionary

it's a common problem. the variant is mapped to a contig that doesn't exist in the reference file. To fix this, you need to either obtain the reference that was originally used to align the data,

Pulling out the reads associated with a vcf line entry.

why ? what do you want to do.

ADD COMMENTlink written 11 weeks ago by Pierre Lindenbaum114k

Pierre, thank you for this information. I am guessing I need to go back to the person who generated the VCF file and ensure we are using the same reference file (i.e., hg38_GRCh38.p12_allChr.fa)?

ADD REPLYlink written 11 weeks ago by anthony.nash0

That would indeed be helpful to work on the exact same reference.

ADD REPLYlink written 11 weeks ago by cschu1811.4k
gravatar for Vitis
25 days ago by
New York
Vitis1.6k wrote:

For your original purpose, using GATK is perfectly fine, might even be a bit overkill. Pysam has a "pileup engine", which would fetch all reads on top of a specific position and allow querying all information of the reads. In this way, you go through the VCF and fetch the reads covering the variant sites one by one, and avoid generating individual one-variant VCFs.

ADD COMMENTlink written 25 days ago by Vitis1.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 770 users visited in the last hour