Question

Pulling out the reads associated with a vcf line entry.

0

Entering edit mode

5.6 years ago

TonyCN ▴ 60

I would appreciate some help. I'm a bit of a newbie in terms of bioinformatics and I am working remotely from a team, so turning to Biostars has been my last resort after trying exhaustively to figure out the problem on my own.

I have been given .sam, .bam., bai and all the necessary human genome reference files, along with a .vcf file. Opening up the VCF file I can see how they are mapping to the genome, so I kind of understand the format of the file.

I wrote some code that would systematically go through the VCF file, using each line in turn to build a new single-data-line .vcf (whilst maintaining meta and header lines) file then run that through GATK (3.8.0) to retrieve the reads associated with the single VCF line entry. It would then move on to the next VCF data line, then the next etc... This had been working. Unfortunately, remotely I've received a new set of data files (bam, sam, vcf etc.), and GATK is now throwing the error:

##### ERROR MESSAGE: File associated with name ../mut1_test/OUTPUT_4_S4/temp.vcf is malformed: Problem reading the interval file caused by Badly formed genome location: Contig NC_000001.11 given as location, but this contig isn't present in the Fasta sequence dictionary
##### ERROR ------------------------------------------------------------------------------------------

The command I am doing to pull out the reads per VCF data line is:

java -jar /home/mint/GenomeAnalysisTK-3.8-0/GenomeAnalysisTK.jar -T PrintReads -L ../mut01_test/4_S4.raw_snps.vcf -o ../mut01_test/OUTPUT_4_S4/tempReadsFromSNP.bam -I ../mut01_test/4_S4.realigned_reads.bam -R ../hg38_GRCh38.p12_allChr.fa

Going over the last data delivery I notice some very suitable difference in the VCF files.

This works (old delivery):

ref|NT_032977.10|       43494   .       T       C       62.74   .       AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.37;SOR=0.693 GT:AD:DP:GQ:PL  1/1:0,2:2:6:90,6,0

This does not work (latest deliver)

NC_000001.11    88723   .       A       T       859.77  .       AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=46.82;QD=28.66;SOR=5.421        GT:AD:DP:GQ:PL  1/1:0,30:30:90:888,90,0

I've looked for hidden characters etc., and purely based on software engineering experience (hence your much needed and appreciated help) I can't see why the new VCF files are "malformed".

Many thanks Anthony

SNP • 1.7k views

ADD COMMENT • link updated 5.5 years ago by Vitis ★ 2.5k • written 5.6 years ago by TonyCN ▴ 60

score 2 · Answer 1 · 2018-08-26

2

Entering edit mode

5.6 years ago

Pierre Lindenbaum 161k

NC_000001.11 given as location, but this contig isn't present in the Fasta sequence dictionary

it's a common problem. the variant is mapped to a contig that doesn't exist in the reference file. To fix this, you need to either obtain the reference that was originally used to align the data,

Pulling out the reads associated with a vcf line entry.

why ? what do you want to do.

ADD COMMENT • link 5.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Pierre, thank you for this information. I am guessing I need to go back to the person who generated the VCF file and ensure we are using the same reference file (i.e., hg38_GRCh38.p12_allChr.fa)?

ADD REPLY • link 5.6 years ago by TonyCN ▴ 60

1

Entering edit mode

That would indeed be helpful to work on the exact same reference.

ADD REPLY • link 5.6 years ago by cschu181 ★ 2.8k

score 0 · Answer 2 · 2018-10-23

For your original purpose, using GATK is perfectly fine, might even be a bit overkill. Pysam has a "pileup engine", which would fetch all reads on top of a specific position and allow querying all information of the reads. In this way, you go through the VCF and fetch the reads covering the variant sites one by one, and avoid generating individual one-variant VCFs.

https://pysam.readthedocs.io/en/latest/