Pulling out the reads associated with a vcf line entry.
2
0
Entering edit mode
5.6 years ago
TonyCN ▴ 60

I would appreciate some help. I'm a bit of a newbie in terms of bioinformatics and I am working remotely from a team, so turning to Biostars has been my last resort after trying exhaustively to figure out the problem on my own.

I have been given .sam, .bam., bai and all the necessary human genome reference files, along with a .vcf file. Opening up the VCF file I can see how they are mapping to the genome, so I kind of understand the format of the file.

I wrote some code that would systematically go through the VCF file, using each line in turn to build a new single-data-line .vcf (whilst maintaining meta and header lines) file then run that through GATK (3.8.0) to retrieve the reads associated with the single VCF line entry. It would then move on to the next VCF data line, then the next etc... This had been working. Unfortunately, remotely I've received a new set of data files (bam, sam, vcf etc.), and GATK is now throwing the error:

##### ERROR MESSAGE: File associated with name ../mut1_test/OUTPUT_4_S4/temp.vcf is malformed: Problem reading the interval file caused by Badly formed genome location: Contig NC_000001.11 given as location, but this contig isn't present in the Fasta sequence dictionary
##### ERROR ------------------------------------------------------------------------------------------

The command I am doing to pull out the reads per VCF data line is:

java -jar /home/mint/GenomeAnalysisTK-3.8-0/GenomeAnalysisTK.jar -T PrintReads -L ../mut01_test/4_S4.raw_snps.vcf -o ../mut01_test/OUTPUT_4_S4/tempReadsFromSNP.bam -I ../mut01_test/4_S4.realigned_reads.bam -R ../hg38_GRCh38.p12_allChr.fa

Going over the last data delivery I notice some very suitable difference in the VCF files.

This works (old delivery):

ref|NT_032977.10|       43494   .       T       C       62.74   .       AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.37;SOR=0.693 GT:AD:DP:GQ:PL  1/1:0,2:2:6:90,6,0

This does not work (latest deliver)

NC_000001.11    88723   .       A       T       859.77  .       AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=46.82;QD=28.66;SOR=5.421        GT:AD:DP:GQ:PL  1/1:0,30:30:90:888,90,0

I've looked for hidden characters etc., and purely based on software engineering experience (hence your much needed and appreciated help) I can't see why the new VCF files are "malformed".

Many thanks Anthony

SNP • 1.7k views
ADD COMMENT
2
Entering edit mode
5.6 years ago

NC_000001.11 given as location, but this contig isn't present in the Fasta sequence dictionary

it's a common problem. the variant is mapped to a contig that doesn't exist in the reference file. To fix this, you need to either obtain the reference that was originally used to align the data,

Pulling out the reads associated with a vcf line entry.

why ? what do you want to do.

ADD COMMENT
0
Entering edit mode

Pierre, thank you for this information. I am guessing I need to go back to the person who generated the VCF file and ensure we are using the same reference file (i.e., hg38_GRCh38.p12_allChr.fa)?

ADD REPLY
1
Entering edit mode

That would indeed be helpful to work on the exact same reference.

ADD REPLY
0
Entering edit mode
5.5 years ago
Vitis ★ 2.5k

For your original purpose, using GATK is perfectly fine, might even be a bit overkill. Pysam has a "pileup engine", which would fetch all reads on top of a specific position and allow querying all information of the reads. In this way, you go through the VCF and fetch the reads covering the variant sites one by one, and avoid generating individual one-variant VCFs.

https://pysam.readthedocs.io/en/latest/

ADD COMMENT

Login before adding your answer.

Traffic: 1929 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6