I want to extract the
PI information from all the samples of a vcf. I have python script and reading throught
pyVCF module I can extract information from one sample but not from all.
And, I am looking for the solution scrictly in python using pyvcf. I can probably read the file (vcf) as txt and parse through each line but I am hoping that
pyVCF can be used (which I prefer).
Below is the structure of my vcf file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 2ms01e 2ms02g 2ms03g 2ms04h 2 1738 . A G 4693.24 PASS AC=2;AF=0.250;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW 0|1:389,92:481:99:.,.,.,.,.:1.0:0|1:1020:1748,0,12243:0|1 0/0:318,0:318:99:.:.:0/0:.:0,120,1800:0/0 0|1:270,53:323:99:.,.,.,.,.:1.0:0|1:1258:990,0,9096:0|1 0/0:473,0:473:99:.:.:0/0:.:0,120,1800:0/0 2 1764 . A C 51892.85 PASS AC=5;AF=0.625;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW 1|0:102,415:517:99:.,.,.,.,.:1.0:1|0:1020:12332,0,2817:1|0 1/1:0,356:356:99:.:.:1/1:.:12587,1069,0:1/1 1|0:65,301:366:99:.,.,.,.,.:1.0:1|0:1258:9337,0,1279:1|0 0/1:281,353:634:99:.:.:0/1:.:10325,0,7548:0/1 2 1921 . T C 4465.03 PASS AC=0;AF=0.00;AN=6;set=Intersection GT:AD:DP:GQ:PG:PL:PW 0/0:1,0:1:3:0/0:0,3,35:0/0 ./.:0,0:0:.:./.:0,0,0:./. 0/0:1,0:1:3:0/0:0,3,39:0/0 0/0:2,0:2:6:0/0:0,6,80:0/0
Below is the python script I wrote:
vcf_file = vcf.Reader(open('RNAseq.phased_variants.Final_sub.vcf', 'r')) for record in vcf_file: contig = record.CHROM pos = record.POS ref_allele = record.REF alt_alleles = ",".join(map(str, (record.ALT[::]))) genotype_2ms02g = record.genotype('2ms02g')['GT'] phased_2ms02g = record.genotype('2ms02g')['PG'] PI_2ms02g = record.genotype('2ms02g')['PI'] call_2ms02g = record.genotype('2ms02g') # calling the record values of the sample 2ms02g gt_bases_2ms02g = call_2ms02g.gt_bases #This reports the genotype as letters (nucleotide code)
I can repeat the procedure for other samples as:
genotype_2ms01e = record.genotype('2ms01e')['GT'] phased_2ms01e = record.genotype('2ms01e')['PG'] PI_2ms01e = record.genotype('2ms01e')['PI'] call_2ms01e = record.genotype('2ms01e') gt_bases_2ms01e = call_2ms01e.gt_bases
But, I want to call these values from all the samples and write it as a table. vcf file may have many samples and I want to call the values from all the samples without doing so for each sample separately.
Finally I want to write the variables as tables
output = open("2ms02gVcf_to_table-Markov.txt", "a") output.write(contig + '\t' + str(pos) + '\t' + ref_allele + '\t' + str(alt_alleles) + '\t' + genotype_2ms01e + '\t' + phased_2ms01e +'\t' + str(PI_2ms01e) + '\t'+ gt_bases_2ms01e + '\t' + same for other samples........ +'\n') output.close()
Not, all sites (or lines) have
I wanted to throw an
exception error using (by adding below script) but its not working.
try: phased_genotype = record.genotype('2ms02g')['PG'] phase_block_index = record.genotype('2ms02g')['PI'] except IndexError: phased_genotype = '.' phase_block_index = '.'
Thank you advance !