I am using Python and using PYSAM and PyVCF libraries for data handling.
Suppose I have a bam file and a vcf file containing variant calling results. I want to extract reads and mates having variants in it. Suppose There is a variant on the read and variant in the mate, I want to extract the such reads and discard all those reads having no variant positions or their mate is not having the variant( even though the reads have the variants but mate has no variant). After that, I want to calculate the count of other read_mate patterns generated from the results.
We have to use VCF file to search bam file, if the position on vcf file matches with reads in the bam files then check if on the same position if the same alt is present as that of vcf, if both are same then consider it and save and move on.
After that doing with all the data of the genes,(gene by gene), we will save all those reads and mates having vcf positions in it. Read and mates having vcf positions will be saved gene by gene in a variable, information should be saved as following:
initially an output file with
gene name, contig name : <position,alt base>, <Position,alt base> and so on for each read and mate.
Next step after saving information of those reads and mates only having vcf positions on them, note: (discard those reads whose mates have no vcf positions), next step is counting for the read supports, it means that if the read and mate both have information means vcf pos, and same alt as on vcf, then count have many other reads and mates have 100% same positions and 100% same alt, and print it as read support.
Merge the base patterns and noise:
Lets say for example we have a following patterns
pos alt pos alt pos alt post alt pos alt pos alt pos alt (read,mate pattern): 55 G 58 C 63 G 75 T 87 A 87 T 95 A (read,mate pattern): C 63 G 75 T 87 A 87 T 95 A (read,mate pattern): 55 G 58 C 63 G 75 (read,mate pattern): 55 G 58 C 63 G 75 T 87 A 87 T 95 A
NOW ALL THESE PATTERSN SHOULD BE MERGER INTO ONE BIGGER PATTERN:
pos alt pos alt pos alt post alt pos alt pos alt pos alt (filtered pattern): 55 G 58 C 63 G 75 T 87 A 87 T 95 A
I want to get only these reads and mates having vcf positions and make a pattern for each read and its mate having vcf positions. And discard all others without vcf position.
I am facing some problem in mapping and don't know how to use cigar string to map my data and compare DNA sequence with it after matching with vcf position to check if it has the base or not. Any help will be appreciated.