Extracting Read and mates having variant positions Using BAM file vcf file in Python using pysam/ bamnostics
Entering edit mode
11 weeks ago
Hassan • 0

I am using Python and using PYSAM and PyVCF libraries for data handling.


Suppose I have a bam file and a vcf file containing variant calling results. I want to extract reads and mates having variants in it. Suppose There is a variant on the read and variant in the mate, I want to extract the such reads and discard all those reads having no variant positions or their mate is not having the variant( even though the reads have the variants but mate has no variant). After that, I want to calculate the count of other read_mate patterns generated from the results.

Complete question:

We have to use VCF file to search bam file, if the position on vcf file matches with reads in the bam files then check if on the same position if the same alt is present as that of vcf, if both are same then consider it and save and move on.

After that doing with all the data of the genes,(gene by gene), we will save all those reads and mates having vcf positions in it. Read and mates having vcf positions will be saved gene by gene in a variable, information should be saved as following:

initially an output file with gene name, contig name : <position,alt base>, <Position,alt base> and so on for each read and mate.

Next step after saving information of those reads and mates only having vcf positions on them, note: (discard those reads whose mates have no vcf positions), next step is counting for the read supports, it means that if the read and mate both have information means vcf pos, and same alt as on vcf, then count have many other reads and mates have 100% same positions and 100% same alt, and print it as read support.

Next step:

Merge the base patterns and noise:

Lets say for example we have a following patterns

                        pos  alt  pos   alt  pos   alt post   alt  pos alt  pos  alt  pos alt
(read,mate pattern):    55    G   58    C     63    G   75     T    87   A   87   T   95   A
(read,mate pattern):                    C     63    G   75     T    87   A   87   T   95   A
(read,mate pattern):    55    G   58    C     63    G   75     
(read,mate pattern):    55    G   58    C     63    G   75     T    87   A   87   T   95   A


                        pos  alt  pos   alt  pos   alt post   alt  pos alt  pos  alt  pos alt
(filtered pattern):     55    G   58    C     63    G   75     T    87   A   87   T   95   A  

Bottom Line:

I want to get only these reads and mates having vcf positions and make a pattern for each read and its mate having vcf positions. And discard all others without vcf position.

I am facing some problem in mapping and don't know how to use cigar string to map my data and compare DNA sequence with it after matching with vcf position to check if it has the base or not. Any help will be appreciated.


NGS_DATA PYSAM Python BAM_FILE • 194 views
Entering edit mode

Login before adding your answer.

Traffic: 1448 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6