Question: Mpileup generation using Bisulfite converted reference
gravatar for kspata
12 months ago by
kspata50 wrote:

I have a bisulfite converted reference genome (used bismark_genome_preparation on the reference file). Can I use this bisulfite converted reference genome to create index file using samtools faidx? I want to create a samtools mpileup file for downstream analysis.

The bismark_genome_preparation creates following folders and files:

Bisulfite-Genome: 1. CT_conversion -BS_CT.1.bt2 -BS_CT.3.bt2 -BS_CT.rev.1.bt2 -BS_CT.2.bt2 - BS_CT.rev.2.bt2 - BS_CT.4.bt2 - genome_mfa.CT_conversion.fa 2. GA_conversion -BS_GA.1.bt2 -BS_GA.3.bt2 -BS_GA.rev.1.bt2 -BS_GA.2.bt2 - BS_GA.rev.2.bt2 - BS_GA.4.bt2 - genome_mfa.GA_conversion.fa

Can I use the files genome_mfa.CT_conversion.fa and genome_mfa.GA_conversion.fa to create index files using samtools faidx command?

My other question is

How to generate mpileup files from these faidx reference files? Because it will generate two mpileup files for each reference sequence? Can I then merge two mpileup files for downstream analysis?

 samtools mpileup -d <integer> -f <ref1>  <input.bam> >  output_1.mpileup
 samtools mpileup -d <integer> -f <ref2>  <input.bam> >  output_2.mpileup

Can I then merge the two mpileup output files?

Is there any other approach?

Thanks In advance.

ADD COMMENTlink written 12 months ago by kspata50

That would be a yes on the first question. As for the second question, that would be a bit more dedicated. I assume, you'd need to check each position from both mpileups and convert the information so that the variant information conforms to the original reference (for mpileup the CT and AG converted references are two different references, so the information provided in the files is not compatible.)

What I am wondering, though - why don't you use bismark and its downstream analysis? The whole genome methylation report will then give you exactly what you seemingly want to have (at least if I infer it correctly from your question.)

Edit: bismark-comment

ADD REPLYlink modified 12 months ago • written 12 months ago by cschu1811.7k

Thank you for this insight. I really appreciate it. I did use bismark pipeline for downstream analysis instead of the mpileup generation.

I have another question where I am studying the methylation status of a gene DMPK. For this purpose is it better to map samples across whole genome reference sequence (hg38) or only DMPK gene region?

I ran my samples with both and observed that coverage after mapping to entire human genome is very less (<7X) and when I map same samples with DMPK gene sequence the coverage increases to upto 100X. The samples were sequenced on MiSeq and are bisulfite converted amplicons.

Which one of the two approaches is correct?

ADD REPLYlink written 12 months ago by kspata50

I think the way to go here is to map against the whole genome. The aligner tries to map as many reads as possible, so if there is some similarity between your gene region and other regions in the genome, it might just be good enough to (mis-)align reads from those other regions to your target region. If you map to the whole genome, those reads will have better alignment scores against their original regions and will therefore not align to your target, which would explain your low coverage.

ADD REPLYlink written 12 months ago by cschu1811.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 961 users visited in the last hour