Question: How to do genome-guided de novo assembly? If not, how to align the de novo assembled bam to a particular reference genome?
kirannbishwa01 wrote, 3.1 years ago (United States):

I have two different questions which may sound similar, but I think are different:

How to do genome-guided de novo assembly? How to align the de novo assembled bam to a particular reference genome?

Explanation of the problem in detail: I have aligned fastq reads (100 bp, paired-end) from several samples, from different individuals of different populations, to a reference genome using BWA-MEM.

The initial alignment percentage was good (about 80-90% of reads aligned). I then did hard filtering on the aligned reads, removing:

  • reads from regions with high coverage (above the 97.5th percentile of the coverage distribution)
  • reads whose mates are mapped to a different chromosome
  • reads that contain het sites for all the samples
  • reads that have low mapQ
  • and several other stringent filters

This filtered out about 30-40% of the aligned reads from each sample. Now, I am wondering if I can:

  • take these filtered-out reads (as bam) and convert them to fastq
  • run a de novo assembly on these fastq files (genome-guided or unguided?)
  • or merge/align the de novo assembled bam to a reference

I think this should be able to take care of paralogous alignments and, to a certain extent, identify big indels and chromosomal changes. Has anyone tried this? With which software might this be possible?
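The bam-to-fastq, assembly, and align-back idea sketched above could look roughly like the following. This is only an illustration: every file name is a placeholder, and samtools, SPAdes, and minimap2 are assumed to be installed as example choices of toolkit, assembler, and aligner, not tools prescribed anywhere in this thread.

```shell
# Sketch only; file names are hypothetical and the tools must be installed.

# 1. Name-sort the filtered bam so mates are adjacent, then extract fastq.
samtools sort -n -o filtered.namesorted.bam filtered.bam
samtools fastq -1 filtered_R1.fq -2 filtered_R2.fq -s filtered_single.fq \
    filtered.namesorted.bam

# 2. De novo assemble the extracted reads (unguided; SPAdes as one option).
spades.py -1 filtered_R1.fq -2 filtered_R2.fq -o spades_out

# 3. Align the resulting contigs (not the bam itself) back to the reference.
minimap2 -ax asm5 reference.fa spades_out/contigs.fasta > contigs_vs_ref.sam
```

Note that it is the assembled contigs (fasta) that get aligned back to the reference, not a bam.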

Thanks,

modified 3.1 years ago by h.mon • written 3.1 years ago by kirannbishwa01

The initial alignment percentage was good (about 80-90% of reads aligned). I then did a hard filtering on the aligned reads by removing the reads:

  • with high coverage (97.5th percentile coverage)
  • reads where mates are mapped to a different chromosome
  • reads that contain het sites for all the samples
  • and several other stringent filters

Why did you do these operations?

written 3.1 years ago by Brian Bushnell

Hi @Brian:

If the filtering steps aren't clear, this is what I did:

  • after doing the initial alignment using BWA-MEM, I plotted the coverage of the alignment throughout the genome for all the chromosomes. This is how I plotted the coverage: C: Plotting the coverage and depth (Y-axis) statistics of a bam file along the geno
  • then I created a bed coverage map for each chromosome for each sample, and filtered the aligned reads where the coverage was above the 97.5th percentile of the coverage distribution.
  • I also filtered low-mapQ reads (below mapQ 30).

These kinds of alignment regions are generally contaminated with paralogous alignments, so I had to do it. But I don't want to throw the data away. I would rather run a de novo assembly (using all the filtered-out data), create a long consensus sequence out of it, and then blast that sequence against several different closely related species to see what's going on.
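The percentile-cutoff step described above can be prototyped with plain coreutils and awk. This is a minimal sketch under fabricated data: the toy depth table stands in for real `samtools depth`-style output (chrom, pos, depth), and the nearest-rank percentile calculation is one simple choice among several.

```shell
# Fabricate a toy depth table; real input would come from something like:
#   samtools depth -a aln.bam > depth.txt
seq 1 1000 | awk '{print "chr1", NR, $1}' > depth.txt

# Compute an approximate nearest-rank 97.5th-percentile depth cutoff.
cutoff=$(cut -d' ' -f3 depth.txt | sort -n | awk '
  { v[NR] = $1 }
  END {
    idx = int(0.975 * NR + 0.5)   # rank of the 97.5th percentile
    if (idx < 1)  idx = 1
    if (idx > NR) idx = NR
    print v[idx]
  }')
echo "97.5th percentile depth cutoff: $cutoff"
```

Regions whose depth exceeds `$cutoff` could then be turned into a bed file and used for filtering, e.g. with bedtools.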

written 3.1 years ago by kirannbishwa01

These kinds of alignment regions are generally contaminated with paralogous alignments, so I had to do it.

I'm not so sure about that. I'd suggest assembling with the raw data, after adapter-trimming, quality-trimming, and error-correction. Though I'd be interested in seeing any evidence you have regarding that point.

written 3.1 years ago by Brian Bushnell

Doesn't the extremely high coverage in a particular region indicate misalignment of the paralogous reads?

written 3.1 years ago by kirannbishwa01

Not necessarily... high coverage might indicate a misassembly, or a structural variation, or simply normal Illumina bias. If you filter out regions over the 97.5th percentile of coverage, you are guaranteed to lose 2.5% of your genome, whether it's correct or not, and whether the coverage is super-high or not. It does not seem to me to be a good strategy. Also, mapq 30 is an odd filter to apply. Variant callers already take mapq into account, and 30 is pretty high. I would expect filtering on mapq 30 to yield extreme ref bias, which makes variant calling much less accurate.

written 3.1 years ago by Brian Bushnell

So, what is your strategy for dealing with paralogous alignments? What is your choice of filtering strategy after the alignment step?

Also, I am not talking about the human genome here, which is mostly homozygous. Since you have more experience with alignment: how should the filtering strategy change when working with human-like data (primates) vs. other animals vs. plants?

Any suggestions appreciated.

Thanks,

written 3.1 years ago by kirannbishwa01

What organism are you working with?

written 3.1 years ago by Brian Bushnell

Arabidopsis lyrata.

written 3.1 years ago by kirannbishwa01

@Brian: So, any suggestions?

written 3.1 years ago by kirannbishwa01

Generally, I would just map and call variants, and use the variant quality score as the primary filter criterion. You may want to apply a mapq filter, but not that high. And I'm not convinced at all that a blanket high-coverage filter is a good idea. Or at least, not for 2.5% of the genome.

I find that it's helpful to look at the variants with IGV and Excel to see if there are regions with dense mismatches with low-quality-score variants, or other suspicious features (and yes, super-high coverage is suspicious, particularly when that region contains SNPs at strange allele ratios, like 1:5 var:ref in a diploid). When you spot them you can manually remove everything from that region, or look at the components of the variant quality (such as the strand ratio, average quality score, position of the variants in the read, etc) to see if there is some specific feature you can use to get rid of a lot of false positives automatically. This kind of depends on the data; the universal rules that work well are probably already baked into variant callers as defaults.
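The allele-ratio check described here can be prototyped in a few lines of awk. This is a minimal sketch, assuming a made-up four-column table of per-site ref/alt read counts (in practice these would come from a VCF's AD field), and the 0.25-0.75 balance window is an arbitrary illustrative threshold for a diploid heterozygote.

```shell
# Sketch only: toy stand-in for VCF allele-depth data.
# Columns: chrom  pos  ref_count  alt_count
cat > sites.txt <<'EOF'
chr1 100 10 9
chr1 200 50 10
chr1 300 12 11
EOF

# Flag sites whose alt-allele balance is far from the ~0.5 expected for a
# true heterozygote in a diploid (e.g. a ~1:5 var:ref site gets flagged).
awk -v lo=0.25 -v hi=0.75 '{
  ab = $4 / ($3 + $4)
  status = (ab < lo || ab > hi) ? "SUSPICIOUS" : "OK"
  print $1, $2, ab, status
}' sites.txt
```

The same idea extends to the other components Brian mentions (strand ratio, base quality, position in read), each of which a real variant caller already exposes in its annotations.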

modified 3.1 years ago • written 3.1 years ago by Brian Bushnell

Thanks for the info, Brian. Actually, I have also looked at the data with IGV, and saw that regions with high coverage have too many clustered SNPs, which made me suspect those regions might be ill-aligned. I realize that using the bed file for coverage will remove every read that touches such a region (thus losing more than just the high-coverage sites).

I might change my cutoff to somewhere around 1% and use allele ratio as another guide.

Most of my samples have similar patterns of coverage, as shown in this question/tutorial. There is no alignment at/around the centromere. So, I am wondering if the aligner forces those reads to align to other, similar regions.

But, thanks for the suggestions!

written 3.1 years ago by kirannbishwa01

I want to blast this read with several different close species to see what's going on.

This seems like an XY problem. What do you really want to do? What is the question you are interested in?

How to align the de novo assembled bam to a particular reference genome?

A bam is most commonly the output of mapping, not assembly. And you don't usually (ever?) align a bam to a fasta.

written 3.1 years ago by h.mon

I am actually thinking about assembling the reads that were filtered out, then collapsing the assembled reads into single consensus sequences, which could be longer, so that I can blast those sequences across different species.
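The cross-species blast step of this plan could be sketched as below. This is only an illustration: all file and database names are hypothetical, contigs.fasta stands for the output of whatever assembler was used, and the NCBI BLAST+ suite is assumed to be installed.

```shell
# Sketch only; names are placeholders and BLAST+ must be installed.

# Build a nucleotide database from sequences of several closely related species.
makeblastdb -in related_species.fa -dbtype nucl -out related_db

# Blast the assembled consensus contigs against that database.
blastn -query contigs.fasta -db related_db \
       -outfmt '6 qseqid sseqid pident length evalue bitscore' \
       -out contigs_vs_relatives.tsv
```

The tabular output (format 6) makes it easy to sort hits by identity or bitscore per contig afterwards.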

written 3.1 years ago by kirannbishwa01
h.mon wrote, 3.1 years ago (Brazil):

How to do genome-guided de novo assembly?

I didn't notice this before, but "genome guided" is one thing, "de novo" is another. (scratch that, there are genome-guided de novo strategies.)

As Brian Bushnell told you - and as your results suggest - your filtering strategies are too aggressive. You will most likely get shorter assemblies than with your full data. I think they will contain fewer mis-assemblies, but will also miss some correct assemblies.

I have never worked with the Arabidopsis lyrata genome, but from the paper it seems to be of good quality. Unless you have a good mix of libraries (good coverage per sample, paired-end and some large mate-pair), a de novo assembly will be less contiguous than the reference genome. If you have only paired-end data, your assemblies will be very fragmented; there is no way around it.

So I suggest you map the samples against the A. lyrata reference genome, and use this as a starting point to search for variation, both small (SNPs and indels) and large (larger deletions/duplications, structural variations). There are several software packages for that.
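A bare-bones version of that map-then-call route might look as follows, with bcftools as one example small-variant caller. File names are placeholders, the tools must be installed, and a dedicated SV caller would still be needed for the large variants.

```shell
# Sketch only; file names are placeholders and bwa/samtools/bcftools
# must be installed.
bwa index reference.fa
bwa mem reference.fa sample_R1.fq sample_R2.fq | samtools sort -o sample.bam
samtools index sample.bam

# Call small variants (SNPs and indels); bcftools is one example caller.
bcftools mpileup -f reference.fa sample.bam | bcftools call -mv -Oz -o sample.vcf.gz

# Larger deletions/duplications and other structural variants need a
# dedicated SV caller (e.g. delly or manta) run on the same bam.
```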

modified 3.1 years ago • written 3.1 years ago by h.mon

Thanks @h.mon.

Actually, I have already done mapping against the A. lyrata reference using the GATK pipeline. I had to do aggressive filtering just to minimize wrong alignments. But I didn't want to lose so much data (about 40%).

Do you think updating the reference using the homozygous variants (SNPs, and indels up to a certain length, like 20 bp) present in all samples, and realigning the reads to this updated reference, might be a good idea? Would this be fruitful, or am I better off just doing moderate filtering?
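One concrete way to try the reference-updating idea is bcftools consensus. The sketch below is only an illustration: the VCF name is a placeholder, and the filtering expression (keep sites homozygous-alt in every sample) is an assumed stand-in for whatever "present in all samples" criterion is actually wanted.

```shell
# Sketch only; assumes a merged, bgzipped, indexed multi-sample VCF.
# Keep sites that are homozygous-alt in every sample (illustrative filter).
bcftools view -i 'COUNT(GT="AA")=N_SAMPLES' all_samples.vcf.gz -Oz -o shared_hom.vcf.gz
bcftools index shared_hom.vcf.gz

# Apply those variants to the reference, then realign against the result.
bcftools consensus -f reference.fa shared_hom.vcf.gz > updated_reference.fa
bwa index updated_reference.fa
```

A length cap on indels (e.g. 20 bp, as asked) could be added to the filtering expression as well before building the consensus.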

written 3.1 years ago by kirannbishwa01
Powered by Biostar version 2.3.0