FAIRE for non-standard cultivar, mapping to reference, MAPQ
0
0
Entering edit mode
3.1 years ago
boczniak767 ▴ 850

Hi,

I have FAIRE-seq data from two cultivars: reference B73 and less known S68. I assumed I can align reads from both cultivars to B73 reference sequence, which is standard. I used bbmap with default settings.

I've seen, that MAPQ>10 read filtering is frequently applied. The problem is, that (I think) too low percentage of reads pass this filter. It is ca. 50-60% for B73, which I assume should be mapped very well, and ca. 40-48% for S68.

I have folowing questions:

  1. What settings, for example for bbmap or bwa mem are good for alignment of chromatin-level data from non-standard cultivar to reference genome?
  2. What MAPQ threshold should I apply?
  3. I haven't filtered fastq files for base qualities (looks good in fastqc after adapter removal). Should I do this filtering?

Here are MAPQ plots from bamqc b73 cultivar enter image description here

s68 cultivar enter image description here

sequencing faire mapping • 2.9k views
ADD COMMENT
1
Entering edit mode

How long are your reads? There is usually no need to change defaults, FAIRE is just normal DNA-seq in the end. Could it be that large parts of the genome are quite low-complexity so repetitive?

ADD REPLY
0
Entering edit mode

My reads are 125bp single-end. Yes maize genome is highly repetitive, nearly 85% according to Schanble et al 2009. Good point, I was focused rather on existence of snp's and indels which certainly makes reads to map badly. I've read, that mapping to masked genome is not recommended. So, is the MAPQ>10 filter ok?

ADD REPLY
0
Entering edit mode

I use 20 for mouse and human but there is no special reason for 20 over 10, 15, 18.636261 etc. Try 10 and see whether results make sense I‘d say.

ADD REPLY
0
Entering edit mode

Thanks, now, as I look at my charts I realized there will be slight difference between 10 and 20. At least in terms of reads remaining.

ADD REPLY
0
Entering edit mode

Prior thread for reference discussion : bbmap for FAIRE-seq and other line than reference

I suggest that you run clumpify.sh (A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files ). It is possible that you have a large amount of sequence duplication. Even at 125 bp it is surprising that your reads are multi-mapping but then this is the maize genome.

ADD REPLY
0
Entering edit mode

Good idea, I thought that clumpify is just better compressing alghoritm. I've found Stampy, the authors claim it is good for mapping when snp's and indels are expected. I'll try several methods. I must say, I've found two publications (on maize and mosquito) with similar percent of filtered reads.

ADD REPLY
0
Entering edit mode

Real snp and indels is one thing but since you don't know the ground truth there is no good way for you to judge if bbmap (or any other aligner) is doing a reasonable job. If you allow for additional errors then you may have a worse problem with multi-mapping.

Post the results of clumpify and let us see if you have sequence duplicates that can be removed prior to alignment.

ADD REPLY
0
Entering edit mode

I've run ~/bin/bbmap/clumpify.sh in=hds35_clean.fastq.gz out=clumped35.fastq.gz groups=24

Below is the info printed on the screen.

In fact the cultivar I work on has been resequenced, so I have some info about variants. I planned to use this info after peak calling, to look for a nucleotide-level differences.

I've read about removal of optical duplicates but I didn't think it is a big concern. After all I'm looking for peaks, generated from regions with high number of reads. Does tools like clumpify cause peaks to disappear?

Time:                           201.288 seconds.
Reads Processed:        45647k  226.77k reads/sec
Bases Processed:         5464m  27.15m bases/sec
Reads In:             45647156
Clumps Formed:         9941632
Total time:     375.230 seconds.
ADD REPLY
1
Entering edit mode

Can you run clumpify.sh with the dedupe option on ORIGINAL data. Do not use trimmed reads. Do you have single-end reads or are these interleaved? If you have paired-end reads then please use both. You don't need to provide an out= file if you don't want to get the deduped reads. This will still produce the stats output you see above.Note: Depending on what sequencer this was sequenced on you will need to change the dist and spantiles options. Use optical dist=40 spantiles=f after making necessary changes. Take a look at in-line help.

I edited the full log to keep stat part we need above.

ADD REPLY
0
Entering edit mode

/thanks for tidying-up my post, I can't figure out how to paste such text properly. Here is the ouptut with original data and dedupe. I have reads from Illumina, in this case it is single-end.

Time:                           52.844 seconds.
Reads Processed:        45661k  864.08k reads/sec
Bases Processed:         5707m  108.01m bases/sec
Reads In:             45661691
Clumps Formed:        10449866
Duplicates Found:      6614639
Reads Out:            39047052
Bases Out:          4880881500
Total time:     470.849 seconds.
ADD REPLY
1
Entering edit mode

So you don't have a large fraction of duplicate reads but there are some (~15%). Because you have single end data that is probably contributing to not having dual anchors to place the reads on genome. You could use de-duped sequences and see if that helps improve mapq values to some extent.

Part of post you want to present with mono-spaced font can be selected by highlighting and then you can use the 101010 button in top edit bar to format that part as code.

ADD REPLY
0
Entering edit mode

So you propose that I dedupe raw reads using above command, trim deduped file and use it to mapping?

Doesn't that compromise peak-calling?

ADD REPLY
0
Entering edit mode

Without UMI's there is no definite way to confirm if those 15% sequence duplicates are PCR or otherwise. Since you were worried about mapq being low we have been discussing these possibilities to see if that could be improved.

If you have a different option to try at this point then try that out but this may be the last thing to try with BBMap.

ADD REPLY
0
Entering edit mode

OK, thanks for help guys, in the end I'm not that worried about mapq. I'll experiment with some aligners and see what works best.

Nevertheless I've trimmed deduped fastq file and mapped it using default setings of bbmap. Here is the result of bamqc deduped

ADD REPLY
0
Entering edit mode

As I expected there is some improvement in mapq. This is a difficult problem to tackle. Let us know if you fare better with other aligners though I suspect this may be one of the better results.

ADD REPLY
0
Entering edit mode

What you could do is to take the reference fasta file you have, simulate reads of the same length and then see what the alignment rate is, so whether this low MAPQ / multimapping problem is indeed a consequence of the repetitiveness of the reference, or due to some artifacts in the samples you have.

ADD REPLY
0
Entering edit mode

Ok, so I've tried bwa, bwa + realignment with stampy (option --bamkeepgoodreads). The bamqc plot looks somehow 'smeared' in the low mapq range with ca. 10% less high scoring alignments. stampy helped somehow, but in the end bbduk gives better results, at least in term of % of high scoring mappings. Ah, and it is much faster. I think I'll rely on bbduk with maxindel=20 ambig=random, maybe I'll also look at peak-calling results on alignment by bwa + stampy. As of simulations, unfortunately I don't have time to try it, I think I've already delved too much in aligners.

ADD REPLY

Login before adding your answer.

Traffic: 3220 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6