Entering edit mode
4.6 years ago
boczniak767
▴
880
Hi,
I have FAIRE-seq data from two cultivars: reference B73 and less known S68.
I assumed I can align reads from both cultivars to B73 reference sequence, which is standard.
I used bbmap with default settings.
I've seen, that MAPQ>10 read filtering is frequently applied. The problem is, that (I think) too low percentage of reads pass this filter. It is ca. 50-60% for B73, which I assume should be mapped very well, and ca. 40-48% for S68.
I have folowing questions:
- What settings, for example for
bbmaporbwa memare good for alignment of chromatin-level data from non-standard cultivar to reference genome? - What MAPQ threshold should I apply?
- I haven't filtered
fastqfiles for base qualities (looks good infastqcafter adapter removal). Should I do this filtering?
Here are MAPQ plots from bamqc
b73 cultivar
s68 cultivar
How long are your reads? There is usually no need to change defaults, FAIRE is just normal DNA-seq in the end. Could it be that large parts of the genome are quite low-complexity so repetitive?
My reads are 125bp single-end. Yes maize genome is highly repetitive, nearly 85% according to Schanble et al 2009. Good point, I was focused rather on existence of snp's and indels which certainly makes reads to map badly. I've read, that mapping to masked genome is not recommended. So, is the MAPQ>10 filter ok?
I use 20 for mouse and human but there is no special reason for 20 over 10, 15, 18.636261 etc. Try 10 and see whether results make sense I‘d say.
Thanks, now, as I look at my charts I realized there will be slight difference between 10 and 20. At least in terms of reads remaining.
Prior thread for reference discussion : bbmap for FAIRE-seq and other line than reference
I suggest that you run
clumpify.sh(A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files ). It is possible that you have a large amount of sequence duplication. Even at 125 bp it is surprising that your reads are multi-mapping but then this is the maize genome.Good idea, I thought that
clumpifyis just better compressing alghoritm. I've found Stampy, the authors claim it is good for mapping when snp's and indels are expected. I'll try several methods. I must say, I've found two publications (on maize and mosquito) with similar percent of filtered reads.Real
snpandindelsis one thing but since you don't know the ground truth there is no good way for you to judge if bbmap (or any other aligner) is doing a reasonable job. If you allow for additional errors then you may have a worse problem with multi-mapping.Post the results of
clumpifyand let us see if you have sequence duplicates that can be removed prior to alignment.I've run
~/bin/bbmap/clumpify.sh in=hds35_clean.fastq.gz out=clumped35.fastq.gz groups=24Below is the info printed on the screen.
In fact the cultivar I work on has been resequenced, so I have some info about variants. I planned to use this info after peak calling, to look for a nucleotide-level differences.
I've read about removal of optical duplicates but I didn't think it is a big concern. After all I'm looking for peaks, generated from regions with high number of reads. Does tools like
clumpifycause peaks to disappear?Can you run
clumpify.shwith thededupeoption on ORIGINAL data. Do not use trimmed reads. Do you have single-end reads or are these interleaved? If you have paired-end reads then please use both. You don't need to provide anout=file if you don't want to get the deduped reads. This will still produce the stats output you see above.Note: Depending on what sequencer this was sequenced on you will need to change thedistandspantilesoptions. Useoptical dist=40 spantiles=fafter making necessary changes. Take a look at in-line help.I edited the full log to keep stat part we need above.
/thanks for tidying-up my post, I can't figure out how to paste such text properly. Here is the ouptut with original data and
dedupe. I have reads from Illumina, in this case it is single-end.So you don't have a large fraction of duplicate reads but there are some (~15%). Because you have single end data that is probably contributing to not having dual anchors to place the reads on genome. You could use de-duped sequences and see if that helps improve mapq values to some extent.
Part of post you want to present with mono-spaced font can be selected by highlighting and then you can use the
101010button in top edit bar to format that part as code.So you propose that I dedupe raw reads using above command, trim deduped file and use it to mapping?
Doesn't that compromise peak-calling?
Without UMI's there is no definite way to confirm if those 15% sequence duplicates are PCR or otherwise. Since you were worried about mapq being low we have been discussing these possibilities to see if that could be improved.
If you have a different option to try at this point then try that out but this may be the last thing to try with BBMap.
OK, thanks for help guys, in the end I'm not that worried about mapq. I'll experiment with some aligners and see what works best.
Nevertheless I've trimmed deduped fastq file and mapped it using default setings of
bbmap. Here is the result ofbamqcAs I expected there is some improvement in
mapq. This is a difficult problem to tackle. Let us know if you fare better with other aligners though I suspect this may be one of the better results.What you could do is to take the reference fasta file you have, simulate reads of the same length and then see what the alignment rate is, so whether this low MAPQ / multimapping problem is indeed a consequence of the repetitiveness of the reference, or due to some artifacts in the samples you have.
Ok, so I've tried
bwa,bwa+ realignment withstampy(option--bamkeepgoodreads). Thebamqcplot looks somehow 'smeared' in the low mapq range with ca. 10% less high scoring alignments.stampyhelped somehow, but in the endbbdukgives better results, at least in term of % of high scoring mappings. Ah, and it is much faster. I think I'll rely onbbdukwithmaxindel=20 ambig=random, maybe I'll also look at peak-calling results on alignment bybwa+stampy. As of simulations, unfortunately I don't have time to try it, I think I've already delved too much in aligners.