I am looking at Heng Li' s paper "Towards Better Understanding of Artifcats in Variant Calling from High_coverage Samples".
I am interested in finding a tool like mdust which finds low complexity regions, but I would prefer not to convert my data to FASTA format if possible. DO you know of a tool which performs an mdust like function but on FASTQ or BAM input. AN additional requirement that I have is that the tool should be usable via command-line, not a web-based service.
There is also
dustmaskerthat is distributed with blast+ binaries. It sounds similar in that it outputs a file of intervals by default:
I think you can also still get mdust or dust from legacy blast distributions but maybe not (can't remember for certain).
To confirm, dustmasker is bundled with NCBI BLAST+.
@lh3, can you expand a bit on "it is not advised"? Why is this not recommended? I can understand it perhaps if you mean that most tools are not anticipating having large swaths of soft or hard masking in the incoming nucleotide sequences, and thus might work suboptimally.
If you mask the reference prior to mapping, some reads that originated from masked regions will map to similar sequences in the unmasked regions, since that's all that's available for them to map to. This leads to false positives. Similarly, masking the reads decreases the amount of information available which reduces accuracy. Tools generally don't care if you mask vast swaths of the genome, or tend to be positively impacted in terms of speed (low complexity or repetitive regions can slow down alignment), but destroying information or concealing it from the aligner will cause it to yield incorrect results.
Thanks for your reply. I am not following your false positive example. If you have a read R1 that aligns to an unmasked nucleotide DB at loci L1 and L2, how does an alignment of R1 to masked nucleotide DB (same thresholds), where locus L1 is a low-complexity region and thus does not appear in the output, lead to a false positive? Did you mean a false negative?
Let's say that you have read AAAAAAAAAAAAATTTTTTTTTTTTTTT originating from genomic sequence S1 = AAAAAAAAAAAAATTTTTTTTTTTTTTT. You mask sequences with low complexity, and that region is masked, but similar region S2 = AAAAAGAAAAAAATTTTTTTCTTTTTTT is not masked because it is higher-complexity. Then the read will will map to S2 and you will get 2 false-positive variant calls, spurious high coverage, and other artifacts.
I hope you can clarify the following: I saw that on your Github/varcmp, there is a LC file for hg38. Using your sdust implementation on hg38.fa myself, I get a file that is almost double as big, even when removing the alternate haplotypes. Any idea why the two files are different. I used sdust without any further parameters on the current hg38.