Looking for reputable tool to find low complexity regions in FASTQ or BAM files
3
1
Entering edit mode
7.3 years ago
nr600605 ▴ 10

I am looking at Heng Li' s paper "Towards Better Understanding of Artifcats in Variant Calling from High_coverage Samples".

I am interested in finding a tool like mdust which finds low complexity regions, but I would prefer not to convert my data to FASTA format if possible. DO you know of a tool which performs an mdust like function but on FASTQ or BAM input. AN additional requirement that I have is that the tool should be usable via command-line, not a web-based service.

Thank-you!

mdust next-gen lowcomplexity sequence • 7.8k views
7
Entering edit mode
7.3 years ago
lh3 33k

You should mask the reference genome after mapping and filter out variants/features fall in low-complexity regions. It is not advised to mask the read sequences, or mask the reference genome before mapping. The mdust program used in the paper is available at:

ftp://occams.dfci.harvard.edu/pub/bio/tgi/software/seqclean/

NCBI blast is using an enhanced algorithm called sdust. You can compile it from the NCBI toolkit. I have reimplemented the same algorithm. To get it:

git clone https://github.com/lh3/minimap
cd minimap; make sdust


You then simply throw away features/variants fall in masked.bed. You usually don't need to apply the mask to ref.fa. Option -t controls how much to mask. The higher the value, the fewer regions get masked. Gene Myers has independently reimplemented sdust as well.

0
Entering edit mode

There is also dustmasker that is distributed with blast+ binaries. It sounds similar in that it outputs a file of intervals by default:

\$ ./dustmasker
USAGE
dustmasker [-h] [-help] [-xmlhelp] [-in input_file_name]
[-out output_file_name] [-window window_size] [-level level]
[-parse_seqids] [-version-full]

DESCRIPTION
Low complexity region masker based on Symmetric DUST algorithm

Use '-help' to print detailed descriptions of command line arguments

I think you can also still get mdust or dust from legacy blast distributions but maybe not (can't remember for certain).

0
Entering edit mode

To confirm, dustmasker is bundled with NCBI BLAST+.

0
Entering edit mode

@lh3, can you expand a bit on "it is not advised"? Why is this not recommended? I can understand it perhaps if you mean that most tools are not anticipating having large swaths of soft or hard masking in the incoming nucleotide sequences, and thus might work suboptimally.

1
Entering edit mode

If you mask the reference prior to mapping, some reads that originated from masked regions will map to similar sequences in the unmasked regions, since that's all that's available for them to map to. This leads to false positives. Similarly, masking the reads decreases the amount of information available which reduces accuracy. Tools generally don't care if you mask vast swaths of the genome, or tend to be positively impacted in terms of speed (low complexity or repetitive regions can slow down alignment), but destroying information or concealing it from the aligner will cause it to yield incorrect results.

0
Entering edit mode

Thanks for your reply. I am not following your false positive example. If you have a read R1 that aligns to an unmasked nucleotide DB at loci L1 and L2, how does an alignment of R1 to masked nucleotide DB (same thresholds), where locus L1 is a low-complexity region and thus does not appear in the output, lead to a false positive? Did you mean a false negative?

1
Entering edit mode

Let's say that you have read AAAAAAAAAAAAATTTTTTTTTTTTTTT originating from genomic sequence S1 = AAAAAAAAAAAAATTTTTTTTTTTTTTT. You mask sequences with low complexity, and that region is masked, but similar region S2 = AAAAAGAAAAAAATTTTTTTCTTTTTTT is not masked because it is higher-complexity. Then the read will will map to S2 and you will get 2 false-positive variant calls, spurious high coverage, and other artifacts.

0
Entering edit mode

I hope you can clarify the following: I saw that on your Github/varcmp, there is a LC file for hg38. Using your sdust implementation on hg38.fa myself, I get a file that is almost double as big, even when removing the alternate haplotypes. Any idea why the two files are different. I used sdust without any further parameters on the current hg38.

0
Entering edit mode
7.3 years ago

BBDuk and BBMask both detect low-complexity areas in fasta or fastq. BBDuk does filtering (removing low-complexity sequences), while BBMask does masking of low-entropy sequences, though both use the same algorithm to detect them. The sensitivity is adjustable with the entropy flag.

For example:

bbduk.sh in=reads.fq out=filtered.fq outm=low_complexity.fq entropy=0.7


The higher the entropy threshold, the more reads will be classified as low-complexity; it ranges from 0 (nothing will get filtered) to 1 (almost everything will be filtered).

0
Entering edit mode

Hi Brian,

thanks for this great set of scripts.

I want to mask low complexity reads before mapping and came across your thread post here. I did a couple of bbduk.sh runs on my fastq.fq files (HiSeq shallow sequenced libraries to ~3M reads; command follows your example above) and noticed that a vast range of entropy values (0.0001 - 0.8) always delete ~30% of my data (30.04% and 30.17% respectively). Inspecting the designated low complexity reads I am not sure (at least for some of them) why they were filtered and the number (at least for 0.0001) seems too high. Would you mind to elaborate briefly on the algorithm behind the bbduk.sh script?

thanks

0
Entering edit mode

Can you please post the full command, version number, and stderr output you're getting? Also, please post (or email me) a handful of reads that are getting filtered out (the outm stream) at entropy=0.0001.

Thanks!

0
Entering edit mode
6.1 years ago

I would recommend masking the reference nucleotide sequences prior to building an alignment database, assuming you are working with a read aligner that is designed to deal with masked sequences.

For example, you could mask the human genome using dustmasker, follwed by using 'sed' to convert all lower-case soft-masked bases into 'N'. Then, you can use Bowtie2, which is designed to be able to handle ambiguous nucleotides:

http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#ambiguous-characters

• It is not clear to me how well-validated older tools like 'dustmasker' are for ultra-short NGS reads, since they were designed back in the day of Sanger sequencing
• Once you've masked your reference database ONCE, you don't have to worry about doing so for all future FASTQ files that you want to align, as it is redundant to mask both the reference and the short reads
1
Entering edit mode

Once you've masked your reference database ONCE, you don't have to worry about doing so for all future FASTQ files that you want to align, as it is redundant to mask both the reference and the short reads

Not sure I agree with this... reads can still map somewhere. If you use a masked reference, reads from the masked region will simply map to the wrong location (the next-closest match), yielding incorrect results. For this reason, I never recommend using masked references except for the purpose of contamination filtering.

0
Entering edit mode

I used masked fasta files to make genome alignments using LASTZ, but I still have a lot of repetitive regions. I wonder if there is any algorithm I can use to remove this repetitive regions in the MAF files that resulted from the alignments.