Filter PolyA/T And Other Repeats In Illumina Reads
13.2 years ago

Hi,

Does anyone know of a standard way of filtering polyA/T reads and other repeats, like GTGTGTGTGT and ACACACACAC from Illumina reads?

Does anyone have a list or library of such sequence patterns that are uninformative or make alignment, assembly, etc. difficult?

assembly • 5.5k views
13.2 years ago
Michael 54k

Such regions are called low-complexity regions and are masked by tools such as DustMasker (or dust in the old BLAST) in the NCBI BLAST+ toolbox. One restriction, as far as I know: it works only on FASTA, not FASTQ, which is the only obstacle to using it with Illumina reads. Unfortunately, I don't know of such a tool that works on FASTQ data.
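To make the idea of a low-complexity score concrete, here is a minimal Python sketch of a DUST-style score computed over a whole read. It is a simplification for illustration, not DustMasker's actual implementation (which works on windows and masks sub-regions); the function name `dust_score` and any cutoff you pick are assumptions.

```python
from collections import Counter


def dust_score(seq):
    """Simplified DUST-style score for a whole read (illustrative only).

    Counts overlapping triplets and sums c * (c - 1) / 2 over the triplet
    counts, normalised by (number of triplets - 1). Repetitive reads score
    high; reads whose triplets are all distinct score 0.
    """
    seq = seq.upper()
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if len(triplets) < 2:
        return 0.0  # too short to score
    counts = Counter(triplets)
    repeats = sum(c * (c - 1) / 2 for c in counts.values())
    return repeats / (len(triplets) - 1)


if __name__ == "__main__":
    for read in ("AAAAAAAAAA", "GTGTGTGTGT", "ACGTTGCA"):
        print(read, dust_score(read))
```

A read would then be discarded or masked when its score exceeds a cutoff chosen by inspecting your own data.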

Edit: Well not knowing bothered me too much, so I searched for the answer. And again it's in Bioconductor, where else?

The ShortRead package contains a method to create user-defined short read filters using srFilter. One of the predefined filters is dustyFilter(threshold=Inf, batchSize=NA, .name="DustyFilter"). It works on any ShortRead-derived object (such as the ShortReadQ object returned by readFastq), so you can import and export your files using readFastq and writeFastq.
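The same read, filter, write pattern can be sketched outside R. The Python below is not the ShortRead API; it is a rough analogue that assumes plain four-line FASTQ records, and the names `is_simple_repeat` and `filter_fastq` and the 0.9 cutoff are hypothetical. The predicate flags reads dominated by a repeat unit of length 1 or 2, such as polyA or GTGTGT.

```python
import io
from itertools import islice


def is_simple_repeat(seq, max_period=2, min_fraction=0.9):
    """True if most bases equal the base `period` positions earlier."""
    seq = seq.upper()
    for period in range(1, max_period + 1):
        if len(seq) <= period:
            continue
        matches = sum(1 for i in range(period, len(seq))
                      if seq[i] == seq[i - period])
        if matches / (len(seq) - period) >= min_fraction:
            return True
    return False


def filter_fastq(src, dst, discard=is_simple_repeat):
    """Copy 4-line FASTQ records from src to dst, skipping discarded reads.

    Returns (kept, dropped). Assumes unwrapped, 4-line records.
    """
    kept = dropped = 0
    while True:
        record = list(islice(src, 4))
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        if discard(record[1].strip()):
            dropped += 1
        else:
            dst.writelines(record)
            kept += 1
    return kept, dropped


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGTTGCATC\n+\nIIIIIIIIII\n"
                       "@r2\nGTGTGTGTGT\n+\nIIIIIIIIII\n")
    out = io.StringIO()
    print(filter_fastq(demo, out))
    print(out.getvalue(), end="")
```

For real files, `src` and `dst` would be handles from `open(...)`, or `gzip.open(..., "rt")` for compressed input.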

10.4 years ago
Chronos ▴ 610

prinseq-lite, which is a fairly efficient command-line Perl script, supports two methods (dust and entropy, with adjustable thresholds) for removing low-complexity reads from your FASTQ files. It also has a separate option to trim poly-A/T tails, without discarding the entire read.
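For readers who want to see what an entropy measure and tail trimming amount to, here is a small Python sketch. It is not prinseq's code and does not reproduce its scaling or options; `triplet_entropy`, `trim_poly_at_tail`, and the `min_len` default are assumptions made for illustration. The trimming function only handles a run of A or T at the 3' end.

```python
import math
import re
from collections import Counter


def triplet_entropy(seq):
    """Shannon entropy (in bits) of the overlapping-triplet distribution."""
    seq = seq.upper()
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if not triplets:
        return 0.0
    total = len(triplets)
    counts = Counter(triplets)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def trim_poly_at_tail(seq, qual, min_len=5):
    """Remove a trailing run of >= min_len A's or T's, keeping the rest.

    The quality string is cut at the same position so the two stay in sync.
    """
    pattern = r"(A{%d,}|T{%d,})$" % (min_len, min_len)
    match = re.search(pattern, seq, flags=re.IGNORECASE)
    if match:
        seq, qual = seq[:match.start()], qual[:match.start()]
    return seq, qual


if __name__ == "__main__":
    print(triplet_entropy("AAAAAAAAAA"), triplet_entropy("ACGTTGCA"))
    print(trim_poly_at_tail("ACGTAAAAAAA", "IIIIIIIIIII"))
```

Low entropy marks a read as a candidate for removal, while tail trimming keeps the informative part of the read instead of discarding it.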


Interesting features. I knew about prinseq but missed this functionality.
