Question: Tool Recommendation Wanted For Cleaning Fasta/Fastq Files To Remove Unpaired Reads Following Pre-Processing
Moss, Cleveland, OH, wrote 6.5 years ago:

Hi everyone, I've been digging around the web trying to find a tool that would allow me to clean up my paired-end Illumina data before mapping. My pipeline thus far has been:

1) FastQC - my R1 file had a bit of adaptor contamination; the R2 file was fine.
2) fastx_collapser - I had a lot of data and am just mapping to determine coverage of the genome (of closely related species) to see how broad our coverage is before other analysis begins. I ran it on R1 and R2 separately, which left the files with different numbers of sequences (although the difference was <1% of the total number of sequences).
3) fastx_clipper - run only on the file with the adaptor contamination, to remove sequences containing the adaptor.
4) fix pairing data - ? tool

I saw there was a tool referred to as rePair, but I have not been able to track it down. I thought for sure that fastx or picard would have something to filter out unpaired reads, but I'm just not seeing it. I'm hoping there is an easy answer here. I am planning to use bowtie2 for the alignment. Thanks in advance!

paired-end • 8.7k views

Thanks dpryan79. In the end I decided that I could concatenate the collapsed files and map them as though they were single reads. This will work for just looking at coverage of a closely related genome, but wouldn't work for any solid, in-depth analysis. Since I am just double-checking that the sequencing protocol gives sufficient coverage (not talking depth here) of the genome, this should work fine. If anyone else was considering using the pipeline I described above, don't do it. The problem is that you lose the headers by collapsing the reads using the fastx tools, so the two files can no longer be re-paired. Better to do as dpryan79 suggests and just map all the reads and collapse/remove redundant reads after the fact. I believe samtools and picard both have tools for reducing redundancy in sam/bam files.
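
To make the header problem concrete, here is a minimal Python sketch. The `collapse` function is a hypothetical stand-in that only mimics the way fastx_collapser, as far as I know, renames collapsed sequences to rank-count style headers; it is not the tool's own code, and the reads are toy data.

```python
from collections import Counter

def collapse(sequences):
    """Collapse identical sequences, most abundant first.

    Returns (header, sequence) tuples whose headers have the form
    "<rank>-<count>". The original read names are discarded, which is
    the point of this illustration.
    """
    counts = Counter(sequences)
    # most_common() sorts by descending count; ties keep first-seen order
    return [(f"{rank}-{n}", seq)
            for rank, (seq, n) in enumerate(counts.most_common(), start=1)]

# Toy data: three fragments, listed in the same order in both files
r1 = ["ACGT", "ACGT", "TTGA"]
r2 = ["GGCC", "CATG", "CATG"]

r1_collapsed = collapse(r1)
r2_collapsed = collapse(r2)
print(r1_collapsed)
print(r2_collapsed)
```

Both collapsed files end up with a header `1-2`, but in R1 it stands for fragments 1 and 2 while in R2 it stands for fragments 2 and 3. Once the original names are gone, nothing in the files says which read was the mate of which.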

Reply written 6.5 years ago by Moss
Biomonika (Noolean), State College, PA, USA, wrote 5.7 years ago:

This script outputs pairs and solo reads separately:

So either use Trimmomatic, which keeps reads paired, or use your favorite software, which will leave you with unequal numbers of sequences, and then fix the pairing with this script (written by Eric Normandeau).
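
For readers who want to see the idea behind this kind of re-pairing, here is a minimal Python sketch. It is not Eric Normandeau's script; the function names are made up for illustration. It assumes plain 4-line FASTQ records, read names that survived pre-processing (so it would not help after fastx_collapser), mates that share a name apart from an optional `/1` or `/2` suffix, and an R2 file small enough to hold in memory.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:          # end of file
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()       # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def read_id(header):
    """Drop '@', any description after whitespace, and a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name

def split_pairs(r1_handle, r2_handle):
    """Return (pairs, r1_singletons, r2_singletons), keeping R1 order."""
    # Index every R2 record by read name, then stream through R1
    r2_records = {read_id(rec[0]): rec for rec in read_fastq(r2_handle)}
    pairs, r1_single = [], []
    for rec in read_fastq(r1_handle):
        mate = r2_records.pop(read_id(rec[0]), None)
        if mate is None:
            r1_single.append(rec)
        else:
            pairs.append((rec, mate))
    # Whatever is left in the index never found an R1 mate
    return pairs, r1_single, list(r2_records.values())

# Toy input: read "a" lost its R2 mate, read "c" lost its R1 mate
r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nTTGA\n+\nIIII\n")
r2 = io.StringIO("@b/2\nCATG\n+\nIIII\n@c/2\nGGCC\n+\nIIII\n")
pairs, r1_single, r2_single = split_pairs(r1, r2)
print(len(pairs), len(r1_single), len(r2_single))
```

With real data you would pass opened files instead of the `io.StringIO` objects and write the three groups to separate output files, so that the paired files stay in the same order for the aligner.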



The script is still available and multiple people are reporting using it with success.

Reply written 3.0 years ago by Eric Normandeau

Dear Eric, it works perfectly as described, I confirm. Thanks!

Reply written 2.9 years ago by aln290
Devon Ryan, Freiburg, Germany, wrote 6.5 years ago:

Have a look here (How to sort two mate pair (fastq) files so that the order of the identifiers is the same?) or here (Combining the paired reads from Illumina run) for solutions to resyncing fastq files. In general, it's probably faster to simply map those reads rather than collapsing them and then needing to resync your files.

Ian, University of Manchester, UK, wrote 6.5 years ago:

I would recommend Trimmomatic as it performs read filtering/trimming, etc, and maintains paired filtered reads whilst removing singletons.
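
As a rough illustration of what a paired-end Trimmomatic call looks like, here is a small Python sketch that only assembles the command line. The jar path, file names, adapter FASTA, output prefix and the `ILLUMINACLIP`/`MINLEN` settings are all placeholder assumptions to adapt to your own data, and `trimmomatic_pe_command` is a hypothetical helper, not part of Trimmomatic. In paired-end mode Trimmomatic writes four outputs: surviving pairs for R1 and R2, plus one "unpaired" file per direction for reads whose mate was dropped.

```python
def trimmomatic_pe_command(jar, r1, r2, adapters, prefix="trimmed"):
    """Build an argument list for Trimmomatic in paired-end (PE) mode."""
    return [
        "java", "-jar", jar, "PE",
        r1, r2,
        # Output order: R1 paired, R1 unpaired, R2 paired, R2 unpaired
        f"{prefix}_R1_paired.fastq", f"{prefix}_R1_unpaired.fastq",
        f"{prefix}_R2_paired.fastq", f"{prefix}_R2_unpaired.fastq",
        # Adapter clipping: seed mismatches, palindrome and simple clip thresholds
        f"ILLUMINACLIP:{adapters}:2:30:10",
        # Drop reads shorter than 36 bases after trimming (placeholder value)
        "MINLEN:36",
    ]

cmd = trimmomatic_pe_command("trimmomatic.jar", "R1.fastq", "R2.fastq", "adapters.fa")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

The two `*_paired.fastq` files stay in sync and can go straight to bowtie2 as mates, while the unpaired files can be ignored or mapped as single-end reads.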
