Question

Tool Recommendation Wanted For Cleaning Fasta/Fastq Files To Remove Unpaired Reads Following Pre-Processing

0

10.5 years ago

Moss ▴ 20

Hi everyone, I've been digging around the web trying to find a tool that would let me clean up my paired-end Illumina data before mapping. My pipeline so far has been:

1) FastQC: my R1 file had a bit of adaptor contamination; the R2 file was fine.
2) fastx_collapser: I had a lot of data and am just mapping to determine coverage of the genome (of a closely related species), to see how broad our coverage is before other analysis begins. I ran it on R1 and R2 separately, which left the files with different numbers of sequences (although the difference was <1% of the total number of sequences).
3) fastx_clipper: run only on the file with the adaptor contamination, to remove sequences containing the adaptor.
4) Fix the pairing: which tool?

I saw a tool referred to as rePair, but I have not been able to track it down. I thought for sure that fastx or picard would have something to filter out unpaired reads, but I'm just not seeing it. I'm hoping there is an easy answer here. I am planning to use bowtie2 for the alignment. Thanks in advance!

paired-end • 11k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 10.5 years ago by Moss ▴ 20

0

Thanks dpryan79. In the end I decided that I could concatenate the collapsed files and map them as though they were single-end reads. This will work for just looking at coverage of a closely related genome, but wouldn't work for any solid, in-depth analysis. Since I am just double-checking that the sequencing protocol gives sufficient coverage (breadth, not depth) of the genome, this should work fine. If anyone else was considering using the pipeline I described above, don't do it. The problem is that you lose the headers by collapsing the reads with the fastx tools, so the mates can no longer be matched up. Better to do as dpryan79 suggests and just map all the reads, then collapse/remove redundant reads after the fact. I believe samtools and picard both have tools for reducing redundancy in SAM/BAM files.

ADD REPLY • link 10.5 years ago by Moss ▴ 20

Ram · Answer 1 · 2014-07-30

3

9.7 years ago

Biomonika (Noolean) 3.2k

This script outputs pairs and solo reads separately.

So, either use Trimmomatic, which keeps the pairing, or use your favorite software, which will leave you with unequal numbers of sequences, and then fix the pairing with this script (written by Eric Normandeau).
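For readers who want to see what "fixing the pairing" involves, here is a minimal Python sketch of the general idea. It is not Eric Normandeau's script, and the function names (`read_fastq`, `read_id`, `resync`) are made up for illustration. It assumes plain 4-line FASTQ records, mate headers that differ only by a `/1` or `/2` suffix or by the part after the first space, and an R2 file small enough to hold in memory.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ stream."""
    handle = iter(handle)
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def read_id(header):
    """Reduce a header to a mate-independent ID (assumed header styles only).

    Handles '@name/1' and '@name 1:N:0:...' by dropping the mate marker.
    """
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def resync(r1_handle, r2_handle):
    """Return (pairs, r1_singletons, r2_singletons), keeping R1 order.

    All R2 records are held in a dict, so memory use grows with the R2 file.
    """
    r2_by_id = {read_id(rec[0]): rec for rec in read_fastq(r2_handle)}
    pairs, r1_single = [], []
    for rec in read_fastq(r1_handle):
        mate = r2_by_id.pop(read_id(rec[0]), None)
        if mate is None:
            r1_single.append(rec)
        else:
            pairs.append((rec, mate))
    # Whatever is left in the dict never found an R1 mate.
    r2_single = list(r2_by_id.values())
    return pairs, r1_single, r2_single


# Tiny in-memory example: read "a" lost its R2 mate, read "c" lost its R1 mate.
r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nGGGG\n+\nIIII\n")
r2 = io.StringIO("@b/2\nCCCC\n+\nIIII\n@c/2\nTTTT\n+\nIIII\n")
pairs, r1_only, r2_only = resync(r1, r2)
```

With real data you would pass open file handles instead of `io.StringIO` objects and write `pairs` back out as two synchronized files, with the singletons going to a third. For very large files, a tool that streams or sorts on disk is likely a better fit than this in-memory approach.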

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.7 years ago by Biomonika (Noolean) 3.2k

1

The script is still available, and multiple people report using it successfully.

ADD REPLY • link 7.0 years ago by Eric Normandeau 11k

1

Dear Eric, it works perfectly as described, I confirm. Thanks!

ADD REPLY • link 7.0 years ago by aln ▴ 320

score 2 · Answer 2 · 2013-10-13

Have a look here (How to sort two mate pair (fastq) files so that the order of the identifiers is the same?) or here (Combining the paired reads from Illumina run) for solutions to resyncing fastq files. In general, it's probably faster to simply map the reads as they are than to collapse them first and then have to resync your files.

score 0 · Answer 3 · 2013-10-14

0

10.5 years ago

Ian 6.0k

I would recommend Trimmomatic, as it performs read filtering, trimming, etc., and keeps the filtered reads paired whilst separating out singletons.

ADD COMMENT • link 10.5 years ago by Ian 6.0k