Question: Tool Recommendation Wanted For Cleaning Fasta/Fastq Files To Remove Unpaired Reads Following Pre-Processing
0
5.4 years ago by
Moss20
Cleveland, OH
Moss20 wrote:

Hi Everyone, I've been digging around the web trying to find a tool that would allow me to clean up my paired-end Illumina data before mapping. My pipeline thus far has been:

1) FASTQC: my R1 file had a bit of adaptor contamination; the R2 file was fine.
2) fastx_collapser: I had a lot of data and am just mapping to determine coverage of the genome (of closely related species) to see how broad our coverage is before other analysis begins. I ran it on R1 and R2 separately (the files were left with different numbers of sequences, although the difference was <1% of the total number of sequences).
3) fastx_clipper: only on the file with the adaptor contamination; removed sequences containing the adaptor.
4) Fix pairing data: ? tool

I saw there was a tool referred to as rePair, but I have not been able to track it down. I thought for sure that fastx or picard would have something to filter out unpaired reads, but I'm just not seeing it. I'm hoping there is an easy answer here. I am planning to use bowtie2 for the alignment. Thanks in advance!

paired-end • 8.0k views
ADD COMMENTlink modified 4.6 years ago by Biomonika (Noolean)3.0k • written 5.4 years ago by Moss20

Thanks dpryan79. In the end I decided that I could concatenate the collapsed files and map them as though they were single reads. This will work for just looking at coverage of a closely related genome, but wouldn't work for any solid, in-depth analysis. Since I am just double-checking that the sequencing protocol gives sufficient coverage (not talking depth here) of the genome, this should work fine. If anyone else was considering using the pipeline I described above, don't do it. The problem is that you lose the headers by collapsing the reads using the fastx tools, so the two files can no longer be matched up by read name. Better to do as dpryan79 suggests and just map all the reads and collapse/remove redundant reads after the fact. I believe samtools and picard both have tools for reducing redundancy in sam/bam files.
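To illustrate why collapsing breaks pairing, here is a rough Python sketch of what a collapse step does in principle. It is not the fastx_collapser implementation; the function name `collapse` and the `rank-count` labels are assumptions made for illustration only. The point is that every original read name is thrown away and replaced by a new label, so there is nothing left to match against the mate file.

```python
from collections import Counter

def collapse(sequences):
    """Collapse identical sequences and relabel them.

    Returns (label, sequence) tuples, most abundant first. The
    'rank-count' label is a hypothetical naming scheme: the original
    read names are not carried over at all.
    """
    counts = Counter(sequences)
    return [
        (f"{rank}-{n}", seq)
        for rank, (seq, n) in enumerate(counts.most_common(), start=1)
    ]

if __name__ == "__main__":
    for label, seq in collapse(["ACGT", "TTTT", "ACGT"]):
        print(label, seq)
```

Running this on R1 and R2 separately gives two files whose labels have no relationship to each other, which is the situation described above.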

ADD REPLYlink written 5.4 years ago by Moss20
3
4.6 years ago by
State College, PA, USA
Biomonika (Noolean)3.0k wrote:

This script outputs pairs and solo reads separately:

https://github.com/enormandeau/Scripts/blob/master/fastqCombinePairedEnd.py

So, either use Trimmomatic, which keeps pairing intact, or use your favorite software that will leave you with an unequal number of sequences and then fix the pairing with this script (written by Eric Normandeau).
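For anyone curious what "fixing pairing" involves, below is a minimal Python sketch of the general idea. It is not the linked script; the function names (`read_fastq`, `read_id`, `split_pairs`) are made up for this example. It assumes plain 4-line FASTQ records, read names that still match between R1 and R2 (optionally ending in /1 and /2), and that the R2 file fits in memory. It will not help if the headers were already discarded by a collapse step.

```python
from itertools import islice

def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from a 4-line-per-record FASTQ."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        yield tuple(record)

def read_id(header):
    """Read name without '@', description, or a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name

def split_pairs(r1_records, r2_records):
    """Return (paired_r1, paired_r2, single_r1, single_r2).

    Paired reads come out in R1 order, with mates at matching positions.
    """
    # Index all R2 records by read name (assumes R2 fits in memory).
    r2_by_id = {read_id(rec[0]): rec for rec in r2_records}
    paired_1, paired_2, single_1 = [], [], []
    for rec in r1_records:
        mate = r2_by_id.pop(read_id(rec[0]), None)
        if mate is None:
            single_1.append(rec)
        else:
            paired_1.append(rec)
            paired_2.append(mate)
    # Whatever is left in the index never found a mate in R1.
    single_2 = list(r2_by_id.values())
    return paired_1, paired_2, single_1, single_2
```

To use it, open both files, pass `read_fastq(r1_handle)` and `read_fastq(r2_handle)` to `split_pairs`, and write each of the four lists to its own output file. For large data sets, the linked script or Trimmomatic is the safer choice.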

 

ADD COMMENTlink written 4.6 years ago by Biomonika (Noolean)3.0k
1

The script is still available and multiple people are reporting using it with success.

ADD REPLYlink written 23 months ago by Eric Normandeau10k
1

Dear Eric, it works perfectly as described, I confirm. Thanks!

ADD REPLYlink written 22 months ago by aln260
2
5.4 years ago by
Devon Ryan88k
Freiburg, Germany
Devon Ryan88k wrote:

Have a look here (How to sort two mate pair (fastq) files so that the order of the identifiers is the same?) or here (Combining the paired reads from Illumina run) for solutions to resyncing fastq files. In general, it's probably faster to simply map those reads rather than collapsing them and then needing to resync your files.

ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Devon Ryan88k
0
5.4 years ago by
Ian5.3k
University of Manchester, UK
Ian5.3k wrote:

I would recommend Trimmomatic as it performs read filtering/trimming, etc, and maintains paired filtered reads whilst removing singletons.

ADD COMMENTlink written 5.4 years ago by Ian5.3k