Question: Read pairing issues detected in Tophat run
3.4 years ago, tunl wrote:

I am running Tophat (v2.0.10) as follows:

./tophat2 -p 8 -G genes.gtf --b2-very-sensitive --library-type fr-firststrand -o ./result/CTR Bowtie2index/genome CTR-0.fastq CTR-1.fastq

At the step “Preparing reads” ( prep_reads v2.0.9 (3067M) ), I got:

WARNING: read pairing issues detected (check prep_reads.log) !
Pair #1 name mismatch: HWI-ST1133R:6:1101:1060:2144#AGGCAGCTCTCT/1 vs HWI-ST1133R:7:1101:1166:2068#NGGCAG/1
4266 out of 12827817 reads have been filtered out
7461 out of 12827817 read mates have been filtered out

I had two other subsequent Tophat runs on two other samples, and also got the read pairing issues (name mismatch) as follows:

WARNING: read pairing issues detected (check prep_reads.log) !
Pair #1 name mismatch: HWI-ST1133R:7:1101:1445:2216#CTCTCTCTCTCT/1 vs HWI-S3R:2:1101:1053:2168#CTCTCTCTCTCT/1
9331 out of 23151044 reads have been filtered out
138 out of 23151044 read mates have been filtered out

And:

WARNING: read pairing issues detected (check prep_reads.log) !
Pair #1 name mismatch: HWI-ST1133R:6:1101:1392:2151#GGACTCCTCTCT/1 vs HWI-ST1133R:7:1101:1172:2165#NGACTC/1
4210 out of 12330176 reads have been filtered out
7200 out of 12330176 read mates have been filtered out

Is this read pairing name mismatch a serious problem in running Tophat? What impact does it have?

What could I possibly do to fix this problem?

I’d greatly appreciate any ideas and suggestions.

Thank you very much!

Tags: rna-seq, tophat

Did you trim your paired-end data files independently, or with a trimming program that is not paired-end (PE) aware? That is the likely cause of the reads being out of order in your data files. You can use repair.sh from BBMap to fix the read pairing:

repair.sh in1=r1.fq.gz in2=r2.fq.gz out1=fixed1.fq.gz out2=fixed2.fq.gz outsingle=singletons.fq.gz

Note: You are running an old version of TopHat (almost 2.5 years old). That is not a good idea. You should upgrade to the latest version (v2.1.1) if you are able to.

In terms of impact on alignments: if the reads are out of order between the two files, the aligner will treat unrelated reads as mates, and you could get discordant or strange alignments that do not make sense.
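Before repairing anything, it can help to confirm how far out of sync the two files actually are. The following is a minimal sketch, not part of BBMap or TopHat: `count_name_mismatches` is a hypothetical helper name, and it assumes uncompressed FASTQ with exactly four lines per record and read names that differ between mates only by a trailing `/1` or `/2` or by text after the first space.

```shell
# Hypothetical helper: count record positions where the read names in two
# FASTQ files disagree. Assumes plain-text, four-line FASTQ records.
# Records beyond the end of the first file are not examined.
count_name_mismatches() {
    # $1 = first FASTQ file, $2 = second FASTQ file
    awk -v mates="$2" '
        # Keep the name up to the first space, then drop a trailing /1 or /2.
        function core(h) { sub(/ .*/, "", h); sub(/\/[12]$/, "", h); return h }
        {
            # Read the corresponding line of the second file in lockstep.
            if ((getline other < mates) <= 0) other = ""
        }
        # Header lines are the first line of every four-line record.
        NR % 4 == 1 && core($0) != core(other) { n++ }
        END { print n + 0 }
    ' "$1"
}

# Example call (file names taken from the command in the question):
# count_name_mismatches CTR-0.fastq CTR-1.fastq
```

A result of 0 means the names line up record by record; a count close to the total number of reads suggests the files are shifted or do not belong together at all.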

modified 3.4 years ago • written 3.4 years ago by genomax

Thank you so much for your advice!

Actually we got the fastq files from other people, so I’m not sure if they were trimmed or not. Is there an easy way to find out whether a fastq file has been trimmed?

If the fastq files are not trimmed, would it also cause this read pairing name mismatch problem?

So when we trim the fastq files, we should not trim the paired-end data files independently, right?

Is Trimmomatic a good tool to trim paired-end fastq files? How about FastQC?

Thank you very much for your help!

written 3.4 years ago by tunl

If the reads are not all the same length, that indicates the data have been trimmed. You should be able to see this in the FastQC report (basic statistics at the top). FastQC only does QC; it does not change the data in any way.
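The same check can be done directly on the file without FastQC. This is a minimal sketch under the assumption of uncompressed, four-line FASTQ records; `read_length_counts` is a hypothetical helper name.

```shell
# Hypothetical helper: tabulate read lengths in a FASTQ file.
# Prints one "length count" pair per line, sorted by length.
read_length_counts() {
    # Sequence lines are the second line of every four-line record.
    awk 'NR % 4 == 2 { count[length($0)]++ }
         END { for (len in count) print len, count[len] }' "$1" | sort -n
}

# Example call (file name taken from the command in the question):
# read_length_counts CTR-0.fastq
```

A single output line means every read has the same length, which is what untrimmed Illumina data normally look like; several lines point to trimming.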

Improper trimming is a surefire way of getting reads out of sync. I can think of a few other ways this could happen, but they all have low probability (e.g. corruption during transfer).

Trimmomatic is a PE-aware trimmer. If you downloaded the BBMap suite, you could also use BBDuk for trimming/scanning your data.

written 3.4 years ago by genomax

Thank you very much for your further help!

So repair.sh can fix the read pairing issues no matter what caused the mismatch, right?

What is the singletons.fq.gz file in the repair.sh command line?

Another question is, when Tophat says some reads and read mates have been filtered out, does this mean the mismatched parts are not aligned at all?

Thank you very much!

written 3.4 years ago by tunl

Yes, repair.sh can fix the read order so that the two files (R1/R2) are in sync again. singletons.fq.gz will contain the reads whose mate is missing, for example because it was completely trimmed out, eliminated, or otherwise absent.
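To preview which reads would have no mate, the names in the two files can be compared directly. This is a rough sketch and not how repair.sh works internally: `unpaired_names` is a hypothetical helper, it assumes uncompressed four-line FASTQ records with mate names differing only by a trailing `/1` or `/2` or by text after the first space, and it keeps every name from the first file in memory, so it is only practical for modest file sizes.

```shell
# Hypothetical helper: list read names that occur in only one of two FASTQ files.
unpaired_names() {
    # $1 = first FASTQ file, $2 = second FASTQ file
    awk '
        # Strip the leading @, anything after the first space, and a trailing /1 or /2.
        function core(h) { sub(/^@/, "", h); sub(/ .*/, "", h); sub(/\/[12]$/, "", h); return h }
        FNR % 4 != 1 { next }                      # look at header lines only
        FNR == NR    { state[core($0)] = 1; next } # first file: remember the name
        {
            name = core($0)
            if (name in state) state[name] = 2     # name is present in both files
            else print name                        # name is only in the second file
        }
        # Names still marked 1 were never seen in the second file.
        END { for (name in state) if (state[name] == 1) print name }
    ' "$1" "$2" | sort
}

# Example call (file names as in the repair.sh example above, after decompression):
# unpaired_names r1.fq r2.fq
```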

Are you referring to the filtering by TopHat discussed in this thread: "What are the cut-offs during read quality filtering in Bowtie/TopHat before mapping?"

modified 3.4 years ago • written 3.4 years ago by genomax

I was referring to the messages in prep_reads.log (as quoted in the blue box in my posting): 4266 out of 12827817 reads have been filtered out; 7461 out of 12827817 read mates have been filtered out.
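For scale, the filtered counts can be expressed as a percentage of the input reads. A small sketch of the arithmetic; `percent_filtered` is a hypothetical helper name.

```shell
# Hypothetical helper: percentage of reads filtered out, to four decimal places.
percent_filtered() {
    # $1 = number of reads filtered out, $2 = total number of reads
    awk -v n="$1" -v total="$2" 'BEGIN { printf "%.4f\n", 100 * n / total }'
}

# Counts taken from the prep_reads messages quoted above:
# percent_filtered 4266 12827817
# percent_filtered 7461 12827817
```

For the first sample this comes to roughly 0.03% of reads and 0.06% of read mates, so the filtering step itself removed only a very small share of the data.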

Thank you for pointing me to the previous posting. So this “filtering-out” is also a quality control to skip the bad reads. In this case, I am just wondering if Tophat also filtered out the name-mismatched parts so that the name-mismatched parts are not aligned at all?

I ran Cuffdiff on the BAM files created by Tophat, and for some reason, the step “Testing for differential expression and regulation in locus” became extremely slow (only 10% was done after 3 days). So I am just wondering if the name mismatch has anything to do with this slowness. If the name-mismatched parts are filtered out by Tophat (no alignment), can they still appear in the output BAM files and affect Cuffdiff in some way?

I found that our fastq data are actually not trimmed (identical length). Could untrimmed paired-end reads also have name mismatch (if data not corrupted during transfer)?

Thank you very much for your help!

written 3.4 years ago by tunl

"I found that our fastq data are actually not trimmed (identical length). Could untrimmed paired-end reads also have name mismatch (if data not corrupted during transfer)?"

There is no reason they should be mismatched (unless someone did something to the files).
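One detail in the warnings quoted in the question may be worth checking: in all three, both read names end in /1, and the two names come from different lanes (6 vs 7, or 7 vs 2). If /1 follows the usual Illumina convention for read 1, that could mean each pair of input files holds read 1 from two different lanes rather than R1/R2 mates. That is an inference from the log lines, not something confirmed in this thread. A minimal sketch for checking the first header of each file, assuming uncompressed FASTQ; `first_read_mate` is a hypothetical helper name.

```shell
# Hypothetical helper: print the mate number (1 or 2) encoded in the first
# header line of a FASTQ file, or "unknown" if no known style is recognised.
first_read_mate() {
    awk 'NR == 1 {
            # Older Illumina style: name ends in /1 or /2.
            if (match($0, /\/[12]$/))      print substr($0, RSTART + 1, 1)
            # Newer Illumina style: " 1:" or " 2:" follows the read name.
            else if (match($0, / [12]:/))  print substr($0, RSTART + 1, 1)
            else                           print "unknown"
            exit
         }' "$1"
}

# Example calls (file names taken from the command in the question):
# first_read_mate CTR-0.fastq
# first_read_mate CTR-1.fastq
```

If both files report 1, they are unlikely to be the two halves of a paired-end run, and it would be worth asking the data provider how the files were produced.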

I am going to point out again that unless you are using the latest TopHat, you may be running into known issues that have since been fixed. Did you upgrade TopHat to the latest version?

modified 3.4 years ago • written 3.4 years ago by genomax

Thank you very much for the advice!

I’ll try to upgrade to Tophat 2.1.1 now.

The people who provided the fastq files have just told us that the data may be ATAC-seq instead of RNA-seq (we were originally told the data were RNA-seq).

So I’m just wondering if untrimmed ATAC-seq paired-end reads could have name mismatch?

Does Tophat process ATAC-seq data and RNA-seq data any differently?

Thanks a lot for your help!

written 3.4 years ago by tunl

If this is ATAC-seq data then you should not use TopHat for analysis. See this thread for options.

written 3.4 years ago by genomax

Thanks a lot for the information!

It looks like they use Bowtie to map ATAC-seq data.

So does Tophat have issues with mapping the ATAC-seq data?

I thought Tophat uses Bowtie as its alignment engine and Bowtie cannot align reads that span introns...

modified 3.4 years ago • written 3.4 years ago by tunl