TopHat2 high discordant alignment percentage?
0
0
Entering edit mode
2.7 years ago
edsouza ▴ 30

I started out with two massive paired-end FASTQ files with a dozen barcodes each. Our lab has some custom scripts, which I used to perform paired-end adapter trimming using cutadapt, and then to separate each barcode sample into a separate FASTA file. I independently verified that in both the .1.fa and .2.fa files, all of the reads correspond to each other. However, they are in FASTA rather than FASTQ format.

When I aligned one of the resulting barcode sets to a transcriptome, they all resulted in discordant alignments. Here is the end of alignment_summary.txt

Aligned pairs:  17746091

of these:   2799869 (15.8%) have multiple alignments

17746090 (100.0%) are discordant alignments

0.0% concordant pair alignment rate.


I'm very confident that each .1.fa read corresponds to the correct .2.fa read; I ran cutadapt in paired-end mode and the demultiplexing script performed the same sorting on both the .1.fastq and .2.fastq files. Yet, it shows up as 100% discordant. What is the issue here?

I performed this again on a subset of my FASTA files (only 50 reads each) and it again reported 100% discordant. From this, I can guess one of two options: 1) My data is very bad and none of the reads line up properly, or 2) There is some info in the FASTQ files that isn't present in my FASTA files, contributing to this issue.

Is there a way to run paired-end data using FASTA input in TopHat2? Or do I need to go back and modify my script so that only FASTQ files are used?

RNA-Seq alignment • 844 views
0
Entering edit mode

Could the 1.fastq and 2.fastq have been swapped?

You should know that the old 'Tuxedo' pipeline of Tophat(2) and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using salmon followed by DESEq2 or edgeR.

0
Entering edit mode

After some serious digging, I realized my hubris: I was "very confident" that each .1.fa read matched with a .2.fa read. However, digging back through my original datasets, this should come as no surprise because the md5sums of both of the big .1.fastq and .2.fastq files were identical! I'm guessing there was some data duplication error that I should now try to figure out.