high discordant pair % with Tophat2 on Galaxy
1
1
Entering edit mode
6.9 years ago

I'm totally new to all NGS tools so please bear with me.

I've been trying to analyze data some paired reads with the tools on Galaxy. After "grooming" the FASTQ files, I attempted to map the reads with Tophat2, following the tutorial here (https://docs.uabgrid.uab.edu/wiki/UAB_Galaxy_RNA_Seq_Step_by_Step_Tutorial). The alignment summary is pasted below:

Left reads:
Input:  36728489
Mapped:  35783336 (97.4% of input)
of these:  10212030 (28.5%) have multiple alignments (4222 have >20)
Input:  35892294
Mapped:  35027752 (97.6% of input)
of these:  10007714 (28.6%) have multiple alignments (4283 have >20)
97.5% overall read alignment rate.
Aligned pairs:  34152758
of these:   2784707 ( 8.2%) have multiple alignments
and:  34025873 (99.6%) are discordant alignments
0.4% concordant pair alignment rate.

Why are there so many "discordant alignments?" And is this why I'm getting such a low % of "properly paired" reads according to flagstat (pasted below)?

167038160 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
167038160 + 0 mapped (100.00%:-nan%)
167038160 + 0 paired in sequencing
84138107 + 0 read1
82900053 + 0 read2
16240 + 0 properly paired (0.01%:-nan%)
162899512 + 0 with itself and mate mapped
4138648 + 0 singletons (2.48%:-nan%)
154361318 + 0 with mate mapped to a different chr
42840562 + 0 with mate mapped to a different chr (mapQ>=5)

And just for good measure, here is output from Picard Alignment summary

 CATEGORY FIRST_OF_PAIR  TOTAL_READS 35796836  PF_READS 35796836  PCT_PF_READS 1  PF_NOISE_READS 8  PF_READS_ALIGNED 35796828  PCT_PF_READS_ALIGNED 1  PF_ALIGNED_BASES 1532498561  PF_HQ_ALIGNED_READS 24005633  PF_HQ_ALIGNED_BASES 1029373773  PF_HQ_ALIGNED_Q20_BASES 1029373773  PF_HQ_MEDIAN_MISMATCHES 0  PF_MISMATCH_RATE 0.282276  PF_HQ_ERROR_RATE 0.284893  PF_INDEL_RATE 0.000134  MEAN_READ_LENGTH 51  READS_ALIGNED_IN_PAIRS 34152750  PCT_READS_ALIGNED_IN_PAIRS 0.954072  BAD_CYCLES 0  STRAND_BALANCE 0.500655  PCT_CHIMERAS 0.999132  PCT_ADAPTER 0

I expect in order to answer these questions, one would need to know more about the quality of my reads, but as I said, I'm really flying blind here, so I can't even anticipate what additional info, you might need. I can say that according to my FastQC results on 223 and 224 (history #234 and #235), both look like they have lengths of 51bp. Also, in the FastQc results I'm seeing per base sequence qualities that are initially around 33, but then are steady >35 for the rest of the read. I apologize for what is probably a stupid question. Hey, I gotta start somewhere.

rna-seq galaxy tophat2 • 6.0k views
1
Entering edit mode

How did you groom the reads? It's likely that the files became out of sync at that step.

0
Entering edit mode

The initial file format was fastqillumina

• Input FASTQ quality scores type: Sanger + Illumina 1.8+
• Output FASTQ quality scores type: Sanger
• Force Quality Score encoding: ASCII
• Summarize input data: Summarize input
1
Entering edit mode

My question was actually more whether you did the "NGS: QC and manipulation > FASTX-TOOLKIT FOR FASTQ DATA" step. Fastx is known to be problematic with paired-end reads. As a rule if you have paired-end reads, don't use any tool that can't accept both fastq files at once (except FastQC). I assume there's a trimmer (trimmomatic, Trim Galore!, etc.) available through galaxy that can properly handle paired-end reads. If not, you can trim locally (trimmomatic is in Java, so it'll even run on Windows) and then upload the resulting 2 files.

0
Entering edit mode

I did NGS: QC and manipulation > Filter FASTQ and set a min quality to 20. I see that there's a checkbox for "This is paired end data" and I'm pretty sure I checked that. But I can run this again, and be certain to check it. I don't actually think this filtering is really necessary because my reads seem good: quality scores > 33 through read and sizes are all 51 bp. I am getting a problem with sequence duplication levels >=73%.

0
Entering edit mode

The duplication level (I assume you're getting this from FastQC) is fairly common for RNAseq, so I wouldn't worry about it. That's only really useful for looking at DNA sequencing, where you expect to more uniformly cover things (in RNAseq, anything really highly expressed will throw the duplication level metric off).

Try aligning the untrimmed/ungroomed reads and see if you get the same results. If so, perhaps try to post the fastq files somewhere and one of us can have a look at what might have gone wrong (please note the species you're working on if you do that!).

0
Entering edit mode

What's your reference genome? I see where you said you're following a tutorial, but if you're mapping against a draft reference or a bacterial/viral reference with really short chromosomes, then it's not at all surprising that you'd see reads mapping to different chromosomes/contigs.

The result for PCT_CHIMERAS is really interesting, too. Not yet sure what to make of that.

0
Entering edit mode

Actually after thinking about it a little more, the PCT_CHIMERAS has to be associated with your discordant alignment %. Tell me more about your samples. What kind of prep was used to make them, and what's the source organism? Did you see overrepresented sequences in your FASTQ report?

0
Entering edit mode

I see PCT_CHIMERAS is, "The percentage of reads that map outside of a maximum insert size (usually 100kb) or that have the two ends mapping to different chromosomes" so that sounds the same as discordant alignments.

These are Qiagen prepped RNA samples from mouse ECs. The mapping is with mm9. I'm not sure if I see overrepresented sequences, or do I? I am getting a problem with sequence duplication levels >=73%.

0
Entering edit mode

This one is a doozie. Any chance you could make your accepted_hits.bam available for download via Dropbox, Google Drive, or otherwise?

It would also be helpful to know what settings you used. Perhaps you can post a screenshot of your Galaxy settings? The only thing I can think of right now is that the strandedness specification is wrong...

2
Entering edit mode
6.9 years ago

Mystery solved? I think the problem was the NGS: QC and manipulation->Filter FASTQ. It looks like Devon's suggestion that the samples were getting unpaired at the filtering step was correct. As I said, the quality of the reads looks fine to me, with #s >=33 from bp 1-51. So I guess I'll just skip the filtering. I did Tophat2 on groomed but unfiltered paired sequences and got

Left reads:
Input:  41804842
Mapped:  40302717 (96.4% of input)
of these:  11427572 (28.4%) have multiple alignments (9409 have >20)
Input:  41804842
Mapped:  40038736 (95.8% of input)
of these:  11348749 (28.3%) have multiple alignments (9335 have >20)

96.1% overall read alignment rate.

Aligned pairs:  39528056
of these:   8675160 (21.9%) have multiple alignments
and:   2109241 ( 5.3%) are discordant alignments
89.5% concordant pair alignment rate.


These are the flagstat results:

111549410 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
111549410 + 0 mapped (100.00%:-nan%)
111549410 + 0 paired in sequencing
56065754 + 0 read1
55483656 + 0 read2
77803296 + 0 properly paired (69.75%:-nan%)
109184826 + 0 with itself and mate mapped
2364584 + 0 singletons (2.12%:-nan%)
15059778 + 0 with mate mapped to a different chr
1794200 + 0 with mate mapped to a different chr (mapQ>=5)


These are the Picard results:

CATEGORY                    FIRST_OF_PAIR
PF_ALIGNED_BASES            1734697303
PF_HQ_ALIGNED_BASES         1496897300
PF_HQ_ALIGNED_Q20_BASES     1485878393
PF_HQ_MEDIAN_MISMATCHES     0
PF_MISMATCH_RATE            0.279758
PF_HQ_ERROR_RATE            0.281356
PF_INDEL_RATE               0.000147
STRAND_BALANCE              0.500255
PCT_CHIMERAS                0.05368


Flagstat's 69.75% properly paired is acceptable, no? On to Cufflinks?

0
Entering edit mode

That looks much more reasonable. The number of reads aligning to different chromosomes still seems a bit high, but perhaps that's expected for this dataset (e.g., it's cancer or you have reads from species A and aligning them to species B).