Question: high discordant pair % with Tophat2 on Galaxy
1
gravatar for will.schachterle
5.5 years ago by
United States
will.schachterle30 wrote:

I'm totally new to all NGS tools so please bear with me.

I've been trying to analyze data some paired reads with the tools on Galaxy. After "grooming" the FASTQ files, I attempted to map the reads with Tophat2, following the tutorial here (https://docs.uabgrid.uab.edu/wiki/UAB_Galaxy_RNA_Seq_Step_by_Step_Tutorial). The alignment summary is pasted below:

Left reads:
               Input:  36728489
              Mapped:  35783336 (97.4% of input)
            of these:  10212030 (28.5%) have multiple alignments (4222 have >20)
Right reads:
               Input:  35892294
              Mapped:  35027752 (97.6% of input)
            of these:  10007714 (28.6%) have multiple alignments (4283 have >20)
97.5% overall read alignment rate.
Aligned pairs:  34152758
     of these:   2784707 ( 8.2%) have multiple alignments
          and:  34025873 (99.6%) are discordant alignments
 0.4% concordant pair alignment rate.

Why are there so many "discordant alignments?" And is this why I'm getting such a low % of "properly paired" reads according to flagstat (pasted below)?

167038160 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
167038160 + 0 mapped (100.00%:-nan%)
167038160 + 0 paired in sequencing
84138107 + 0 read1
82900053 + 0 read2
16240 + 0 properly paired (0.01%:-nan%)
162899512 + 0 with itself and mate mapped
4138648 + 0 singletons (2.48%:-nan%)
154361318 + 0 with mate mapped to a different chr
42840562 + 0 with mate mapped to a different chr (mapQ>=5)

And just for good measure, here is output from Picard Alignment summary

CATEGORY
FIRST_OF_PAIR 
TOTAL_READS
35796836 
PF_READS
35796836 
PCT_PF_READS
1 
PF_NOISE_READS
8 
PF_READS_ALIGNED
35796828 
PCT_PF_READS_ALIGNED
1 
PF_ALIGNED_BASES
1532498561 
PF_HQ_ALIGNED_READS
24005633 
PF_HQ_ALIGNED_BASES
1029373773 
PF_HQ_ALIGNED_Q20_BASES
1029373773 
PF_HQ_MEDIAN_MISMATCHES
0 
PF_MISMATCH_RATE
0.282276 
PF_HQ_ERROR_RATE
0.284893 
PF_INDEL_RATE
0.000134 
MEAN_READ_LENGTH
51 
READS_ALIGNED_IN_PAIRS
34152750 
PCT_READS_ALIGNED_IN_PAIRS
0.954072 
BAD_CYCLES
0 
STRAND_BALANCE
0.500655 
PCT_CHIMERAS
0.999132 
PCT_ADAPTER
0 

I expect in order to answer these questions, one would need to know more about the quality of my reads, but as I said, I'm really flying blind here, so I can't even anticipate what additional info, you might need. I can say that according to my FastQC results on 223 and 224 (history #234 and #235), both look like they have lengths of 51bp. Also, in the FastQc results I'm seeing per base sequence qualities that are initially around 33, but then are steady >35 for the rest of the read. I apologize for what is probably a stupid question. Hey, I gotta start somewhere.

rna-seq tophat2 galaxy • 5.6k views
ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by will.schachterle30
1

How did you groom the reads? It's likely that the files became out of sync at that step.

ADD REPLYlink written 5.5 years ago by Devon Ryan92k

The initial file format was fastqillumina

Input FASTQ quality scores type: Sanger + Illumina 1.8+

Output FASTQ quality scores type: Sanger

Force Quality Score encoding: ASCII

Summarize input data: Summarize input

ADD REPLYlink written 5.5 years ago by will.schachterle30
1

My question was actually more whether you did the "NGS: QC and manipulation > FASTX-TOOLKIT FOR FASTQ DATA" step. Fastx is known to be problematic with paired-end reads. As a rule if you have paired-end reads, don't use any tool that can't accept both fastq files at once (except FastQC). I assume there's a trimmer (trimmomatic, Trim Galore!, etc.) available through galaxy that can properly handle paired-end reads. If not, you can trim locally (trimmomatic is in Java, so it'll even run on Windows) and then upload the resulting 2 files.

ADD REPLYlink written 5.5 years ago by Devon Ryan92k

I did NGS: QC and manipulation > Filter FASTQ and set a min quality to 20. I see that there's a checkbox for "This is paired end data" and I'm pretty sure I checked that. But I can run this again, and be certain to check it. I don't actually think this filtering is really necessary because my reads seem good: quality scores > 33 through read and sizes are all 51 bp. I am getting a problem with sequence duplication levels >=73%.

ADD REPLYlink written 5.5 years ago by will.schachterle30

The duplication level (I assume you're getting this from FastQC) is fairly common for RNAseq, so I wouldn't worry about it. That's only really useful for looking at DNA sequencing, where you expect to more uniformly cover things (in RNAseq, anything really highly expressed will throw the duplication level metric off).

Try aligning the untrimmed/ungroomed reads and see if you get the same results. If so, perhaps try to post the fastq files somewhere and one of us can have a look at what might have gone wrong (please note the species you're working on if you do that!).

ADD REPLYlink written 5.5 years ago by Devon Ryan92k

What's your reference genome? I see where you said you're following a tutorial, but if you're mapping against a draft reference or a bacterial/viral reference with really short chromosomes, then it's not at all surprising that you'd see reads mapping to different chromosomes/contigs.

The result for PCT_CHIMERAS is really interesting, too. Not yet sure what to make of that.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Dan D6.9k

Actually after thinking about it a little more, the PCT_CHIMERAS has to be associated with your discordant alignment %. Tell me more about your samples. What kind of prep was used to make them, and what's the source organism? Did you see overrepresented sequences in your FASTQ report?

ADD REPLYlink written 5.5 years ago by Dan D6.9k

I see PCT_CHIMERAS is, "The percentage of reads that map outside of a maximum insert size (usually 100kb) or that have the two ends mapping to different chromosomes" so that sounds the same as discordant alignments.

These are Qiagen prepped RNA samples from mouse ECs. The mapping is with mm9. I'm not sure if I see overrepresented sequences, or do I? I am getting a problem with sequence duplication levels >=73%.

ADD REPLYlink written 5.5 years ago by will.schachterle30

This one is a doozie. Any chance you could make your accepted_hits.bam available for download via Dropbox, Google Drive, or otherwise?

 

It would also be helpful to know what settings you used. Perhaps you can post a screenshot of your Galaxy settings? The only thing I can think of right now is that the strandedness specification is wrong...

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Dan D6.9k
2
gravatar for will.schachterle
5.5 years ago by
United States
will.schachterle30 wrote:

Mystery solved? I think the problem was the NGS: QC and manipulation->Filter FASTQ. It looks like Devon's suggestion that the samples were getting unpaired at the filtering step was correct. As I said, the quality of the reads looks fine to me, with #s >=33 from bp 1-51. So I guess I'll just skip the filtering. I did Tophat2 on groomed but unfiltered paired sequences and got

Left reads:
               Input:  41804842
              Mapped:  40302717 (96.4% of input)
            of these:  11427572 (28.4%) have multiple alignments (9409 have >20)
Right reads:
               Input:  41804842
              Mapped:  40038736 (95.8% of input)
            of these:  11348749 (28.3%) have multiple alignments (9335 have >20)
96.1% overall read alignment rate.

Aligned pairs:  39528056
     of these:   8675160 (21.9%) have multiple alignments
          and:   2109241 ( 5.3%) are discordant alignments
89.5% concordant pair alignment rate.

These are the flagstat results: 

111549410 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
111549410 + 0 mapped (100.00%:-nan%)
111549410 + 0 paired in sequencing
56065754 + 0 read1
55483656 + 0 read2
77803296 + 0 properly paired (69.75%:-nan%)
109184826 + 0 with itself and mate mapped
2364584 + 0 singletons (2.12%:-nan%)
15059778 + 0 with mate mapped to a different chr
1794200 + 0 with mate mapped to a different chr (mapQ>=5)

These are the Picard results: 

CATEGORY FIRST_OF_PAIR 
TOTAL_READS 40479451 
PF_READS 40479451 
PCT_PF_READS
PF_NOISE_READS 10 
PF_READS_ALIGNED 40479441 
PCT_PF_READS_ALIGNED
PF_ALIGNED_BASES 1734697303 
PF_HQ_ALIGNED_READS 34923316 
PF_HQ_ALIGNED_BASES 1496897300 
PF_HQ_ALIGNED_Q20_BASES 1485878393 
PF_HQ_MEDIAN_MISMATCHES
PF_MISMATCH_RATE 0.279758 
PF_HQ_ERROR_RATE 0.281356 
PF_INDEL_RATE 0.000147 
MEAN_READ_LENGTH 51 
READS_ALIGNED_IN_PAIRS 39528047 
PCT_READS_ALIGNED_IN_PAIRS 0.976497 
BAD_CYCLES
STRAND_BALANCE 0.500255 
PCT_CHIMERAS 0.05368 
PCT_ADAPTER  

 

Flagstat's 69.75% properly paired is acceptable, no? On to Cufflinks? 

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by will.schachterle30

That looks much more reasonable. The number of reads aligning to different chromosomes still seems a bit high, but perhaps that's expected for this dataset (e.g., it's cancer or you have reads from species A and aligning them to species B).

ADD REPLYlink written 5.5 years ago by Devon Ryan92k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1386 users visited in the last hour