Question

high discordant pair % with Tophat2 on Galaxy

1

Entering edit mode

9.9 years ago

will.schachterle ▴ 30

I'm totally new to all NGS tools so please bear with me.

I've been trying to analyze data some paired reads with the tools on Galaxy. After "grooming" the FASTQ files, I attempted to map the reads with Tophat2, following the tutorial here. The alignment summary is pasted below:

Left reads:
               Input:  36728489
              Mapped:  35783336 (97.4% of input)
            of these:  10212030 (28.5%) have multiple alignments (4222 have >20)
Right reads:
               Input:  35892294
              Mapped:  35027752 (97.6% of input)
            of these:  10007714 (28.6%) have multiple alignments (4283 have >20)
97.5% overall read alignment rate.
Aligned pairs:  34152758
     of these:   2784707 ( 8.2%) have multiple alignments
          and:  34025873 (99.6%) are discordant alignments
 0.4% concordant pair alignment rate.

Why are there so many "discordant alignments"? And is this why I'm getting such a low % of "properly paired" reads according to flagstat (pasted below)?

167038160 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
167038160 + 0 mapped (100.00%:-nan%)
167038160 + 0 paired in sequencing
84138107 + 0 read1
82900053 + 0 read2
16240 + 0 properly paired (0.01%:-nan%)
162899512 + 0 with itself and mate mapped
4138648 + 0 singletons (2.48%:-nan%)
154361318 + 0 with mate mapped to a different chr
42840562 + 0 with mate mapped to a different chr (mapQ>=5)

And just for good measure, here is output from Picard Alignment summary

CATEGORY    FIRST_OF_PAIR
TOTAL_READS    35796836
PF_READS    35796836
PCT_PF_READS    1
PF_NOISE_READS    8
PF_READS_ALIGNED    35796828
PCT_PF_READS_ALIGNED    1
PF_ALIGNED_BASES    1532498561
PF_HQ_ALIGNED_READS    24005633
PF_HQ_ALIGNED_BASES    1029373773
PF_HQ_ALIGNED_Q20_BASES    1029373773
PF_HQ_MEDIAN_MISMATCHES    0
PF_MISMATCH_RATE    0.282276
PF_HQ_ERROR_RATE    0.284893
PF_INDEL_RATE    0.000134
MEAN_READ_LENGTH    51
READS_ALIGNED_IN_PAIRS    34152750
PCT_READS_ALIGNED_IN_PAIRS    0.954072
BAD_CYCLES    0
STRAND_BALANCE    0.500655
PCT_CHIMERAS    0.999132
PCT_ADAPTER    0

I expect in order to answer these questions, one would need to know more about the quality of my reads, but as I said, I'm really flying blind here, so I can't even anticipate what additional info, you might need. I can say that according to my FastQC results on 223 and 224 (history #234 and #235), both look like they have lengths of 51bp. Also, in the FastQc results I'm seeing per base sequence qualities that are initially around 33, but then are steady >35 for the rest of the read. I apologize for what is probably a stupid question. Hey, I gotta start somewhere.

galaxy tophat2 rna-seq • 7.3k views

ADD COMMENT • link updated 2.6 years ago by Ram 43k • written 9.9 years ago by will.schachterle ▴ 30

1

Entering edit mode

How did you groom the reads? It's likely that the files became out of sync at that step.

ADD REPLY • link 9.9 years ago by Devon Ryan 104k

0

Entering edit mode

The initial file format was fastqillumina

Input FASTQ quality scores type: Sanger + Illumina 1.8+
Output FASTQ quality scores type: Sanger
Force Quality Score encoding: ASCII
Summarize input data: Summarize input

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by will.schachterle ▴ 30

1

Entering edit mode

My question was actually more whether you did the "NGS: QC and manipulation > FASTX-TOOLKIT FOR FASTQ DATA" step. Fastx is known to be problematic with paired-end reads. As a rule if you have paired-end reads, don't use any tool that can't accept both fastq files at once (except FastQC). I assume there's a trimmer (trimmomatic, Trim Galore!, etc.) available through galaxy that can properly handle paired-end reads. If not, you can trim locally (trimmomatic is in Java, so it'll even run on Windows) and then upload the resulting 2 files.

ADD REPLY • link 9.9 years ago by Devon Ryan 104k

0

Entering edit mode

I did NGS: QC and manipulation > Filter FASTQ and set a min quality to 20. I see that there's a checkbox for "This is paired end data" and I'm pretty sure I checked that. But I can run this again, and be certain to check it. I don't actually think this filtering is really necessary because my reads seem good: quality scores > 33 through read and sizes are all 51 bp. I am getting a problem with sequence duplication levels >=73%.

ADD REPLY • link 9.9 years ago by will.schachterle ▴ 30

0

Entering edit mode

The duplication level (I assume you're getting this from FastQC) is fairly common for RNAseq, so I wouldn't worry about it. That's only really useful for looking at DNA sequencing, where you expect to more uniformly cover things (in RNAseq, anything really highly expressed will throw the duplication level metric off).

Try aligning the untrimmed/ungroomed reads and see if you get the same results. If so, perhaps try to post the fastq files somewhere and one of us can have a look at what might have gone wrong (please note the species you're working on if you do that!).

ADD REPLY • link 9.9 years ago by Devon Ryan 104k

0

Entering edit mode

What's your reference genome? I see where you said you're following a tutorial, but if you're mapping against a draft reference or a bacterial/viral reference with really short chromosomes, then it's not at all surprising that you'd see reads mapping to different chromosomes/contigs.

The result for PCT_CHIMERAS is really interesting, too. Not yet sure what to make of that.

ADD REPLY • link 9.9 years ago by Dan D 7.4k

0

Entering edit mode

Actually after thinking about it a little more, the PCT_CHIMERAS has to be associated with your discordant alignment %. Tell me more about your samples. What kind of prep was used to make them, and what's the source organism? Did you see overrepresented sequences in your FASTQ report?

ADD REPLY • link 9.9 years ago by Dan D 7.4k

0

Entering edit mode

I see PCT_CHIMERAS is, "The percentage of reads that map outside of a maximum insert size (usually 100kb) or that have the two ends mapping to different chromosomes" so that sounds the same as discordant alignments.

These are Qiagen prepped RNA samples from mouse ECs. The mapping is with mm9. I'm not sure if I see overrepresented sequences, or do I? I am getting a problem with sequence duplication levels >=73%.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by will.schachterle ▴ 30

0

Entering edit mode

This one is a doozie. Any chance you could make your accepted_hits.bam available for download via Dropbox, Google Drive, or otherwise?

It would also be helpful to know what settings you used. Perhaps you can post a screenshot of your Galaxy settings? The only thing I can think of right now is that the strandedness specification is wrong...

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Dan D 7.4k

Ram · Answer 1 · 2014-05-22

Mystery solved? I think the problem was the NGS: QC and manipulation->Filter FASTQ. It looks like Devon's suggestion that the samples were getting unpaired at the filtering step was correct. As I said, the quality of the reads looks fine to me, with #s >=33 from bp 1-51. So I guess I'll just skip the filtering. I did Tophat2 on groomed but unfiltered paired sequences and got

Left reads:
               Input:  41804842
              Mapped:  40302717 (96.4% of input)
            of these:  11427572 (28.4%) have multiple alignments (9409 have >20)
Right reads:
               Input:  41804842
              Mapped:  40038736 (95.8% of input)
            of these:  11348749 (28.3%) have multiple alignments (9335 have >20)

96.1% overall read alignment rate.

Aligned pairs:  39528056
     of these:   8675160 (21.9%) have multiple alignments
          and:   2109241 ( 5.3%) are discordant alignments
89.5% concordant pair alignment rate.

These are the flagstat results:

111549410 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
111549410 + 0 mapped (100.00%:-nan%)
111549410 + 0 paired in sequencing
56065754 + 0 read1
55483656 + 0 read2
77803296 + 0 properly paired (69.75%:-nan%)
109184826 + 0 with itself and mate mapped
2364584 + 0 singletons (2.12%:-nan%)
15059778 + 0 with mate mapped to a different chr
1794200 + 0 with mate mapped to a different chr (mapQ>=5)

These are the Picard results:

CATEGORY                    FIRST_OF_PAIR
TOTAL_READS                 40479451
PF_READS                    40479451
PCT_PF_READS                1
PF_NOISE_READS              10
PF_READS_ALIGNED            40479441
PCT_PF_READS_ALIGNED        1
PF_ALIGNED_BASES            1734697303
PF_HQ_ALIGNED_READS         34923316
PF_HQ_ALIGNED_BASES         1496897300
PF_HQ_ALIGNED_Q20_BASES     1485878393
PF_HQ_MEDIAN_MISMATCHES     0
PF_MISMATCH_RATE            0.279758
PF_HQ_ERROR_RATE            0.281356
PF_INDEL_RATE               0.000147
MEAN_READ_LENGTH            51
READS_ALIGNED_IN_PAIRS      39528047
PCT_READS_ALIGNED_IN_PAIRS  0.976497
BAD_CYCLES                  0
STRAND_BALANCE              0.500255
PCT_CHIMERAS                0.05368
PCT_ADAPTER

Flagstat's 69.75% properly paired is acceptable, no? On to Cufflinks?