Question: Low mapping percentage with TopHat
0
gravatar for Sapi
21 months ago by
Sapi0
USA
Sapi0 wrote:

I am new to RNA-seq data analysis and trying to understand how things work, please excuse me if I sound naive. I have 1X150 bp Hiseq 3000 reads and the sequence quality looks good in FastQC. There was almost zero adapter contamination (few libraries showed 0.1% adapter sequence), so we proceeded without any adapter trimming. We used the most recent reference genome available in the database and average mapping percentage was 38% with TopHat. We got pretty good number of differentially expressing transcripts using cufflinks and most of them looked quite relevant. Later, we realized that the mapping percentage looks too low, so shifted to Hisat2. After running the same untrimmed data with the same reference genome, there is considerable improvement in the mapping percentage (average is 55%).
My first question is why has the mapping percentage changed so much when I am still using the same input data (without any trimming) and same reference genome and secondly is it okay to proceed with this Hisat2 data or should I trim the adapters and run the Hisat2 again before using cufflinks? Thanks in advance!!

rna-seq • 831 views
ADD COMMENTlink written 21 months ago by Sapi0
1

Newer aligners will tend to perform better so it is not surprising that you are getting better results with HISAT2. Still 55% alignment is low so it would be worth checking a few of non-aligning reads by blast+ at NCBI to see what they are aligning to. It is possible that you could have some contamination in your data or simply rRNA, that you can't use anyway.

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax86k

Thanks for your input genomax. This experiment was about host-pathogen interaction. I tried aligning the unmapped reads to the pathogen reference genome and the average percent was ~1%, still too much left over. Also, I performed Ribo-minus before the library preparation and the rRNA contamination was almost negligible. The FastQC shows this for the overrepresented sequences (top 2-an example):

   Overrepresented sequences
    Sequence                                                                                                                  Count Percentage  Possible Source
    CTCGGCTCTCGCATCGATGAAGAACGTAGCGAAATGCGATACTTGGTGTG   362979  1.4301295212831344 No Hit
    CCCAGGCAGACGTGCCCTCGGCCTAATGGCTTCGGGCGCAACTTGCGTTC  265085  1.0444292483844513  No Hit

The first sequence is Papaver sominferum 5.8s rRNA (NCBI BLAST) and the second one as well is a 5.8s rRNA, so I also suspected rRNA contamination but the percentage is not too much. I agree with your suggestion to perform Blast for all the unmapped reads.

ADD REPLYlink modified 21 months ago by genomax86k • written 21 months ago by Sapi0

I performed Blast with a set of unmapped reads and majority of them hit against rRNA. So, that means even after performing Ribo-minus, there is still rRNA contamination in my data. Thanks genomax for pointing that out.

ADD REPLYlink modified 21 months ago • written 21 months ago by Sapi0

Can you paste the HISAT2 alignment stats here? Meanwhile , you can understand the summary statistics here

What is the percentage of uniquely mapped reads?

EDIT#1

See this related post

Too low mapping percentage using HISAT2 on human reference genome.

ADD REPLYlink modified 21 months ago • written 21 months ago by lakhujanivijay5.1k

Thanks Vijay for your input.

Here is an example of my Hisat2 alignment stats:
36006390 reads; of these:
  36006390 (100.00%) were unpaired; of these:
    13250438 (36.80%) aligned 0 times
    4644716 (12.90%) aligned exactly 1 time
    18111236 (50.30%) aligned >1 times
63.20% overall alignment rate

We were initially worried about the higher percentage of multiple aligned reads but our plant is an auto-tetraploid, may be that's one of the reason. I would like to know your interpretation of these stats. Thanks

ADD REPLYlink modified 21 months ago by genomax86k • written 21 months ago by Sapi0

our plant is an auto-tetraploid

Is the reference you are aligning to, a haploid copy?

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax86k

I just looked at the reference article and it is mentioned that they used a "homozygous doubled-monoploid" to sequence and assemble the genome.

ADD REPLYlink written 21 months ago by Sapi0

Still 50 % data mapped multiple times. Did you check for contamination?

ADD REPLYlink written 21 months ago by lakhujanivijay5.1k

Haven't checked the mapped reads yet.

ADD REPLYlink written 21 months ago by Sapi0

My first question is why has the mapping percentage changed so much when I am still using the same input data (without any trimming) and same reference genome

It is not given the new software performs better than old/previous one, in general. However, the authors of the software (Tophat and HISAT2) recommended HISAT2 in place of Tophat. Given that, the assumption/expectation is that HISAT2 performs better and faster compared to Tophat, You can contact the HISAT2 development group to know better and in detail.

secondly is it okay to proceed with this Hisat2 data or should I trim the adapters and run the Hisat2 again before using cufflinks?

In general, reviewers expect peers to be aware of newer development in the respective fields and use them. Bioinformatics is no different. My free suggestion is to go with Hisat2 instead of Tophat. If there is no adapter contamination (as mentioned in OP), there is no point in doing it.

It would be interesting (to the community here) if you could post some comparisons between Hisat2 vs tophat which can be referred in future posts if some one else comes here with the same question. However that is not necessary and is at your discretion.

Instead of Cufflinks work flow, you can also look at stringtie, ballgown workflow. In my understanding, tophat-cufflinks-cummerbund are supposed to tuxedo suite protocol 1 and hisat2-stringtie-ballgown is tuxedo suite protocol 2.

ADD REPLYlink modified 21 months ago • written 21 months ago by cpad011213k

We are also preferring to use Hisat2. I will surely post a general comparison of mapping percentage between TopHat and Hisat2. I am using Tuxedo protocol1 for the time being but will surely try protocol2 as well to see if there is any difference.

Thanks

ADD REPLYlink written 21 months ago by Sapi0

What options did you use? Did you take care of applying the same rules for reporting an alignment?

ADD REPLYlink written 21 months ago by Martombo2.6k

Hi Martombo, both TopHat AND Hisat2 were performed at default parameter settings.

ADD REPLYlink written 21 months ago by Sapi0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1480 users visited in the last hour