Question: Mapping percent difference between hg38 and hg19
3.2 years ago, dina.hesham139 (Egypt) wrote:

Hey,

Is it normal to see a drop of ~10 percentage points in mapping rate with hg38 compared to hg19?

I mapped the same set of samples with the same tool and under the same conditions. Alignment against hg19 gives an average in the 80s (%), while alignment against hg38 drops to an average in the 70s (%).

Why would that happen?

rna-seq alignment • 1.5k views
written 3.2 years ago by dina.hesham139

What was the exact command? The difference could be explained if you only look at unambiguously mapping reads.

written 3.2 years ago by 5heikki

I used STAR. The only difference is that when building the index for hg38 I included the annotation GTF file in the command; I didn't do that for hg19. The alignment command was the same for both!

Would that have an effect?

written 3.2 years ago by dina.hesham139

Also, for hg38 I used the genome FASTA and the annotation file from Ensembl, whereas for hg19 both came from UCSC.

written 3.2 years ago by dina.hesham139

I don't know STAR. What was the exact alignment command you used? How does it report unambiguously mapping reads?

written 3.2 years ago by 5heikki

The command I used was:

STAR --genomeDir /home/hg38 --sjdbGTFfile /home/hg38.gtf --runThreadN 10 --outSAMstrandField intronMotif --readFilesIn /home/fastq_1 /home/fastq_2 --outFileNamePrefix sample1Star

# same for hg19

This is the summary for a sample mapped against hg38:

                          Number of input reads |    27873030
                      Average input read length |    202
                                    UNIQUE READS:
                   Uniquely mapped reads number |    21309030
                        Uniquely mapped reads % |    76.45%
                          Average mapped length |    200.36
                       Number of splices: Total |    9873696
            Number of splices: Annotated (sjdb) |    9782721
                       Number of splices: GT/AG |    9786065
                       Number of splices: GC/AG |    73676
                       Number of splices: AT/AC |    8042
               Number of splices: Non-canonical |    5913
                      Mismatch rate per base, % |    0.24%
                         Deletion rate per base |    0.01%
                        Deletion average length |    1.48
                        Insertion rate per base |    0.01%
                       Insertion average length |    1.49
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |    5205343
             % of reads mapped to multiple loci |    18.68%
        Number of reads mapped to too many loci |    33763
             % of reads mapped to too many loci |    0.12%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |    0.00%
                 % of reads unmapped: too short |    4.75%
                     % of reads unmapped: other |    0.01%
                                  CHIMERIC READS:
                       Number of chimeric reads |    0
                            % of chimeric reads |    0.00%

This is the summary for the same sample mapped against hg19:

                          Number of input reads |    27873030
                      Average input read length |    202
                                    UNIQUE READS:
                   Uniquely mapped reads number |    24359828
                        Uniquely mapped reads % |    87.40%
                          Average mapped length |    198.69
                       Number of splices: Total |    9840758
            Number of splices: Annotated (sjdb) |    9656134
                       Number of splices: GT/AG |    9744383
                       Number of splices: GC/AG |    72088
                       Number of splices: AT/AC |    7342
               Number of splices: Non-canonical |    16945
                      Mismatch rate per base, % |    0.50%
                         Deletion rate per base |    0.01%
                        Deletion average length |    2.04
                        Insertion rate per base |    0.02%
                       Insertion average length |    1.61
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |    657341
             % of reads mapped to multiple loci |    2.36%
        Number of reads mapped to too many loci |    3645
             % of reads mapped to too many loci |    0.01%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |    0.00%
                 % of reads unmapped: too short |    10.23%
                     % of reads unmapped: other |    0.01%
                                  CHIMERIC READS:
                       Number of chimeric reads |    0
                            % of chimeric reads |    0.00%
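
One way to read the two summaries side by side is to add up the categories rather than looking at the uniquely mapped percentage alone. The following is a minimal sketch, assuming the summaries are plain `label | value` lines as posted above; the function names are made up for illustration, and only the percentage lines needed for the sums are copied in.

```python
def parse_star_summary(text):
    """Parse 'label | value' lines of a STAR-style summary into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' carry no value
        label, value = (part.strip() for part in line.split("|", 1))
        stats[label] = value
    return stats


def percent(stats, label):
    """Return a percentage field such as '76.45%' as a float."""
    return float(stats[label].rstrip("%"))


def mapping_breakdown(stats):
    """Summarise unique, multi-mapped and unmapped percentages."""
    unique = percent(stats, "Uniquely mapped reads %")
    multi = percent(stats, "% of reads mapped to multiple loci")
    unmapped = sum(
        percent(stats, "% of reads unmapped: " + reason)
        for reason in ("too many mismatches", "too short", "other")
    )
    return {
        "unique": unique,
        "multi": multi,
        "unique_plus_multi": round(unique + multi, 2),
        "unmapped": round(unmapped, 2),
    }


# Percentage lines copied from the two summaries in this thread.
HG38_SUMMARY = """\
Uniquely mapped reads % |    76.45%
% of reads mapped to multiple loci |    18.68%
% of reads unmapped: too many mismatches |    0.00%
% of reads unmapped: too short |    4.75%
% of reads unmapped: other |    0.01%
"""

HG19_SUMMARY = """\
Uniquely mapped reads % |    87.40%
% of reads mapped to multiple loci |    2.36%
% of reads unmapped: too many mismatches |    0.00%
% of reads unmapped: too short |    10.23%
% of reads unmapped: other |    0.01%
"""

for name, text in (("hg38", HG38_SUMMARY), ("hg19", HG19_SUMMARY)):
    print(name, mapping_breakdown(parse_star_summary(text)))
```

With the numbers posted here, unique plus multi-mapped reads come to 95.13% for hg38 versus 89.76% for hg19, and unmapped reads fall from 10.24% to 4.76%. So the lower uniquely mapped percentage on hg38 goes together with a much larger multi-mapping fraction (18.68% versus 2.36%), not with more unmapped reads.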

 

written 3.2 years ago by dina.hesham139

And what does the manual of STAR say about mapping of unambiguous reads? What does the manual of STAR say about the use of a GTF file in reference to mapping? You have read the manual, right?

 

https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

modified 3.2 years ago • written 3.2 years ago by 5heikki

Nothing about mapping of unambiguous reads!!

The use of a GTF file for mapping is highly recommended!!

written 3.2 years ago by dina.hesham139

It also says something about the use of a GTF file affecting alignments. Also, unambiguous reads are discussed in the manual (e.g. under multimappers). Not my job to read the manual. If you go through it and compare your reference genomes, unmapped reads, where they map in the other reference, etc., I'm sure you'll figure out what's happening. Good luck!
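
As a starting point for comparing the two reference genomes, the sequence names and lengths can be diffed from their FASTA index files. This is a rough sketch, assuming each reference has a samtools-style `.fai` index (tab-separated, name in column 1 and length in column 2); the helper names and the toy index contents are invented for illustration and are not real hg19 or hg38 values. It strips a leading `chr` so UCSC-style and Ensembl-style names can be matched, but does not handle other naming differences such as `chrM` versus `MT`.

```python
def read_fai(text):
    """Return {sequence_name: length} from the text of a .fai index."""
    contigs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        contigs[fields[0]] = int(fields[1])
    return contigs


def strip_chr(name):
    """Drop a leading 'chr' so 'chr1' (UCSC style) matches '1' (Ensembl style)."""
    return name[3:] if name.startswith("chr") else name


def compare_references(fai_a, fai_b):
    """List sequences unique to each reference and shared ones whose lengths differ."""
    a = {strip_chr(name): length for name, length in read_fai(fai_a).items()}
    b = {strip_chr(name): length for name, length in read_fai(fai_b).items()}
    shared = a.keys() & b.keys()
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "length_differs": sorted(n for n in shared if a[n] != b[n]),
    }


# Toy index contents (made-up names and lengths, for illustration only).
TOY_UCSC_FAI = "chr1\t1000\t6\t60\t61\nchr2\t800\t1030\t60\t61\n"
TOY_ENSEMBL_FAI = (
    "1\t990\t3\t60\t61\n"
    "2\t800\t1012\t60\t61\n"
    "toy_scaffold\t50\t1830\t60\t61\n"
)

print(compare_references(TOY_UCSC_FAI, TOY_ENSEMBL_FAI))
```

Extra sequences in one reference are worth a look, since reads that have more than one possible placement are reported as multi-mapping and no longer count as uniquely mapped.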

modified 3.2 years ago • written 3.2 years ago by 5heikki

The STAR paper in Current Protocols in Bioinformatics says: "The gene annotations allow STAR to identify and correctly map spliced alignments across known splice junctions. While it is possible to run the mapping jobs without annotations, it is not recommended. When gene annotations are not available, use the 2-pass mapping"

You could map against hg19 without the annotations and see if the percentage drops accordingly, but that would be an academic exercise.
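
For a like-for-like test of that kind, it helps to generate both command lines from one template so that the annotation is the only thing that changes. The sketch below only builds and prints the commands; it uses just the options that appear in the command posted earlier in this thread. The hg19 paths and output prefixes are hypothetical placeholders, and running the printed commands would require STAR and the corresponding indexes to be present.

```python
import shlex


def star_command(genome_dir, reads, prefix, gtf=None, threads=10):
    """Build a STAR argument list; --sjdbGTFfile is added only when gtf is given."""
    cmd = ["STAR", "--genomeDir", genome_dir]
    if gtf is not None:
        cmd += ["--sjdbGTFfile", gtf]
    cmd += [
        "--runThreadN", str(threads),
        "--outSAMstrandField", "intronMotif",
        "--readFilesIn", *reads,
        "--outFileNamePrefix", prefix,
    ]
    return cmd


# Read paths as posted in the thread; hg19 paths and prefixes are assumptions.
READS = ["/home/fastq_1", "/home/fastq_2"]

with_gtf = star_command("/home/hg19", READS, "sample1_hg19_gtf_", gtf="/home/hg19.gtf")
without_gtf = star_command("/home/hg19", READS, "sample1_hg19_nogtf_")

for cmd in (with_gtf, without_gtf):
    print(shlex.join(cmd))
```

Keeping every other option identical means any change in the summary percentages between the two runs can be attributed to the annotation.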

modified 3.2 years ago • written 3.2 years ago by genomax