Question: Calculate Mapping Rate for an alignment to each sequence of a multi-fasta reference
gravatar for kspata
13 months ago by
kspata50 wrote:


I have aligned a sample PE RNA-seq data to a multi-fasta reference. I wish to find out mapping rate for each sequence of the multi-fasta reference. How can I extract that information from SAM or BAM file?

I have bowtie2 mapping rate for the overall alignment.

**6259700** reads; of these:
  6259700 (100.00%) were paired; of these:
    6192017 (98.92%) aligned concordantly 0 times
    **2317** (0.04%) aligned concordantly exactly 1 time
    **65366** (1.04%) aligned concordantly >1 times
    6192017 pairs aligned concordantly 0 times; of these:
      **151** (0.00%) aligned discordantly 1 time
    6191866 pairs aligned 0 times concordantly or discordantly; of these:
      12383732 mates make up the pairs; of these:
        12379013 (99.96%) aligned 0 times
        **479** (0.00%) aligned exactly 1 time
        **4240** (0.03%) aligned >1 times
1.12% overall alignment rate

I looked up the formula on this link [][1] to calculate mapping rate from this information.

ANS: 1.121%

I realized that I will need to extract the information in BOLD for each reference sequence from a single BAM/SAM file to calculate mapping rate.

Can anyone please give an advice on how to achieve this?

ADD COMMENTlink modified 13 months ago by h.mon26k • written 13 months ago by kspata50

However you calculate it, your mapping rate is really low, indicating problems with your data. Is this just an example, or are these numbers from your data?

ADD REPLYlink written 13 months ago by h.mon26k

Thank you for replying. These are the numbers from my data. The multi fasts file reference which I used is a synthetic construct. The total mRNA data was mapped to this reference. The data is Miseq with PE 300. What can I do to increase the mapping rate.? I used bowtie2 with -I 0 and -X 1000 with -fr option.

ADD REPLYlink written 13 months ago by kspata50

With a <1% mapping rate you have a serious issue with your data. Try the following:

Run fastqc on your input files. Are there any errors or warning?

Align to the reference genome for your species with default bowtie2 setting. Are you now getting most of your reads mapped? If so, your synthetic construct is likely to be garbage. Work out what when wrong and make a new one.

If you still have 99% of your reads not mapping to your reference genome then it's likely to be an issue with library preparation/sequencing. The most likely explanation is that your library preparation failed (e.g did you accidentally wash all your DNA away in one of your steps?) and there's nothing you can do bioinformatically to recover.

Try BLASTing a few hundred of the unmapped reads. Do they actually come from your species?

ADD REPLYlink modified 13 months ago • written 13 months ago by d-cameron2.0k

The multi fasts file reference which I used is a synthetic construct.

Could you add further details? Is this a complete genome, with a synthetic construct added? Or your reference genome consists of a synthetic construct only? What is the organism you are studying?

ADD REPLYlink written 13 months ago by h.mon26k

The reference is sequence of 4 transcripts. I performed a BLAST and came to know it mapped closely to immunoglobulin light and heavy chain. This is all the information which I could gather.

ADD REPLYlink written 13 months ago by kspata50

You can't map a RNAseq dataset to a reference of just four transcripts, there probably will be a lot of spurious mappings due to relatively similar transcripts mapping to your "reference", as the correct genes are not present in your reference.

Why are you doing this? What exactly do you expect to accomplish?

ADD REPLYlink written 13 months ago by h.mon26k

I wish to know variants in my data with respect to this reference. But even though the alignment rate is low I was interested to know if mapping rates for each sequence in a multi-fasta reference file can be calculated using SAM or BAM file?

ADD REPLYlink modified 13 months ago • written 13 months ago by kspata50

You first need to address your very low mapping rate as addressed by d-cameron

You also should align to the full genome using a spliced aligner such as HISAT2 or STAR.

ADD REPLYlink written 13 months ago by WouterDeCoster39k

Thank you for the suggestion. I will use these tools and align input data to a complete genome.

ADD REPLYlink written 13 months ago by kspata50

When you say overall alignment rate (OVR), you need to consider a lot of factors

  • uniquely mapped reads
  • multi-mapped reads
  • paired reads and singletons
  • directionality and distance of reads while aligning (concordant and discordant reads)

All of these will contribute to the overall alignment rate which I have tried explaining here

Calculating OVR for all the sequences individually will be losing a lot of that information (for example, exon skipping, probable duplications, translocations etc.). If you want to know the just the number of raw reads, than that's a separate stuff.

ADD REPLYlink written 13 months ago by Vijay Lakhujani4.2k
gravatar for h.mon
13 months ago by
h.mon26k wrote:

I will summarize / add recommendations and leave further comments. I will make some assumptions, as not all details are clear from your post. My main assumption is that your data is whole transcriptome RNA-seq, not some kind of targeted library.

1) Map the reads to the whole genome / whole transcriptome, not to an artificial, hugely down-sized artificial reference. When using an artificial, small reference, most likely there will be spurious mappings of somewhat similar reads, which would map to the correct location, in case the correct location were present.

2) Do not use bowtie if mapping RNAseq to a reference genome - if using a reference transcriptome it can be used.

3) To evaluate mapping rate to regions of interest, you can use samtools or bedtools specifying the interval of interest to count reads mapping to specif genes. As Vijay pointed out, you will have to think about what include / exclude when calculating mapping rates.

4) Mapping reads to immunoglobulins is complicated, as these regions may undergo somatic recombination - you did not say the source material of your reads. See, for example, the first paragraph from the "Results" section from the paper Immunoglobulin transcript sequence and somatic hypermutation computation from unselected RNA-seq reads in chronic lymphocytic leukemia.

Hope I've helped.

ADD COMMENTlink written 13 months ago by h.mon26k
gravatar for toralmanvar
13 months ago by
toralmanvar770 wrote:

You can extract the number of reads mapping to each of your reference sequence using samtools idxstats. In output file following are the columns:

  1. represents reference sequence name
  2. reference sequence length
  3. number of reads mapping to reference genome
  4. number of unmapped reads whose mate is mapped

Using 3rd column info, you can calculate the percentage of mapped reads on each reference.

ADD COMMENTlink modified 13 months ago • written 13 months ago by toralmanvar770

These are the results for

 samtools idxstats input.sorted.bam

    X1       3580    9483    0
    X2     2836    8110    0
    X3     2188    8220    0
    X4   1468    6698    0
    *       0       0       0

These are the results of samtools flagstat

samtools flagstat input.sorted.bam

32511 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
32511 + 0 mapped (100.00%:-nan%)
32511 + 0 paired in sequencing
16715 + 0 read1
15796 + 0 read2
31500 + 0 properly paired (96.89%:-nan%)
31522 + 0 with itself and mate mapped
989 + 0 singletons (3.04%:-nan%)
10 + 0 with mate mapped to a different chr
6 + 0 with mate mapped to a different chr (mapQ>=5)

How can I calculate mapping rate say for X1 from this data? Will it be

(number of reads mapped to X1/ Total mapped reads)*100  
(9483/32511)*100 = 29.17%

Please let me know if this is correct.

ADD REPLYlink written 13 months ago by kspata50
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 953 users visited in the last hour