Question

STAR alignment - high % of reads mapped to multiple loci

14 months ago

sophie • 0

Hello,

I'm relatively new to bulk RNA-seq analysis and have run into a quality-control issue that affects only one cell type. I isolated fibro-adipogenic progenitors (FAP) and myofibers (MF) from mouse diaphragms and performed paired-end bulk sequencing on these enriched populations.

I removed adapters and polyX sequences with fastp, and I believe this worked because no over-represented sequences are shared across all of my samples. The problem is the uniquely mapped read rate: it is ~70% in my MF samples but ~90% in my FAP samples, even though both were processed identically. The FAP samples behave very similarly to one another, as do the MF samples; no individual sample stands out as poor. Only the MF samples still contain a number of over-represented sequences (when I BLAST them they come back as Col19a1, myosin type 3a, or general mitochondrial genome).

Is there a specific way I should map reads that are specific to myofibers? I suspect the mapping problem stems from these MF-specific over-represented sequences: the reads cannot be confidently assigned to any of the three identities above, which I imagine will also be a problem when quantifying counts later. I'm not sure how to proceed without over-editing the FAP samples or settling for a less-than-ideal mapping rate in the MF samples (I want to process both the same way so that I can compare the populations biologically).

Things I have tried so far: mapping to a different reference genome (mouse NCBI instead of mouse GENCODE), skipping fastp processing (in case over-trimming had reduced the complexity of these over-represented sequences), and setting a more lenient minimum PE overlap in STAR. Nothing has helped the mapped ratio by anything more than 1-2%. Any advice or comments would be helpful!

Here is an example STAR output of my MF population:

                     Number of input reads |    66611297
                  Average input read length |   197
                                UNIQUE READS:
               Uniquely mapped reads number |   44636674
                    **Uniquely mapped reads % | 67.01%**
                      Average mapped length |   197.48
                   Number of splices: Total |   16645411
        Number of splices: Annotated (sjdb) |   16422096
                   Number of splices: GT/AG |   16494723
                   Number of splices: GC/AG |   121857
                   Number of splices: AT/AC |   10332
           Number of splices: Non-canonical |   18499
                  Mismatch rate per base, % |   0.21%
                     Deletion rate per base |   0.00%
                    Deletion average length |   1.77
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.21
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   21052075
         **% of reads mapped to multiple loci | 31.60%**
    Number of reads mapped to too many loci |   94711
         % of reads mapped to too many loci |   0.14%
                              UNMAPPED READS:

Example STAR output of my FAP population:

                      Number of input reads |   50212948
                  Average input read length |   196
                                UNIQUE READS:
               Uniquely mapped reads number |   46101833
                    **Uniquely mapped reads % | 91.81%**
                      Average mapped length |   196.16
                   Number of splices: Total |   34645401
        Number of splices: Annotated (sjdb) |   34292246
                   Number of splices: GT/AG |   34320536
                   Number of splices: GC/AG |   279275
                   Number of splices: AT/AC |   20118
           Number of splices: Non-canonical |   25472
                  Mismatch rate per base, % |   0.19%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.85
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.29
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   3308720
         **% of reads mapped to multiple loci | 6.59%**
    Number of reads mapped to too many loci |   116989
         % of reads mapped to too many loci |   0.23%
                              UNMAPPED READS:
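
To compare these summaries across many samples, a small shell helper can pull the key percentages out of each STAR `Log.final.out` file. This is only a minimal sketch: the function name `star_summary` and the file names in the usage line are hypothetical, and it assumes the `key | value` layout shown in the two outputs above.

```shell
# Print one tab-separated row per STAR Log.final.out file:
# file, input reads, % unique, % multi-mapped, % mapped to too many loci.
# star_summary is a hypothetical helper name, not part of STAR.
star_summary() {
    printf 'file\tinput_reads\tunique_pct\tmulti_pct\ttoo_many_pct\n'
    for log in "$@"; do
        awk -F'[|]' -v name="$log" '
            {
                key = $1; val = $2
                # Trim whitespace (and any stray emphasis asterisks) from the key,
                # and additionally the trailing percent sign from the value.
                gsub(/^[ \t*]+|[ \t*]+$/, "", key)
                gsub(/^[ \t*]+|[ \t*%]+$/, "", val)
            }
            key == "Number of input reads"              { n = val }
            key == "Uniquely mapped reads %"            { u = val }
            key == "% of reads mapped to multiple loci" { m = val }
            key == "% of reads mapped to too many loci" { t = val }
            END { printf "%s\t%s\t%s\t%s\t%s\n", name, n, u, m, t }
        ' "$log"
    done
}
```

Usage would look like `star_summary MF1_Log.final.out FAP1_Log.final.out > mapping_summary.tsv`, which makes the MF versus FAP difference easy to view side by side.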

Here is my STAR code:

STAR --runThreadN ${num_cores} \
 --readFilesCommand zcat \
 --genomeDir ${star_index} \
 --readFilesIn ${input_file_R1} ${input_file_R2} \
 --outFileNamePrefix ${star_output_dir}/${sample_name2}_ \
 --outSAMtype BAM SortedByCoordinate \
 --twopassMode Basic \
 --peOverlapNbasesMin 10
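
One way to see where the multi-mapping reads concentrate (for example on the mitochondrial genome) is to tally primary alignments per reference sequence, split by the `NH` tag (number of loci the read maps to) that STAR writes by default. The following is a rough sketch rather than a definitive method: it assumes samtools is installed, and `multimap_by_chrom` is a hypothetical helper name.

```shell
# Read SAM records on stdin and print, per reference sequence:
# name, reads with NH == 1 (unique), reads with NH > 1 (multi-mapped),
# sorted by the multi-mapped count in descending order.
# multimap_by_chrom is a hypothetical helper name.
multimap_by_chrom() {
    awk -F'\t' '
        /^@/ { next }                          # skip SAM header lines
        {
            nh = 1                             # treat a missing NH tag as unique
            for (i = 12; i <= NF; i++)         # optional tags start at field 12
                if ($i ~ /^NH:i:/) nh = substr($i, 6) + 0
            if (nh > 1) multi[$3]++; else uniq[$3]++
            seen[$3] = 1
        }
        END {
            for (c in seen)
                printf "%s\t%d\t%d\n", c, uniq[c], multi[c]
        }
    ' | sort -k3,3nr
}

# Example call (commented out so the helper can be sourced without a BAM):
# -f 0x40 keeps the first mate only, so each pair is counted once;
# -F 0x100 drops secondary alignments, so each read is counted at its primary locus.
# samtools view -f 0x40 -F 0x100 \
#     "${star_output_dir}/${sample_name2}_Aligned.sortedByCoord.out.bam" | multimap_by_chrom
```

Because only the primary alignment is counted, the table shows where STAR placed the best-scoring copy of each multi-mapped read, not every locus it could map to.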

Here are a few of the FastQC parameters that may be concerning for the MF samples:

[Screenshot of FastQC results for the MF samples; image not available]

STAR fastqc • 1.2k views

ADD COMMENT • link updated 14 months ago by Ram 45k • written 14 months ago by sophie • 0

Please see the following blog posts by the authors of FastQC, which should be informative:

https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/
https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/

> Nothing has helped the mapped ratio by anything more than 1-2%.

This is characteristic of your libraries/samples. No bioinformatics magic is going to change this result. Were these samples prepared at the same time with the same method? Hopefully no obvious batch effect (e.g. MF done on one day by one person) was involved.

ADD REPLY • link 14 months ago by GenoMax 153k

Thank you for your comment. We were unable to isolate these cell populations in parallel because myofibers are multinucleated and cannot go through FACS, which is how we isolated the FAPs. We are aware that the isolation conditions were suboptimal: isolation took about 45 min at room temperature (RT). RNA could have easily been compromised during that time. Is there a way I can verify that this is the issue?

ADD REPLY • link 14 months ago by sophie • 0

Experimental biology is not the easiest thing, so there is not much you can do there. You will want to add a batch variable and keep track of the differences when you analyze the data. At this point you will need to move forward with the analysis using what you have.

> RNA could have easily been compromised during that time. Is there a way I can verify that this is the issue?

This should have been apparent in the library QC, but since you have already completed the sequencing, checking it now would be going backwards.

ADD REPLY • link 14 months ago by GenoMax 153k