Question

Very low % of RNAseq reads mapped using STAR

20 months ago

Giuliana • 0

Hello, I'm new to bioinformatics, so this might be a dumb question. I'm trying to map a publicly available RNA-seq FASTQ (from 2014) to hg19 using STAR. The percentage of uniquely mapped reads is 0.2%. Could this be happening because of the format of the FASTQ file? The SRA page for this dataset says it's paired-end reads, but there's only one FASTQ file to download, with 202 bp long reads. Please see an example of the reads below. I'm also adding the output summary from the alignment. I've run this before for my own samples and they align perfectly. Is there a way to fix this?

Thank you SO much!!

Example of reads

@SRR111111.1 HWI-ST1220:65:C0MCWACXX:2:1101:1650:2407 length=202
GGAACACCTCCGCTNAATAGGCGTGGTTAGAGACGAAGAGGGACTCGCTGGCAGCAGCCCCAGCCTGACCGCTCGGAGTGTACTTTCCTTGACAGGCAAGGCCTCAATGCCATTAACAAGTGCCCCCTGCTGAAGCCCTGGGCCCTGACCTTCTCCTACGGCCGAGCCCTGCAGGCCTCTGCCCTGAAGGCCTGGGGCGGGA
+SRR111111.1 HWI-ST1220:65:C0MCWACXX:2:1101:1650:2407 length=202
@CCFDFFDHHFHGG#2AFBBHGIGHGHIGHBGE;G><GGHID;FFAABH=E?CEA@CCB?>;;ABC25>:8;3;>>008BC(:@>>C>CCA>CACB(<B##@@@FFFFFA>DHFHJJJJIFGIJGAHIEHGGIGI>HHHIJGEEFIEFFHIFHGGICHIGIIHB<ABDDDDDDCA?@>B>A@CDDA?A>>ABDBDBDBDDB5
@SRR111111.2 HWI-ST1220:65:C0MCWACXX:2:1101:1721:2430 length=202
ATCATCAGTAGGGTNAAACTAACCTGTCTCACGACGGTCTAACCCCAGCTCACGTTCCCTATTAGTGGGTGAACAATCCAACGCTTGGTGAATTCTGCTTCAAGCGTTCATAGCGACGTCGCTTTTTGATCCTTCGATGTCGGCTCTTCCTATCATTGTGAAGCAGAATTCACCAAGCGTTGGATTGTTCACCCACTAATAG
+SRR111111.2 HWI-ST1220:65:C0MCWACXX:2:1101:1721:2430 length=202
CCCFFFFFFHHHHG#2AFHIJIIJJJHGIIBGHGJGIEEHGIJJJJIIIJJGIIAEHHFFDFFDFEEDD>6;>>CACCC@CA@98?7?CBDDEECEEDACD@CCFFDDFHHHHGIIJJHIIJJJJIJIIIJJJJJIIJJJIIJJJIJJFHE=ACDDFFFDFFCEEEEDDDDDDDCDDDDBBDDDDCCDDCCCDDDBBDDDDD
@SRR111111.3 HWI-ST1220:65:C0MCWACXX:2:1101:1608:2458 length=202
GTTCTTAGTTGGTGNAGCGATTTGTCTGGTTAATTCCGATAACGAACGAGACTCTGGCATGCTAACTAGTTACGCGACCCCCGAGCGGTCGGCGAGATCGGCGCCGACCGCTCGGGGGTCGCGTAACTAGTTAGCATGCCAGAGTCTCGTTCGTTATCGGAATTAACCAGACAAATCGCTCCACCAACTAAGAACAGATCGG

OUTPUT FROM STAR

                      Number of input reads |   46256929
                  Average input read length |   202
                                UNIQUE READS:
               Uniquely mapped reads number |   92715
                    Uniquely mapped reads % |   0.20%
                      Average mapped length |   193.31
                   Number of splices: Total |   114909
        Number of splices: Annotated (sjdb) |   22204
                   Number of splices: GT/AG |   38543
                   Number of splices: GC/AG |   1475
                   Number of splices: AT/AC |   442
           Number of splices: Non-canonical |   74449
                  Mismatch rate per base, % |   1.72%
                     Deletion rate per base |   0.05%
                    Deletion average length |   5.05
                    Insertion rate per base |   0.04%
                   Insertion average length |   4.63
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   66687
         % of reads mapped to multiple loci |   0.14%
    Number of reads mapped to too many loci |   3137
         % of reads mapped to too many loci |   0.01%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   99.64%
                 % of reads unmapped: other |   0.01%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

alignment rna-seq STAR • 757 views

updated 20 months ago by GenoMax 141k • written 20 months ago by Giuliana • 0

It appears that you have obfuscated the actual SRR accession, so we can't tell you what the data for that accession should look like, but what lieven.sterck said below is likely what is happening. You will need to use the --split-files option when dumping reads from SRA.
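A minimal sketch of what that re-dump could look like, written as a small Python wrapper around the SRA Toolkit. The accession `SRR111111` is the placeholder from the question, and the output directory name `fastq` and the helper `build_dump_command` are made up for illustration. It assumes `fastq-dump` (or `fasterq-dump`, which also accepts `--split-files`) is installed and on the PATH.

```python
import shlex


def build_dump_command(accession, outdir="fastq", tool="fastq-dump"):
    """Return the argument list for dumping a paired-end SRA run as two FASTQ files.

    `build_dump_command` is a hypothetical helper; only the tool name and
    the flags come from the SRA Toolkit.
    """
    # --split-files writes each mate to its own file
    # (<accession>_1.fastq and <accession>_2.fastq) instead of one concatenated read.
    return [tool, "--split-files", "--outdir", outdir, accession]


# Placeholder accession taken from the question; replace with the real SRR number.
cmd = build_dump_command("SRR111111")
print(shlex.join(cmd))

# To actually run it (requires the SRA Toolkit):
# import subprocess
# subprocess.run(cmd, check=True)
```

The two resulting files can then be passed to STAR together as a pair rather than as a single-end input.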

20 months ago by GenoMax 141k

score 0 · Answer 1 · 2022-08-31

How did you obtain this input dataset?

The problem is likely that the SRA data were not converted to FASTQ correctly. As you noticed, the file has 202 bp long reads; these are actually the forward and the reverse read concatenated into one. A proper SRA-to-FASTQ conversion splits them back into the correct paired-end reads.
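To illustrate what "concatenated" means here, the sketch below cuts each 4-line FASTQ record into two mates. It assumes both mates have the same length (101 bp, i.e. half of the 202 bp in the question); that value is a guess from the read length, not something confirmed for this dataset, and the function name is made up. Re-dumping from SRA with the split option is the more reliable fix, because the true split points are stored with the run.

```python
import io


def split_concatenated_fastq(handle, mate_len=101):
    """Yield (mate1, mate2) FASTQ records from records that hold both mates back to back.

    Assumes plain 4-line records and equal-length mates (mate_len is an assumption).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, not needed
        qual = handle.readline().rstrip("\n")
        if len(seq) != 2 * mate_len or len(qual) != len(seq):
            raise ValueError(f"unexpected record length for {header}")
        name = header.split()[0]  # keep only the read ID, e.g. @SRR111111.1
        # The second half is kept exactly as stored; no reverse-complementing.
        mate1 = f"{name}/1\n{seq[:mate_len]}\n+\n{qual[:mate_len]}\n"
        mate2 = f"{name}/2\n{seq[mate_len:]}\n+\n{qual[mate_len:]}\n"
        yield mate1, mate2


# Toy 8 bp record standing in for a 202 bp one.
toy = "@r1 desc length=8\nACGTTTGA\n+r1 desc length=8\nIIIIHHHH\n"
pairs = list(split_concatenated_fastq(io.StringIO(toy), mate_len=4))
print(pairs[0][0], end="")
print(pairs[0][1], end="")
```

In the example reads above, the quality string restarts with a typical read-start pattern (`@@@FFFFF`, `CCFFDDFHHHH`) about halfway through, which is consistent with two mates sitting back to back.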

I can't recall at the moment which tool you need to use :/

The very low mapping rate you observe is also linked to this concatenation of reads. At best around 50% of the read length (only the "forward read" or the "reverse read" part) will be able to map, so most mappings are discarded because they're not long enough compared to the input "read" length. This matches your summary, where 99.64% of reads are reported as "unmapped: too short".
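The arithmetic behind that can be sketched as follows. To my knowledge STAR's default `--outFilterMatchNminOverLread` (and `--outFilterScoreMinOverLread`) is 0.66, meaning the aligned portion must cover at least 66% of the read length; treat that value, and the simplified filter below, as assumptions rather than an exact model of STAR.

```python
def passes_length_filter(matched_bases, read_length, min_fraction=0.66):
    """Simplified stand-in for STAR's 'too short' filter (min_fraction is assumed)."""
    return matched_bases >= min_fraction * read_length


read_length = 202               # concatenated "read" from the question
mate_length = read_length // 2  # 101, assuming equal-length mates

# Only one mate can align, so at most half of the concatenated read is covered.
print(mate_length / read_length)                           # 0.5
print(passes_length_filter(mate_length, read_length))      # False
# After splitting, a pair in which both mates align covers the full pair length.
print(passes_length_filter(2 * mate_length, read_length))  # True
```

Lowering the threshold would let more of these half-reads through, but splitting the mates properly addresses the actual cause.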