Question: Very low unique mapped reads - STAR alignment
0
gravatar for junsionglow
8 months ago by
junsionglow20
junsionglow20 wrote:

Hi all, while looking at my log.final.out file, it dawned on me that my uniquely mapped reads is really low. See below:

       Mapping speed, Million of reads per hour |       113.33

                          Number of input reads |       11050047
                      Average input read length |       300
                                    UNIQUE READS:
                   Uniquely mapped reads number |       24100
                        Uniquely mapped reads % |       0.22%
                          Average mapped length |       279.06
                       Number of splices: Total |       31927
            Number of splices: Annotated (sjdb) |       6179
                       Number of splices: GT/AG |       11720
                       Number of splices: GC/AG |       315
                       Number of splices: AT/AC |       119
               Number of splices: Non-canonical |       19773
                      Mismatch rate per base, % |       1.65%
                         Deletion rate per base |       0.02%
                        Deletion average length |       3.24
                        Insertion rate per base |       0.02%
                       Insertion average length |       2.50
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       16802
             % of reads mapped to multiple loci |       0.15%
        Number of reads mapped to too many loci |       630
             % of reads mapped to too many loci |       0.01%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       99.61%
                     % of reads unmapped: other |       0.02%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

I used STAR to make an index and subsequently aligned it. My command for STAR is as follow

STAR --genomeDir /home/user/scratch60/hg38_index \
    --runThreadN 6 \
    --readFilesIn /home/user/scratch60/SRR7059136.fastq \
    --outFileNamePrefix /home/user/scratch60/STARresults/SRR7059136  \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMunmapped Within \
    --outSAMattributes Standard

If some one could shed some light onto this, that would be much appreciated. Thanks!

rna-seq • 1.2k views
ADD COMMENTlink modified 8 months ago by h.mon24k • written 8 months ago by junsionglow20
1

have you done QC of fastq files? And are you sure that you are mapping to a correct reference genome?

ADD REPLYlink written 8 months ago by grant.hovhannisyan1.5k
1

"Reads too short" doesn't really mean that, it just means "didn't map".

As grant says, the first things to check are: is the fastq okay, or is it low quality garbage?

Second, are you sure this is the right reference?

ADD REPLYlink written 8 months ago by swbarnes25.3k

Thanks guys, these fastq files are from published GEO SRA https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3109339 and I obtained the Fastq via fastq-dump.

They used GENCODE GRCh38 v26 gene annotation, while I used these to make the index,

GCF_000001405.26_GRCh38_genomic.fna 
GCF_000001405.26_GRCh38_genomic.gff

but shouldn't make a difference?

ADD REPLYlink modified 8 months ago • written 8 months ago by junsionglow20
1

Could you naively try to blast a few unmapped reads?

ADD REPLYlink written 8 months ago by WouterDeCoster38k

yeah, they do belong to the right GRCh38 assembly... i'm puzzled!

ADD REPLYlink written 8 months ago by junsionglow20

% of reads unmapped: too short | 99.61%

You lose a lot of reads here...

ADD REPLYlink written 8 months ago by WouterDeCoster38k
3
gravatar for h.mon
8 months ago by
h.mon24k
Brazil
h.mon24k wrote:

Solution: Download the properly separated forward and reverse fastq files from ENA, or use fastq-dump --split-files. Downloading from ENA is faster and less error-prone. Then map the files as paired-end reads, not as single-end.

Diagnostic:

The page you linked says:

Data processing Paired-end reads were mapped to reference genome using STAR version 2.5.2b

How did you download the fastq files? You are mapping as single-end, but the original data is paired end:

SRR7059136 - Illumina HiSeq 4000 paired end sequencing

I suspect by not splitting the forward and reverse reads, the reads that originally overlaped now become artifact inverted duplicates. Look at the coordinates of this blast, made with the first read from the record you linked, but not split as it should.

 Homo sapiens clone LA13_165F6 sequence
Sequence ID: KC876030.1 Length: 36220 Number of Matches: 2
Related Information
Range 1: 22403 to 22552
Next Match
Previous Match
Alignment statistics for match #1 Score Expect  Identities  Gaps    Strand
260 bits(288)   1e-65   147/150(98%)    0/150(0%)   Plus/Plus

Query  1      TCTCAGGCTCCCTCTCCGGAATCGAACCCTGATTCCCCGTCACCCGTGGTCACCATGGTA  60
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  22403  TCTCAGGCTCCCTCTCCGGAATCGAACCCTGATTCCCCGTCACCCGTGGTCACCATGGTA  22462

Query  61     GGCACGGNGACTACCATCGAAAGTTGATAGGGCAGACGTTCGAATGGGTCGTCGCCGCCA  120
              ||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  22463  GGCACGGCGACTACCATCGAAAGTTGATAGGGCAGACGTTCGAATGGGTCGTCGCCGCCA  22522

Query  121    CGGGGGGCGTGCGNTCGGNCCGAGGTTATC  150
              ||||||||||||| |||| |||||||||||
Sbjct  22523  CGGGGGGCGTGCGATCGGCCCGAGGTTATC  22552


Range 2: 22408 to 22557
Next Match
Previous Match
First Match
Alignment statistics for match #2 Score Expect  Identities  Gaps    Strand
210 bits(232)   2e-50   134/150(89%)    0/150(0%)   Plus/Minus

Query  151    CTCTNGATAACCTCGGGCCGATNGCACGCCCCCNGTGGCGGCGACGACCCATTCGANCGA  210
              |||| ||||||||||||||||| |||||||||| |||||||||||||||||||||| || 
Sbjct  22557  CTCTAGATAACCTCGGGCCGATCGCACGCCCCCCGTGGCGGCGACGACCCATTCGAACGT  22498

Query  211    NTNCCCTNTCAACNNTNGATNGTAGNCGCCGTGCCTACCATGGTGACCACGGGAGACGGG  270
               | |||| |||||  | ||| |||| ||||||||||||||||||||||||||| ||||||
Sbjct  22497  CTGCCCTATCAACTTTCGATGGTAGTCGCCGTGCCTACCATGGTGACCACGGGTGACGGG  22438

Query  271    GAATCAGGGTTCGATTCNGGAGAGGGAACC  300
              ||||||||||||||||| ||||||||| ||
Sbjct  22437  GAATCAGGGTTCGATTCCGGAGAGGGAGCC  22408
ADD COMMENTlink modified 8 months ago • written 8 months ago by h.mon24k

thanks!! I will give it a try and keep you updated. Thanks again!

ADD REPLYlink written 8 months ago by junsionglow20

Hi h.mon, it worked wonderfully, now I have over 80% unique maps! thanks again!

ADD REPLYlink written 8 months ago by junsionglow20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1026 users visited in the last hour