Question: Why Am I Getting So Many Unmapped Reads In Star, Classified As "Too Short"?
5
gravatar for carmeyeii89
4.4 years ago by
carmeyeii89140
UNAM,MX
carmeyeii89140 wrote:

I am currently using STAR to map several Hi-SEQ mRNA runs.

I'm having trouble getting a decent amount of reads to map, but I don't really understand why. I'm hoping you can shed some light :)

In the final log, only about 50% (or less) of the reads map to the reference. I'm using a GTF in addition to the genome.

The unmapped bin that most of the reads fall into is "too short" . What parameter I might be mis-specifying? These are PE 101 Illumina reads, the quality is pretty good up until the ~85th base, and we have around 200M reads per sample.

What may be causing the unmapped reads: too short set of reads to be so large, I'd really appreciate it.

Thanks! Carmen

My command like is like so:

# $1 = READ1 fq file
# $2 = READ2 fq file
# $3 = PREFIX for Output Files [*.BAM]

/path/to/STAR --genomeDir /path/to/Fly/ --readFilesCommand 'zcat -fc' --readFilesIn $1 $2 --runThreadN 32 --genomeLoad LoadAndRemove --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 2 --outSAMstrandField None --outSAMmode Full --outSAMattributes Standard --outSAMunmapped None --outFilterType BySJout --outStd SAM | samtools view -b -o $3_STAR.bam -S -

And all Final Logs look something like this:

./Log.final.out



Started job on |    May 20 22:49:23
                         Started mapping on |    May 20 22:52:08
                                Finished on |    May 21 05:18:10
   Mapping speed, Million of reads per hour |    32.74

                      Number of input reads |    210640950
                  Average input read length |    202
                                UNIQUE READS:
               Uniquely mapped reads number |    29841188
                    Uniquely mapped reads % |    14.17%
                      Average mapped length |    190.07
                   Number of splices: Total |    4405621
        Number of splices: Annotated (sjdb) |    4123661
                   Number of splices: GT/AG |    4348429
                   Number of splices: GC/AG |    25290
                   Number of splices: AT/AC |    664
           Number of splices: Non-canonical |    31238
                  Mismatch rate per base, % |    1.23%
                     Deletion rate per base |    0.03%
                    Deletion average length |    1.92
                    Insertion rate per base |    0.02%
                   Insertion average length |    2.41
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |    36646260
         % of reads mapped to multiple loci |    17.40%
    Number of reads mapped to too many loci |    229494
         % of reads mapped to too many loci |    0.11%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |    0.00%
             % of reads unmapped: too short |    58.84%
                 % of reads unmapped: other |    9.49%

This is the command line I used to build the STAR index.

/path/to/STAR_2.3.0e/STAR --runMode genomeGenerate --genomeDir /path/to/Genomes/Fly/ --genomeFastaFiles /path/to/genomes/fly/dm3_genome.fa --runThreadN 16  --sjdbGTFfile /path/to/dm3_refGene_2011_02_15.gtf --sjdbGTFtagExonParentTranscript transcript_id --sjdbOverhang 100
star rna-seq mapping rnaseq • 6.1k views
ADD COMMENTlink modified 2.3 years ago by cyril-cros800 • written 4.4 years ago by carmeyeii89140

When people have problems with a specific tool the best way is to ask on the tool developer's support page. They are the ones that understand subtle details about the tool

ADD REPLYlink written 4.4 years ago by Istvan Albert ♦♦ 74k

This is of course way too late but I guess that

--sjdbOverhang 100 should be 
--sjdbOverhang 201

for paired end reads this is  mate length (2*101) - 1, as recommended in the manual.

STAR's mapping rate is sensitive for this parameter which means that  runs of genom generate are required for each read- length.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Michael Dondrup43k
3

according to the STAR manual 2.15a available here: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf:

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value.

So in this user's case --sjdbOverhang 100 is fine.

ADD REPLYlink written 20 months ago by JHumphrey30
1
gravatar for cyril-cros
2.3 years ago by
cyril-cros800
France
cyril-cros800 wrote:

Try to clip your reads to the 85th base using seqtk and run STAR again (STAR + 32 Core => it will be really fast). No need to sort your reads. Quite often, the end quality of RNASea reads are not so good and they get rejected due to mismatch with the genomic sequences.

ADD COMMENTlink written 2.3 years ago by cyril-cros800
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1308 users visited in the last hour