Why Am I Getting So Many Unmapped Reads In Star, Classified As "Too Short"?
2
6
Entering edit mode
7.9 years ago
gaelgarcia05 ▴ 230

I am currently using STAR to map several Hi-SEQ mRNA runs.

I'm having trouble getting a decent amount of reads to map, but I don't really understand why. I'm hoping you can shed some light :)

In the final log, only about 50% (or less) of the reads map to the reference. I'm using a GTF in addition to the genome.

The unmapped bin that most of the reads fall into is "too short" . What parameter I might be mis-specifying? These are PE 101 Illumina reads, the quality is pretty good up until the ~85th base, and we have around 200M reads per sample.

What may be causing the unmapped reads: too short set of reads to be so large, I'd really appreciate it.

Thanks! Carmen

My command like is like so:

# $1 = READ1 fq file #$2 = READ2 fq file
# $3 = PREFIX for Output Files [*.BAM] /path/to/STAR --genomeDir /path/to/Fly/ --readFilesCommand 'zcat -fc' --readFilesIn$1 $2 --runThreadN 32 --genomeLoad LoadAndRemove --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 2 --outSAMstrandField None --outSAMmode Full --outSAMattributes Standard --outSAMunmapped None --outFilterType BySJout --outStd SAM | samtools view -b -o$3_STAR.bam -S -


And all Final Logs look something like this:

./Log.final.out

Started job on |    May 20 22:49:23
Started mapping on |    May 20 22:52:08
Finished on |    May 21 05:18:10
Mapping speed, Million of reads per hour |    32.74

Number of input reads |    210640950
Average input read length |    202
Uniquely mapped reads number |    29841188
Uniquely mapped reads % |    14.17%
Average mapped length |    190.07
Number of splices: Total |    4405621
Number of splices: Annotated (sjdb) |    4123661
Number of splices: GT/AG |    4348429
Number of splices: GC/AG |    25290
Number of splices: AT/AC |    664
Number of splices: Non-canonical |    31238
Mismatch rate per base, % |    1.23%
Deletion rate per base |    0.03%
Deletion average length |    1.92
Insertion rate per base |    0.02%
Insertion average length |    2.41
Number of reads mapped to multiple loci |    36646260
% of reads mapped to multiple loci |    17.40%
Number of reads mapped to too many loci |    229494
% of reads mapped to too many loci |    0.11%
% of reads unmapped: too many mismatches |    0.00%
% of reads unmapped: too short |    58.84%
% of reads unmapped: other |    9.49%


This is the command line I used to build the STAR index.

/path/to/STAR_2.3.0e/STAR --runMode genomeGenerate --genomeDir /path/to/Genomes/Fly/ --genomeFastaFiles /path/to/genomes/fly/dm3_genome.fa --runThreadN 16  --sjdbGTFfile /path/to/dm3_refGene_2011_02_15.gtf --sjdbGTFtagExonParentTranscript transcript_id --sjdbOverhang 100

mapping rnaseq rna-seq STAR • 10k views
0
Entering edit mode

When people have problems with a specific tool the best way is to ask on the tool developer's support page. They are the ones that understand subtle details about the tool

0
Entering edit mode

This is of course way too late but I guess that

--sjdbOverhang 100 should be 
--sjdbOverhang 201

for paired end reads this is  mate length (2*101) - 1, as recommended in the manual.

STAR's mapping rate is sensitive for this parameter which means that  runs of genom generate are required for each read- length.

3
Entering edit mode

according to the STAR manual 2.15a available here: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf:

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value.

So in this user's case --sjdbOverhang 100 is fine.

1
Entering edit mode
5.8 years ago
cyril-cros ▴ 910

Try to clip your reads to the 85th base using seqtk and run STAR again (STAR + 32 Core => it will be really fast). No need to sort your reads. Quite often, the end quality of RNASea reads are not so good and they get rejected due to mismatch with the genomic sequences.

0
Entering edit mode
3.3 years ago

This issue on STAR offers some suggestions to improve mapping of reads.

Traffic: 1667 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.