I'm trying to align mouse RNA-seq reads, but I'm running into the 'too short' problem with STAR: essentially all of the reads are being filtered out for this reason. The confusing part is that both read 1 and read 2 seem to map just fine if I map them separately as single-end reads.
Here is the command for mapping the paired reads:
STAR \
--runMode alignReads \
--genomeDir $STAR_index \
--readFilesIn $scratch/${sample}_tmp/${sample}_R1.fastq.gz $scratch/${sample}_tmp/${sample}_R2.fastq.gz \
--readFilesCommand zcat \
--runThreadN $THREADS \
--outFileNamePrefix $BASEDIR/${sample}/${genome}/STAR/STAR_alignment/${sample}_ \
--outReadsUnmapped Fastx \
--outSAMtype BAM SortedByCoordinate \
&> $BASEDIR/${sample}/${genome}/logs/${sample}_Star.log
And here's the log file:
Number of input reads | 51150550
Average input read length | 202
UNIQUE READS:
Uniquely mapped reads number | 3907
Uniquely mapped reads % | 0.01%
Average mapped length | 175.69
Number of splices: Total | 690
Number of splices: Annotated (sjdb) | 2
Number of splices: GT/AG | 515
Number of splices: GC/AG | 61
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 114
Mismatch rate per base, % | 5.56%
Deletion rate per base | 0.03%
Deletion average length | 1.91
Insertion rate per base | 0.01%
Insertion average length | 1.84
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 126222
% of reads mapped to multiple loci | 0.25%
Number of reads mapped to too many loci | 6978
% of reads mapped to too many loci | 0.01%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 51012111
Here is the command for the single end mapping:
STAR \
--runMode alignReads \
--genomeDir $STAR_index \
--readFilesIn $scratch/${sample}_tmp/${sample}_R1.fastq.gz \
--readFilesCommand zcat \
--runThreadN $THREADS \
--outFileNamePrefix $BASEDIR/${sample}/${genome}/STAR/STAR_alignment/${sample}_ \
--outReadsUnmapped Fastx \
--outSAMtype BAM SortedByCoordinate \
&> $BASEDIR/${sample}/${genome}/logs/${sample}_Star.log
and the accompanying log file:
Number of input reads | 51150550
Average input read length | 101
UNIQUE READS:
Uniquely mapped reads number | 44108161
Uniquely mapped reads % | 86.23%
Average mapped length | 100.16
Number of splices: Total | 19083172
Number of splices: Annotated (sjdb) | 18953752
Number of splices: GT/AG | 18956436
Number of splices: GC/AG | 96098
Number of splices: AT/AC | 11062
Number of splices: Non-canonical | 19576
Mismatch rate per base, % | 0.21%
Deletion rate per base | 0.01%
Deletion average length | 1.33
Insertion rate per base | 0.01%
Insertion average length | 1.24
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 5988911
% of reads mapped to multiple loci | 11.71%
Number of reads mapped to too many loci | 280004
% of reads mapped to too many loci | 0.55%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 749975
% of reads unmapped: too short | 1.47%
Number of reads unmapped: other | 23499
% of reads unmapped: other | 0.05%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Does anyone know why the paired-end mapping seems to think all the reads are too short (i.e. more than 1/3 of the read does not map)? Thanks for the help.
Have you seen this answer: "Long Read Length, yet STAR says many reads too short"?
Yes, I've seen that one. FastQC suggests each mate is 100 bp. I realize it's not necessarily that the reads are short; it's that more than 1/3 of the total read is not mapping, the default threshold being 0.66. What I can't explain is why read 1 and read 2 map just fine on their own, but not as paired-end reads.
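To put a number on that 0.66 threshold: for paired-end input, STAR normalizes `--outFilterMatchNminOverLread` and `--outFilterScoreMinOverLread` to the sum of both mates' lengths. A rough calculation using the 202 from the paired-end log above (`min_matched` is just a hypothetical helper name for illustration):

```shell
# Values taken from the paired-end log above; 0.66 is STAR's default for
# --outFilterMatchNminOverLread and --outFilterScoreMinOverLread.
pair_length=202     # "Average input read length" (both mates combined)
min_fraction=0.66

# Hypothetical helper: matched bases a read (or pair) needs to pass the filter
min_matched() {
    awk -v len="$1" -v frac="$2" 'BEGIN { printf "%.2f\n", len * frac }'
}

min_matched "$pair_length" "$min_fraction"   # prints 133.32
```

So a pair needs roughly 133 matched bases, which is more than a single ~101-base mate can contribute. If only one mate of each pair aligns at a given locus, the pair would be reported as "too short" even though each mate maps fine on its own.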
Did you trim the reads independently by any chance? Perhaps your R1/R2 files are out of sync. You can try using repair.sh in that case to re-sync and remove any singletons.
I don't generally trim reads when aligning with STAR. I haven't tried repair.sh though. I'll see how that goes and report back. Thank you.
If you did not trim then the reads would not be out of sync. There must be some other reason.
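Whether the two files are in sync can also be checked directly instead of inferred from the trimming history. Below is a minimal sketch, assuming standard 4-line FASTQ records whose headers carry the same read ID in both files (optionally followed by `/1` or `/2`, or by a space and extra fields). `read_ids` and `check_pair_sync` are hypothetical helper names, not part of STAR or BBTools.

```shell
# Hypothetical helper: print the read ID from each FASTQ header line
# (line 1 of every 4-line record), dropping a trailing /1 or /2 mate suffix.
# gzip -cdf decompresses .gz input and passes plain text through unchanged.
read_ids() {
    gzip -cdf "$1" | awk 'NR % 4 == 1 { id = $1; sub(/\/[12]$/, "", id); print id }'
}

# Hypothetical helper: compare IDs record by record and report the first
# record where the two files disagree, or "in sync" if none do.
check_pair_sync() {
    ids2=$(mktemp)
    read_ids "$2" > "$ids2"
    read_ids "$1" | paste - "$ids2" | awk -F '\t' '
        $1 != $2 { printf "mismatch at record %d: %s vs %s\n", NR, $1, $2; bad = 1; exit }
        END      { if (!bad) print "in sync" }'
    rm -f "$ids2"
}

# Example call, using the paths from the STAR command in the question:
# check_pair_sync "$scratch/${sample}_tmp/${sample}_R1.fastq.gz" \
#                 "$scratch/${sample}_tmp/${sample}_R2.fastq.gz"
```

A mismatch at an early record would point to unsynchronized files (for which repair.sh was suggested above); "in sync" would mean the cause lies elsewhere. Files of different lengths also show up as a mismatch, since one side of the comparison runs out of IDs first.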