Question

Trouble with 'too short' reads using STAR: each mate maps fine on its own, but the paired-end mapping does not

4.3 years ago

james.alan.gregory ▴ 20

I'm trying to align mouse RNA-seq reads, but I'm running into the 'too short' problem with STAR: nearly all of the reads are being filtered out for this reason. The confusing part is that both read 1 and read 2 seem to map just fine if I map them separately as single-end reads.

Here is the command for mapping the paired reads:

STAR \
    --runMode alignReads \
    --genomeDir $STAR_index \
    --readFilesIn $scratch/${sample}_tmp/${sample}_R1.fastq.gz $scratch/${sample}_tmp/${sample}_R2.fastq.gz \
    --readFilesCommand zcat \
    --runThreadN $THREADS \
    --outFileNamePrefix $BASEDIR/${sample}/${genome}/STAR/STAR_alignment/${sample}_ \
    --outReadsUnmapped Fastx \
    --outSAMtype BAM SortedByCoordinate \
    &> $BASEDIR/${sample}/${genome}/logs/${sample}_Star.log

And here's the log file:

Number of input reads |       51150550
                      Average input read length |       202
                                    UNIQUE READS:
                   Uniquely mapped reads number |       3907
                        Uniquely mapped reads % |       0.01%
                          Average mapped length |       175.69
                       Number of splices: Total |       690
            Number of splices: Annotated (sjdb) |       2
                       Number of splices: GT/AG |       515
                       Number of splices: GC/AG |       61
                       Number of splices: AT/AC |       0
               Number of splices: Non-canonical |       114
                      Mismatch rate per base, % |       5.56%
                         Deletion rate per base |       0.03%
                        Deletion average length |       1.91
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.84
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       126222
             % of reads mapped to multiple loci |       0.25%
        Number of reads mapped to too many loci |       6978
             % of reads mapped to too many loci |       0.01%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       0
       % of reads unmapped: too many mismatches |       0.00%
            Number of reads unmapped: too short |       51012111

Here is the command for the single end mapping:

STAR \
        --runMode alignReads \
        --genomeDir $STAR_index \
        --readFilesIn $scratch/${sample}_tmp/${sample}_R1.fastq.gz \
        --readFilesCommand zcat \
        --runThreadN $THREADS \
        --outFileNamePrefix $BASEDIR/${sample}/${genome}/STAR/STAR_alignment/${sample}_ \
        --outReadsUnmapped Fastx \
        --outSAMtype BAM SortedByCoordinate \
        &> $BASEDIR/${sample}/${genome}/logs/${sample}_Star.log

and the accompanying log file:

Number of input reads |       51150550
                      Average input read length |       101
                                    UNIQUE READS:
                   Uniquely mapped reads number |       44108161
                        Uniquely mapped reads % |       86.23%
                          Average mapped length |       100.16
                       Number of splices: Total |       19083172
            Number of splices: Annotated (sjdb) |       18953752
                       Number of splices: GT/AG |       18956436
                       Number of splices: GC/AG |       96098
                       Number of splices: AT/AC |       11062
               Number of splices: Non-canonical |       19576
                      Mismatch rate per base, % |       0.21%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.33
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.24
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       5988911
             % of reads mapped to multiple loci |       11.71%
        Number of reads mapped to too many loci |       280004
             % of reads mapped to too many loci |       0.55%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       0
       % of reads unmapped: too many mismatches |       0.00%
            Number of reads unmapped: too short |       749975
                 % of reads unmapped: too short |       1.47%
                Number of reads unmapped: other |       23499
                     % of reads unmapped: other |       0.05%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

Does anyone know why the paired-end mapping seems to think all the reads are too short (i.e. more than 1/3 of the read does not map)? Thanks for the help.

RNA-Seq alignment • 3.5k views

updated 4.1 years ago by swbarnes2 15k • written 4.3 years ago by james.alan.gregory ▴ 20

Have you seen: A: Long Read Length, yet STAR says many reads too short

4.3 years ago by GenoMax 152k

Yes, I've seen that one. FastQC suggests each mate is 100 bp. And I realize the issue isn't necessarily that the reads are short; it's that more than 1/3 of the total read is not mapping, the default threshold being 0.66. I can't explain why read 1 and read 2 map just fine on their own but fail to map as paired-end reads.
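To make the 0.66 threshold concrete: as far as I understand the STAR manual, `--outFilterScoreMinOverLread` and `--outFilterMatchNminOverLread` (both default 0.66) are normalized to the read length, and for paired-end input that length is the sum of both mates. The sketch below only does the arithmetic; the helper name `min_aligned_bases` is made up for illustration, and STAR's exact rounding is not modeled.

```shell
# min_aligned_bases TOTAL_READ_LENGTH [FRACTION]
# Hypothetical helper: prints FRACTION * TOTAL_READ_LENGTH, i.e. roughly how
# many bases must align before STAR stops calling a read "too short".
# FRACTION defaults to 0.66, the default of the two *OverLread filters.
min_aligned_bases() {
    awk -v len="$1" -v frac="${2:-0.66}" 'BEGIN { printf "%.2f\n", len * frac }'
}

min_aligned_bases 101   # single-end run: one 101 bp mate
min_aligned_bases 202   # paired-end run: 101 + 101 bp
```

This prints 66.66 and 133.32. So if STAR can only place one mate of a pair (about 100 aligned bases out of 202), the pair falls below the threshold and is reported as "too short", even though the same mate passes easily in a single-end run.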

4.3 years ago by james.alan.gregory ▴ 20

Did you trim the reads independently by any chance? Perhaps your R1/R2 files are out of sync. You can try using repair.sh in that case to re-sync and remove any singletons.
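A quick way to test the out-of-sync hypothesis before running any repair tool is to compare read names between the two files. This is a minimal sketch, not a replacement for repair.sh: the function names are made up, it assumes gzipped four-line FASTQ records, and it only looks at the first N records (default 100000).

```shell
# fastq_names FILE.fastq.gz [N]
# Print the read names of the first N records (default 100000), with the
# leading "@", any trailing "/1" or "/2", and anything after the first
# whitespace removed so that mate 1 and mate 2 names become comparable.
fastq_names() {
    gzip -dc "$1" | head -n "$(( ${2:-100000} * 4 ))" |
        awk 'NR % 4 == 1 { sub(/^@/, ""); sub(/[ \t].*$/, ""); sub(/\/[12]$/, ""); print }'
}

# check_pair_sync R1.fastq.gz R2.fastq.gz [N]
# Prints "in sync" and returns 0 if the two name lists are identical,
# otherwise prints "out of sync" and returns 1.
check_pair_sync() {
    tmp1=$(mktemp)
    tmp2=$(mktemp)
    fastq_names "$1" "$3" > "$tmp1"
    fastq_names "$2" "$3" > "$tmp2"
    if cmp -s "$tmp1" "$tmp2"; then
        status=0
        echo "in sync"
    else
        status=1
        echo "out of sync"
    fi
    rm -f "$tmp1" "$tmp2"
    return "$status"
}

# Example call (paths taken from the question's command):
# check_pair_sync "$scratch/${sample}_tmp/${sample}_R1.fastq.gz" \
#                 "$scratch/${sample}_tmp/${sample}_R2.fastq.gz"
```

If the names match record for record, mis-paired files are unlikely to be the cause and the problem lies elsewhere.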

4.3 years ago by GenoMax 152k

I don't generally trim reads when aligning with STAR. I haven't tried repair.sh, though. I'll see how that goes and report back. Thank you.

4.3 years ago by james.alan.gregory ▴ 20

If you did not trim then the reads would not be out of sync. There must be some other reason.

4.3 years ago by GenoMax 152k

score 0 · Answer 1 · 2021-06-15

Hi,

I ran into the same problem recently, where my paired-end run (2 x 41bp) was aligning < 1% but each mate separately was aligning 85%+.

It turns out my reads were reverse-complement, and after fixing this, I was able to get 90%+ alignment on my paired-end reads (slightly higher than each mate alone). I have Illumina data, so I was able to flip the "ReverseComplement" setting in the sample sheet, but bcl2fastq has an option to reverse complement as well.

I had always ignored forward vs. reverse complement, since the other alignment tools I use (bowtie2, kallisto) don't seem to care. But it made a difference for me in STAR.

Hope this helps!
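If regenerating the FASTQ files is not convenient, the orientation hypothesis can be tested on a small subsample by reverse-complementing one mate and re-running STAR. The following is only a sketch under assumptions: the function name is made up, it expects plain four-line FASTQ on stdin, and which mate (if any) needs flipping in a given dataset is not known from this thread.

```shell
# revcomp_fastq: read four-line FASTQ on stdin, write it to stdout with each
# sequence reverse-complemented and each quality string reversed.
# Bases other than A/C/G/T (e.g. N) are left unchanged.
revcomp_fastq() {
    awk '
    # Reverse-complement a nucleotide string.
    function revcomp(s,    i, c, p, out) {
        out = ""
        for (i = length(s); i > 0; i--) {
            c = substr(s, i, 1)
            p = index("ACGTacgt", c)
            out = out (p ? substr("TGCAtgca", p, 1) : c)
        }
        return out
    }
    # Reverse a string (used for the quality line).
    function reverse(s,    i, out) {
        out = ""
        for (i = length(s); i > 0; i--) out = out substr(s, i, 1)
        return out
    }
    NR % 4 == 2 { print revcomp($0); next }   # sequence line
    NR % 4 == 0 { print reverse($0); next }   # quality line
    { print }                                 # header and "+" lines
    '
}

# Example: flip the first 100000 records of mate 2 (file names are placeholders).
# gzip -dc sample_R2.fastq.gz | head -n 400000 | revcomp_fastq | gzip -c > sub_R2.rc.fastq.gz
```

Pair the flipped subsample with the same first 100000 records of the other mate; if the mapping rate jumps, fixing the orientation upstream (as described above) is the cleaner long-term solution.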

score 0 · Answer 2 · 2021-06-15

4.1 years ago

swbarnes2 15k

Try separating the fastqs with a comma. Then try a comma and a space.
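For reference, and as far as I know from the STAR manual, `--readFilesIn` expects mate 1 and mate 2 as two space-separated arguments, while commas join multiple files that belong to the same mate (for example, several lanes). The command in the question already uses the space-separated form. The sketch below just builds and prints such an argument list; the lane file names and the helper `join_by_comma` are made up for illustration.

```shell
# join_by_comma ARG...: print the arguments joined with commas.
join_by_comma() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Hypothetical two-lane sample: all mate-1 files go in the first list and
# all mate-2 files in the second, in the same lane order.
r1_list=$(join_by_comma lane1_R1.fastq.gz lane2_R1.fastq.gz)
r2_list=$(join_by_comma lane1_R2.fastq.gz lane2_R2.fastq.gz)

# Print the resulting option rather than running STAR.
echo "--readFilesIn $r1_list $r2_list"
```

Under that convention, putting a comma between R1 and R2 would make STAR treat both files as mate 1 of a single-end run, so it is worth checking the input read length reported in the log after trying it.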

4.1 years ago by swbarnes2 15k