Question

Alignment stats summary by STAR aligner, please help!

2

Entering edit mode

7.3 years ago

Jackie ▴ 70

Hi,

I am using STAR to align RNA-seq data, and I would like to make a QC metrics report out of the STAR logs. I want to do something like below:

Total Aligned Reads for read1 and read 2 (% Reads)

Abundant Reads for read1 and read 2 (% Reads)

Unaligned Reads for read1 and read 2 (% Reads)

Reads with spliced alignment for read1 and read 2 (% Aligned Reads)

Reads aligned at multiple loci for read1 and read 2(% Aligned Reads)

2 Questions: 1) Seems in the *Log.final.out log file, it does not differentiate between read1 and read2. Is there any way I can get the information for read1 and read2, respectively from STAR output/log files? 2) In the *Log.final.out file, there are fields like 'Number of splices: GT/AG', 'Number of splices: GC/AG', 'Number of splices: AT/AC', do the corresponding numbers indicate reads count? Do you think these numbers can be used to calculate 'Reads with spliced alignment'?

Thanks!

STAR RNA-seq QC alignment • 6.0k views

ADD COMMENT • link 7.3 years ago by Jackie ▴ 70

score 1 · Answer 1 · 2017-01-11

I don't know how to get end-1 vs end-2 alignment counts from STAR logs, I hope someone out there does! In the meantime, the manual says this (emphasis mine):

The statistics are calculated for each read (single- or paired-end) and then summed or averaged over all reads. Note that STAR counts a paired-end read as one read, (unlike the samtools flagstat/idxstats, which count each mate separately).

For splices I am also confused, here is an example I was just working on from Drosophila using 300bp paired-end MiSeq reads. Log.final.out shows this:

           Number of splices: Total |   362050
Number of splices: Annotated (sjdb) |   347797
           Number of splices: GT/AG |   350494
           Number of splices: GC/AG |   1618
           Number of splices: AT/AC |   92
   Number of splices: Non-canonical |   9846

Lines 3-6 sum to line 1, but that is where the math ceases to work. The manual goes on to say (emphasis mine),

Most of the information is collected about the UNIQUE mappers (unlike samtools flagstat/idxstats which does not separate unique or multi-mappers). Each splicing is counted in the numbers of splices, which would correspond to summing the counts in SJ.out.tab. The mismatch/indel error rates are calculated on a per base basis, i.e. as total number of mismatches/indels in all unique mappers divided by the total number of mapped bases.

I have tried various approaches to counting the values in columns 7 and 8 of SJ.out.tab -- they do not add up to 362050, or 350494, or anything like them. I have also tried a wide variety of approaches to regenerate the splicing numbers in the Log.final.out from Aligned.out.bam, and cannot do it. The very closest I could get was taking primary alignments from unique mappers and counting the total number of splices in the cigars:

samtools view -F 256 Aligned.out.bam | grep "NH:i:1" | cut -f6 | awk '{ while (sub(/N/,"")) j++ } END { print j }'

Which gets 362267 -- not 362050, but very close, closer than the other dozen approaches I tried. Mind you, this is the number of recorded splices, NOT the number of reads. For the reads,

samtools view -F 256 Aligned.out.bam | grep "NH:i:1" | cut -f6 | grep N | wc -l

Which is only 292903.

I hope this provides some insight -- and I hope someone else can give a better answer!