Question: Alignment stats summary by STAR aligner, please help!
1
gravatar for Jackie
13 days ago by
Jackie30
United States
Jackie30 wrote:

Hi,

I am using STAR to align RNA-seq data, and I would like to make a QC metrics report out of the STAR logs. I want to do something like below:

Total Aligned Reads for read1 and read 2 (% Reads)

Abundant Reads for read1 and read 2 (% Reads)

Unaligned Reads for read1 and read 2 (% Reads)

Reads with spliced alignment for read1 and read 2 (% Aligned Reads)

Reads aligned at multiple loci for read1 and read 2(% Aligned Reads)

2 Questions: 1) Seems in the *Log.final.out log file, it does not differentiate between read1 and read2. Is there any way I can get the information for read1 and read2, respectively from STAR output/log files? 2) In the *Log.final.out file, there are fields like 'Number of splices: GT/AG', 'Number of splices: GC/AG', 'Number of splices: AT/AC', do the corresponding numbers indicate reads count? Do you think these numbers can be used to calculate 'Reads with spliced alignment'?

Thanks!

rna-seq star qc alignment • 144 views
ADD COMMENTlink modified 13 days ago • written 13 days ago by Jackie30
0
gravatar for apa@stowers
13 days ago by
apa@stowers320
Kansas City
apa@stowers320 wrote:

I don't know how to get end-1 vs end-2 alignment counts from STAR logs, I hope someone out there does! In the meantime, the manual says this (emphasis mine):

The statistics are calculated for each read (single- or paired-end) and then summed or averaged over all reads. Note that STAR counts a paired-end read as one read, (unlike the samtools flagstat/idxstats, which count each mate separately).

For splices I am also confused, here is an example I was just working on from Drosophila using 300bp paired-end MiSeq reads. Log.final.out shows this:

           Number of splices: Total |   362050
Number of splices: Annotated (sjdb) |   347797
           Number of splices: GT/AG |   350494
           Number of splices: GC/AG |   1618
           Number of splices: AT/AC |   92
   Number of splices: Non-canonical |   9846

Lines 3-6 sum to line 1, but that is where the math ceases to work. The manual goes on to say (emphasis mine),

Most of the information is collected about the UNIQUE mappers (unlike samtools flagstat/idxstats which does not separate unique or multi-mappers). Each splicing is counted in the numbers of splices, which would correspond to summing the counts in SJ.out.tab. The mismatch/indel error rates are calculated on a per base basis, i.e. as total number of mismatches/indels in all unique mappers divided by the total number of mapped bases.

I have tried various approaches to counting the values in columns 7 and 8 of SJ.out.tab -- they do not add up to 362050, or 350494, or anything like them. I have also tried a wide variety of approaches to regenerate the splicing numbers in the Log.final.out from Aligned.out.bam, and cannot do it. The very closest I could get was taking primary alignments from unique mappers and counting the total number of splices in the cigars:

samtools view -F 256 Aligned.out.bam | grep "NH:i:1" | cut -f6 | awk '{ while (sub(/N/,"")) j++ } END { print j }'

Which gets 362267 -- not 362050, but very close, closer than the other dozen approaches I tried. Mind you, this is the number of recorded splices, NOT the number of reads. For the reads,

samtools view -F 256 Aligned.out.bam | grep "NH:i:1" | cut -f6 | grep N | wc -l

Which is only 292903.

I hope this provides some insight -- and I hope someone else can give a better answer!

ADD COMMENTlink modified 13 days ago • written 13 days ago by apa@stowers320

Thanks apa@stowers for the very detailed response, that helps.

Yes, I found the spliced alignment particularly confusing. And, I also tried samtools flagstat for stats on read1/read2 separately, seems it would not give you the number same as in STAR log output. I.e., the total number of reads in each of the two fastqs is 2,679,927, and what samtools flagstat shows is 2746705 for read1 and 2746088 for read2, which are bigger than the total read number :(, I guess that might be largely due to samtools does not differentiate uniquely mapped vs. multi-mapped reads? or something else?

Thanks again apa@stowers, and if someone else has such experience, please advise.

ADD REPLYlink written 12 days ago by Jackie30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1058 users visited in the last hour