2
2
Entering edit mode
6.1 years ago
Jackie ▴ 70

Hi,

I am using STAR to align RNA-seq data, and I would like to make a QC metrics report out of the STAR logs. I want to do something like below:

2 Questions: 1) Seems in the *Log.final.out log file, it does not differentiate between read1 and read2. Is there any way I can get the information for read1 and read2, respectively from STAR output/log files? 2) In the *Log.final.out file, there are fields like 'Number of splices: GT/AG', 'Number of splices: GC/AG', 'Number of splices: AT/AC', do the corresponding numbers indicate reads count? Do you think these numbers can be used to calculate 'Reads with spliced alignment'?

Thanks!

STAR RNA-seq QC alignment • 5.2k views
1
Entering edit mode
6.1 years ago
apa@stowers ▴ 580

I don't know how to get end-1 vs end-2 alignment counts from STAR logs, I hope someone out there does! In the meantime, the manual says this (emphasis mine):

The statistics are calculated for each read (single- or paired-end) and then summed or averaged over all reads. Note that STAR counts a paired-end read as one read, (unlike the samtools flagstat/idxstats, which count each mate separately).

For splices I am also confused, here is an example I was just working on from Drosophila using 300bp paired-end MiSeq reads. Log.final.out shows this:

           Number of splices: Total |   362050
Number of splices: Annotated (sjdb) |   347797
Number of splices: GT/AG |   350494
Number of splices: GC/AG |   1618
Number of splices: AT/AC |   92
Number of splices: Non-canonical |   9846


Lines 3-6 sum to line 1, but that is where the math ceases to work. The manual goes on to say (emphasis mine),

Most of the information is collected about the UNIQUE mappers (unlike samtools flagstat/idxstats which does not separate unique or multi-mappers). Each splicing is counted in the numbers of splices, which would correspond to summing the counts in SJ.out.tab. The mismatch/indel error rates are calculated on a per base basis, i.e. as total number of mismatches/indels in all unique mappers divided by the total number of mapped bases.

I have tried various approaches to counting the values in columns 7 and 8 of SJ.out.tab -- they do not add up to 362050, or 350494, or anything like them. I have also tried a wide variety of approaches to regenerate the splicing numbers in the Log.final.out from Aligned.out.bam, and cannot do it. The very closest I could get was taking primary alignments from unique mappers and counting the total number of splices in the cigars:

samtools view -F 256 Aligned.out.bam | grep "NH:i:1" | cut -f6 | awk '{ while (sub(/N/,"")) j++ } END { print j }'


Which gets 362267 -- not 362050, but very close, closer than the other dozen approaches I tried. Mind you, this is the number of recorded splices, NOT the number of reads. For the reads,

samtools view -F 256 Aligned.out.bam | grep "NH:i:1" | cut -f6 | grep N | wc -l


Which is only 292903.

I hope this provides some insight -- and I hope someone else can give a better answer!

0
Entering edit mode

Thanks apa@stowers for the very detailed response, that helps.

Yes, I found the spliced alignment particularly confusing. And, I also tried samtools flagstat for stats on read1/read2 separately, seems it would not give you the number same as in STAR log output. I.e., the total number of reads in each of the two fastqs is 2,679,927, and what samtools flagstat shows is 2746705 for read1 and 2746088 for read2, which are bigger than the total read number :(, I guess that might be largely due to samtools does not differentiate uniquely mapped vs. multi-mapped reads? or something else?

Thanks again apa@stowers, and if someone else has such experience, please advise.