How to extract reads supporting protein-coding gene splice junctions from STAR's SJ.out.tab output
1
1
Entering edit mode
8.9 years ago

Hi,

I aligned my RNA-Seq reads using STAR on human hg19 genome index build with ENSEMBL annotation (--sjdbGTFfile option at the genome index generation). So in my SJ.out.tab file I will have information about the number of reads supporting splice junction as described in STAR manual :

column 1: chromosome
column 2: first base of the intron (1-based)
column 3: last base of the intron (1-based)
column 4: strand (0: undefined, 1: +, 2: -)
column 5: intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5:AT/AC, 6: GT/AT
column 6: 0: unannotated, 1: annotated (only if splice junctions database is used)
column 7: number of uniquely mapping reads crossing the junction
column 8: number of multi-mapping reads crossing the junction
column 9: maximum spliced alignment overhang

In ENSEMBL annotation, there are all types of RNA (mRNA, tRNA, lncRNA,...). But for my research I've to compute the number of uniquely mapped reads crossing only protein-coding genes splice junction. Is there a way to download splice acceptor/donor position in hg19 for protein-coding genes in order to subtract the read count from the SJ.out.tab file?

Thanks

splice STAR • 5.9k views
ADD COMMENT
3
Entering edit mode
8.9 years ago

HISAT actually comes with a python script that can extract splice sites from a GTF file. So just take your GTF file, extract only the types of entries that you want to include, run extract_splice_sites.py on the results and then intersect the results appropriately.

ADD COMMENT
1
Entering edit mode

I re-wrote that script to also include input via pipe, so there's no need for the intermediate files. The only thing to add is that it outputs 0-based offsets for use by HISAT internally - both GTF and SAM are 1-based formats, so perhaps take that into account.

ADD REPLY
0
Entering edit mode

Thanks Devon. That's perfect. I extract the splice sites and overlapped them with my data using R.

ADD REPLY

Login before adding your answer.

Traffic: 2617 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6