Question: How to extract reads supporting protein-coding gene splice junctions from STAR's SJ.out.tab output
1
gravatar for Nicolas Rosewick
4.5 years ago by
Belgium, Brussels
Nicolas Rosewick8.5k wrote:

Hi,

I aligned my RNA-Seq reads using STAR on human hg19 genome index build with ENSEMBL annotation (--sjdbGTFfile  option at the genome index generation). So in my SJ.out.tab file I will have informations about the number of reads supporting splice junction as described in STAR manual :

column 1: chromosome
column 2: first base of the intron (1-based)
column 3: last base of the intron (1-based)
column 4: strand (0: undefined, 1: +, 2: -)
column 5: intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5:AT/AC, 6: GT/AT
column 6: 0: unannotated, 1: annotated (only if splice junctions database is used)
column 7: number of uniquely mapping reads crossing the junction
column 8: number of multi-mapping reads crossing the junction
column 9: maximum spliced alignment overhang

In ENSEMBL annotation, there are all types of RNA (mRNA, tRNA, lncRNA,...). But for my research I've to compute the number of uniquely mapped reads crossing only protein-coding genes splice junction. Is there a way to download splice acceptor/donor position in hg19 for protein-coding genes in order to substract the read count from the SJ.out.tab file ?

 

Thanks

splice star sj.out.tab • 4.2k views
ADD COMMENTlink modified 4.5 years ago by Devon Ryan93k • written 4.5 years ago by Nicolas Rosewick8.5k
3
gravatar for Devon Ryan
4.5 years ago by
Devon Ryan93k
Freiburg, Germany
Devon Ryan93k wrote:

HISAT actually comes with a python script that can extract splice sites from a GTF file. So just take your GTF file, extract only the types of entries that you want to include, run extract_splice_sites.py on the results and then intersect the results appropriately.

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Devon Ryan93k
1

I re-wrote that script to also include input via pipe, so there's no need for the intermediate files.  The only thing to add is that it outputs 0-based offsets for use by HISAT internally - both GTF and SAM are 1-based formats, so perhaps take that into account.

ADD REPLYlink written 4.5 years ago by george.ry1.1k

Thanks Devon. That's perfect. I extract the splice sites and overlapped them with my data using R.

ADD REPLYlink written 4.5 years ago by Nicolas Rosewick8.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1849 users visited in the last hour