How can I classify circRNA as exonic, intronic or intergenic from the output of find_circ
6.0 years ago

I have a list of circRNAs identified by the circRNA identification tool find_circ. How can I classify these circRNAs as exonic, intronic or intergenic? Are there any tools or scripts for this? Some lines of output from find_circ are given below:

# chrom start   end name    n_reads strand  n_uniq  best_qual_A best_qual_B

chr4    166006737   166024248   Sy5y_D0_circ_000001 2   -   1   5   40
chr7    101950003   101952188   Sy5y_D0_circ_000002 1   +   1   5   5
chr5    619104  620376  Sy5y_D0_circ_000003 2   +   2   5   40

Thanks in advance.

circRNA RNA-Seq • 2.0k views
6.0 years ago

If you want comprehensive annotation for your circular RNAs, I would use BEDTools to overlap your regions with the GENCODE comprehensive GTF annotation. This has the coordinates of upward of 200,000 transcripts (and their isoforms) annotated as part of the ENCODE project.

  1. Download the Comprehensive gene annotation from https://www.gencodegenes.org/releases/current.html (hg38) (for hg19: https://www.gencodegenes.org/releases/grch37_mapped_releases.html )
  2. Overlap your find_circ output with these regions using BEDTools:


bedtools intersect -a find_circ.output.txt -b gencode.v28.annotation.gtf.gz

For more information on BEDTools intersect, see: http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

Kevin


Thank you for your response. Running the above command, I got results like the following:

chr4    166007942   166008048   Sy5y_D0_circ_000001 2   -   1   5   40
chr4    166007942   166008048   Sy5y_D0_circ_000001 2   -   1   5   40
chr4    166014435   166014560   Sy5y_D0_circ_000001 2   -   1   5   40

But I want the results as follows:

chr4    166006737   166024248     exon
chr7    101950003   101952188     intron
chr5    619104  620376    intergenic

Any suggestions will be appreciated.


I see. Please try this instead:

bedtools intersect -a circ_rna.bed -b gencode.v28.annotation.gtf -wb
chr4    166007942   166008048   Sy5y_D0_circ_000001 2   -   1   5   40  chr4    HAVANA  exon    166007943   166008048   .   +   .   gene_id "ENSG00000038295.7"; transcript_id "ENST00000061240.6"; gene_type "protein_coding"; gene_name "TLL1"; transcript_type "protein_coding"; transcript_name "RP11-624O16.1-001"; exon_number 7; exon_id "ENSE00003485218.1"; level 2; protein_id "ENSP00000061240.2"; transcript_support_level "1"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS3811.1"; havana_gene "OTTHUMG00000161112.3"; havana_transcript "OTTHUMT00000363821.1";
chr4    166007942   166008048   Sy5y_D0_circ_000001 2   -   1   5   40  chr4    HAVANA  CDS 166007943   166008048   .   +   2   gene_id "ENSG00000038295.7"; transcript_id "ENST00000061240.6"; gene_type "protein_coding"; gene_name "TLL1"; transcript_type "protein_coding"; transcript_name "RP11-624O16.1-001"; exon_number 7; exon_id "ENSE00003485218.1"; level 2; protein_id "ENSP00000061240.2"; transcript_support_level "1"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS3811.1"; havana_gene "OTTHUMG00000161112.3"; havana_transcript "OTTHUMT00000363821.1";
chr4    166014435   166014560   Sy5y_D0_circ_000001 2   -   1   5   40  chr4    HAVANA  exon    166014436   166014560   .   +   .   gene_id "ENSG00000038295.7"; transcript_id "ENST00000061240.6"; gene_type "protein_coding"; gene_name "TLL1"; transcript_type "protein_coding"; transcript_name "RP11-624O16.1-001"; exon_number 8; exon_id "ENSE00003496842.1"; level 2; protein_id "ENSP00000061240.2"; transcript_support_level "1"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS3811.1"; havana_gene "OTTHUMG00000161112.3"; havana_transcript "OTTHUMT00000363821.1";

If you want to tidy this output, then pipe it into the cut command to keep only the columns you need.
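
As a minimal sketch of that tidying step: assuming the nine find_circ columns come first, the GTF feature type lands in column 12 of the -wb output, so something like the following keeps the circRNA coordinates, its name, and the feature type. The function name tidy_overlaps is made up for illustration, and the column numbers would change if your input has a different number of columns.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce `bedtools intersect -wb` output to
# chrom, start, end, circRNA name, and GTF feature type.
# Assumes 9 find_circ columns followed by the 9 GTF columns,
# so the feature type (exon, CDS, UTR, ...) is column 12.
tidy_overlaps() {
    cut -f1-4,12 | sort -u   # sort -u collapses rows that become identical after trimming
}

# Usage (not run here):
#   bedtools intersect -a circ_rna.bed -b gencode.v28.annotation.gtf -wb | tidy_overlaps
```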

This will only return annotated features such as exons, CDSs, and UTRs, though, because that is what the GENCODE GTF files list; introns and intergenic regions have no records of their own. However, the annotation does contain all currently known non-coding RNA species. If you want introns and intergenic regions, then I suggest different options:

An issue that you face with these regions is that they overlap both introns and exons concurrently, i.e., they are very large circular RNAs.
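
One possible way to get a single label per circRNA, sketched under assumptions that are mine rather than anything confirmed in this thread: run bedtools intersect with -loj so that circRNAs with no overlap are still reported, then call a circRNA "exon" if it overlaps any exon record, "intron" if it only overlaps gene or transcript records, and "intergenic" otherwise. The function name classify_circ is hypothetical, strand is ignored, and the priority rule (exon beats intron) is a simplification, since a large circRNA that spans both exons and introns will simply be labelled "exon".

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `bedtools intersect -loj` output on stdin
# (9 find_circ columns, then the 9 GTF columns) and prints one line per
# circRNA: chrom, start, end, label.
# Assumed rule: exon > intron > intergenic; strand is not considered.
classify_circ() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        key = $1 OFS $2 OFS $3
        if (!(key in rank)) { rank[key] = 0; order[++n] = key }
        if ($12 == "exon") r = 2                              # overlaps an annotated exon
        else if ($12 == "gene" || $12 == "transcript") r = 1  # inside a gene body
        else r = 0                                            # no overlap, or a feature type not used here
        if (r > rank[key]) rank[key] = r
    }
    END {
        label[0] = "intergenic"; label[1] = "intron"; label[2] = "exon"
        for (i = 1; i <= n; i++) print order[i], label[rank[order[i]]]
    }'
}

# Usage (not run here):
#   bedtools intersect -a circ_rna.bed -b gencode.v28.annotation.gtf -loj | classify_circ
```

With -loj the original circRNA coordinates are reported rather than the clipped overlaps, so the output rows match the intervals in the find_circ file.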

Kevin
