How to get the number of exon within genes from GRCH37 reference
3
0
Entering edit mode
4.0 years ago
a.james ▴ 230

Dear All, How would get a list with number exons for all non-coding genes? Any suggestion would be really great Thank you

RNA-Seq next-gen gene • 1.5k views
ADD COMMENT
0
Entering edit mode

How would get a list with number exons for all non-coding genes?

It's unclear what you want to obtain, please elaborate. Ideal questions also have examples of input and output.

ADD REPLY
0
Entering edit mode

Non-coding genes? Do you mean pseudogenes? or non-coding transcripts?

ADD REPLY
3
Entering edit mode
4.0 years ago
EagleEye 6.8k

This I made it in very rough way. But it works. The below command will give you number of exons per transcript of non-coding regions from Ensembl GTF file. http://ftp.ensemblorg.ebi.ac.uk/pub/release-74/gtf/homo_sapiens/Homo_sapiens.GRCh37.74.gtf.gz

Create Exon numbers file with transcript ID:

zcat Homo_sapiens.GRCh37.74.gtf.gz | grep -v "protein_coding" | cut -f9 | sed 's/gene_id\ \"//' | sed 's/transcript_id\ \"//' | sed 's/exon_number\ \"//' | sed 's/gene_name\ \"//' | sed 's/gene_biotype\ \"//' | sed 's/\"\;/\t/g' | awk '{print $2"\t"$1"\t"$4"\t"$5;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' | awk -v FS="" 'BEGIN{print"Transcript\tExons"}{cnt=0;for (i=1;i<=NF;i++) if ($i==",") cnt++; print $0"\t"cnt}' | cut -f1,3 > transcript_with_numberOfExons.out

Create annotation file with transcript ID:

zcat Homo_sapiens.GRCh37.74.gtf.gz | cut -f9 | sed 's/gene_id\ \"//' | sed 's/transcript_id\ \"//' | sed 's/exon_number\ \"//' | sed 's/gene_name\ \"//' | sed 's/gene_biotype\ \"//' | sed 's/\"\;/\t/g' | awk '{print $2"\t"$1"\t"$4"\t"$5;}' | awk '!x[$0]++' > Ensembl_ENST_annotation.table

Merge the files to have complete transcript information:

join -j1 <(sort Ensembl_ENST_annotation.table) <(sort transcript_with_numberOfExons.out) -t $'\t' > Complete_annotation_exonNumbers.out

Final Output: (Last column is exon numbers for transcripts from first column)

ENST00000021776 ENSG00000020219 CCT8L1P pseudogene  1
ENST00000162863 ENSG00000186704 DTX2P1  pseudogene  9
ENST00000216407 ENSG00000256075 RP11-81H14.3    pseudogene  1
ENST00000219758 ENSG00000103472 RRN3P2  pseudogene  11
ENST00000221169 ENSG00000104725 NEFL    processed_transcript    4
ENST00000222301 ENSG00000105694 TCEB1P28    pseudogene  2
ENST00000222396 ENSG00000226472 RP11-551L14.4   lincRNA 4
ENST00000223076 ENSG00000184414 RP11-44M6.3 pseudogene  1
ENST00000225587 ENSG00000108452 ZNF29P  pseudogene  1
ENST00000225973 ENSG00000267267 CTD-3199J23.4   lincRNA 1

JFI: Single gene (ENSG) can have multiple transcripts/isoforms (ENSTs)

ADD COMMENT
1
Entering edit mode
4.0 years ago

using mysql UCSC/knownGene

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -e 'select chrom,txStart,txEnd,name,strand,exonCount from knownGene where cdsStart=cdsEnd ' 
+-------+---------+-------+------------+--------+-----------+
| chrom | txStart | txEnd | name       | strand | exonCount |
+-------+---------+-------+------------+--------+-----------+
| chr1  |   11873 | 14409 | uc001aaa.3 | +      |         3 |
| chr1  |   11873 | 14409 | uc010nxr.1 | +      |         3 |
| chr1  |   14361 | 16765 | uc009vis.3 | -      |         4 |
| chr1  |   14361 | 19759 | uc009vit.3 | -      |         9 |
| chr1  |   14361 | 19759 | uc009viu.3 | -      |        10 |
| chr1  |   14361 | 19759 | uc001aae.4 | -      |        10 |
| chr1  |   14361 | 29370 | uc001aah.4 | -      |        11 |
| chr1  |   14361 | 29370 | uc009vir.3 | -      |        10 |
| chr1  |   14361 | 29370 | uc009viq.3 | -      |         7 |
| chr1  |   14361 | 29370 | uc001aac.4 | -      |        11 |
ADD COMMENT
0
Entering edit mode

Nice. I am just wondering, is there a way to get only non-coding regions directly from mysql ?

ADD REPLY
1
Entering edit mode

no, you need to use bedtools complement

ADD REPLY
0
Entering edit mode

Okay. Thanks a lot.

ADD REPLY
0
Entering edit mode
4.0 years ago
  1. Get GTF file
  2. Grep for some non-coding biotype(s)
  3. awk for exon
  4. Load into R
  5. reduce() in GenomicRanges
  6. done
ADD COMMENT

Login before adding your answer.

Traffic: 1760 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6