Question

How to get the number of exon within genes from GRCH37 reference

0

Entering edit mode

8.2 years ago

a.james ▴ 240

Dear All, How would get a list with number exons for all non-coding genes? Any suggestion would be really great Thank you

RNA-Seq next-gen gene • 2.9k views

ADD COMMENT • link updated 8.2 years ago by Pierre Lindenbaum 166k • written 8.2 years ago by a.james ▴ 240

0

Entering edit mode

How would get a list with number exons for all non-coding genes?

It's unclear what you want to obtain, please elaborate. Ideal questions also have examples of input and output.

ADD REPLY • link 8.2 years ago by WouterDeCoster 48k

0

Entering edit mode

Non-coding genes? Do you mean pseudogenes? or non-coding transcripts?

ADD REPLY • link 8.2 years ago by agata88 ▴ 870

1

Entering edit mode

8.2 years ago

Pierre Lindenbaum 166k

using mysql UCSC/knownGene

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -e 'select chrom,txStart,txEnd,name,strand,exonCount from knownGene where cdsStart=cdsEnd ' 
+-------+---------+-------+------------+--------+-----------+
| chrom | txStart | txEnd | name       | strand | exonCount |
+-------+---------+-------+------------+--------+-----------+
| chr1  |   11873 | 14409 | uc001aaa.3 | +      |         3 |
| chr1  |   11873 | 14409 | uc010nxr.1 | +      |         3 |
| chr1  |   14361 | 16765 | uc009vis.3 | -      |         4 |
| chr1  |   14361 | 19759 | uc009vit.3 | -      |         9 |
| chr1  |   14361 | 19759 | uc009viu.3 | -      |        10 |
| chr1  |   14361 | 19759 | uc001aae.4 | -      |        10 |
| chr1  |   14361 | 29370 | uc001aah.4 | -      |        11 |
| chr1  |   14361 | 29370 | uc009vir.3 | -      |        10 |
| chr1  |   14361 | 29370 | uc009viq.3 | -      |         7 |
| chr1  |   14361 | 29370 | uc001aac.4 | -      |        11 |

ADD COMMENT • link 8.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Nice. I am just wondering, is there a way to get only non-coding regions directly from mysql ?

ADD REPLY • link 8.2 years ago by EagleEye 7.6k

1

Entering edit mode

no, you need to use bedtools complement

ADD REPLY • link 8.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Okay. Thanks a lot.

ADD REPLY • link 8.2 years ago by EagleEye 7.6k

0

Entering edit mode

8.2 years ago

Devon Ryan 105k

Get GTF file
Grep for some non-coding biotype(s)
awk for exon
Load into R
reduce() in GenomicRanges
done

ADD COMMENT • link 8.2 years ago by Devon Ryan 105k

score 3 · Accepted Answer · 2017-04-20

This I made it in very rough way. But it works. The below command will give you number of exons per transcript of non-coding regions from Ensembl GTF file. http://ftp.ensemblorg.ebi.ac.uk/pub/release-74/gtf/homo_sapiens/Homo_sapiens.GRCh37.74.gtf.gz

Create Exon numbers file with transcript ID:

zcat Homo_sapiens.GRCh37.74.gtf.gz | grep -v "protein_coding" | cut -f9 | sed 's/gene_id\ \"//' | sed 's/transcript_id\ \"//' | sed 's/exon_number\ \"//' | sed 's/gene_name\ \"//' | sed 's/gene_biotype\ \"//' | sed 's/\"\;/\t/g' | awk '{print $2"\t"$1"\t"$4"\t"$5;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' | awk -v FS="" 'BEGIN{print"Transcript\tExons"}{cnt=0;for (i=1;i<=NF;i++) if ($i==",") cnt++; print $0"\t"cnt}' | cut -f1,3 > transcript_with_numberOfExons.out

Create annotation file with transcript ID:

zcat Homo_sapiens.GRCh37.74.gtf.gz | cut -f9 | sed 's/gene_id\ \"//' | sed 's/transcript_id\ \"//' | sed 's/exon_number\ \"//' | sed 's/gene_name\ \"//' | sed 's/gene_biotype\ \"//' | sed 's/\"\;/\t/g' | awk '{print $2"\t"$1"\t"$4"\t"$5;}' | awk '!x[$0]++' > Ensembl_ENST_annotation.table

Merge the files to have complete transcript information:

join -j1 <(sort Ensembl_ENST_annotation.table) <(sort transcript_with_numberOfExons.out) -t $'\t' > Complete_annotation_exonNumbers.out

Final Output: (Last column is exon numbers for transcripts from first column)

ENST00000021776 ENSG00000020219 CCT8L1P pseudogene  1
ENST00000162863 ENSG00000186704 DTX2P1  pseudogene  9
ENST00000216407 ENSG00000256075 RP11-81H14.3    pseudogene  1
ENST00000219758 ENSG00000103472 RRN3P2  pseudogene  11
ENST00000221169 ENSG00000104725 NEFL    processed_transcript    4
ENST00000222301 ENSG00000105694 TCEB1P28    pseudogene  2
ENST00000222396 ENSG00000226472 RP11-551L14.4   lincRNA 4
ENST00000223076 ENSG00000184414 RP11-44M6.3 pseudogene  1
ENST00000225587 ENSG00000108452 ZNF29P  pseudogene  1
ENST00000225973 ENSG00000267267 CTD-3199J23.4   lincRNA 1

JFI: Single gene (ENSG) can have multiple transcripts/isoforms (ENSTs)