Question

number of exons

0

2.2 years ago

anba • 0

Hi,

I need to obtain a list with four columns:

gene_name genomic_length max_number_of_exons max_number_of_introns

and I can't find this information in aggregated form. I need to find the maximum number of exons and introns for each gene.

I tried Ensembl and UCSC, but they only give the number of transcripts, whereas I need the genomic length and the number of exons/introns. Where can I obtain such a list?

database • 979 views

ADD COMMENT • link updated 2.2 years ago by cpad0112 21k • written 2.2 years ago by anba • 0

0

Each gene has multiple transcripts, which vary in length and/or number of exons, so the table you describe either does not exist or is inherently incorrect.

ADD REPLY • link 2.2 years ago by WouterDeCoster 47k

0

It seems my question was unclear. I need the genomic length (not the transcript length) and the maximum number of exons/introns for each gene. I've corrected my question.

ADD REPLY • link 2.2 years ago by anba • 0

0

$ awk '$3=="exon" && $2 == "BestRefSeq" {print}'  GRCh38_latest_genomic.gtf| gffread -F --keep-exon-attrs --table "gene_id","transcript_id",@numexons | sort -k1,1 | head -20

A1BG    NM_130786.4 8
A1BG-AS1    NR_015380.2 4
A1CF    NM_001198818.2  14
A1CF    NM_001198819.2  15
A1CF    NM_001198820.2  14
A1CF    NM_001370130.1  12
A1CF    NM_001370131.1  12
A1CF    NM_014576.4 13
A1CF    NM_138932.3 13
A1CF    NM_138933.3 13
A2M NM_000014.6 36
A2M NM_001347423.2  37
A2M NM_001347424.2  36
A2M NM_001347425.2  35
A2M-AS1 NR_026971.1 3
A2M-AS1 NR_137424.1 2
A2M-AS1 NR_137425.1 2
A2ML1   NM_001282424.3  25
A2ML1   NM_144670.6 36
A2MP1   NR_040112.1 9

You can then take the maximum of the third column for each gene:

$ awk '$3=="exon" && $2 == "BestRefSeq" {print}'  GRCh38_latest_genomic.gtf| gffread -F --keep-exon-attrs --table "gene_id","transcript_id",@numexons | sort -k1,1 | datamash -sg1 max 3 | head

A1BG    8
A1BG-AS1    4
A1CF    15
A2M 37
A2M-AS1 3
A2ML1   36
A2MP1   9
A3GALT2 5
A4GALT  3
A4GNT   3

ADD REPLY • link 2.2 years ago by cpad0112 21k

score 5 · Accepted Answer · 2022-01-25

5

2.2 years ago

Pierre Lindenbaum 161k

Maximum number of exons per gene (GENCODE Basic V19, hg19):

$ wget -q  -O - "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/wgEncodeGencodeBasicV19.txt.gz" | gunzip -c |\
    awk '{printf("%s\t%s\n",$13,$9);}' |\
    sort -t $'\t' -k1,1 -k2,2nr |\
    sort -t $'\t' -k1,1 -u --stable 

5S_rRNA 1
7SK 3
A1BG    8
A1BG-AS1    4
A1CF    15
A2M 36
A2M-AS1 3
A2ML1   36
A2ML1-AS1   2
A2ML1-AS2   2
A2MP1   9
A3GALT2 5
A4GALT  2
A4GNT   3
AAAS    16
AACS    18
AACSP1  11
AADAC   5
AADACL2 5
AADACL3 4
AADACL4 4
AADAT   14
AAED1   6
AAGAB   10
AAK1    22
AAMDC   7
AAMP    11
AANAT   7
AAR2    4
AARD    2

ADD COMMENT • link 2.2 years ago by Pierre Lindenbaum 161k

1

$ datamash -sg13 max 9 < wgEncodeGencodeBasicV19.txt | head -20

5S_rRNA 1
7SK 3
A1BG    8
A1BG-AS1    4
A1CF    15
A2M 36
A2M-AS1 3
A2ML1   36
A2ML1-AS1   2
A2ML1-AS2   2
A2MP1   9
A3GALT2 5
A4GALT  2
A4GNT   3
AAAS    16
AACS    18
AACSP1  11
AADAC   5
AADACL2 5
AADACL3 4
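
The commands above only give the exon count. A rough sketch of how the four columns asked for in the question could be derived from the same table follows. It assumes the usual UCSC genePred layout with a leading bin column (column 5 = txStart, column 6 = txEnd, column 9 = exonCount, column 13 = gene symbol), takes the genomic length as the span from the smallest txStart to the largest txEnd over all transcripts of a gene, and takes the maximum intron count as the maximum exon count minus one. The function name `gene_summary` is made up for illustration. Gene symbols that occur on more than one chromosome or at several loci (for example multi-copy RNA genes) would get a meaningless span with this simple grouping, so treat the length for those with caution.

```shell
# Hypothetical helper: per-gene summary from a genePred-style table on stdin.
# Assumed columns: 5 = txStart, 6 = txEnd, 9 = exonCount, 13 = gene symbol.
# Output: gene_name, genomic_length, max_number_of_exons, max_number_of_introns
gene_summary() {
    awk -F '\t' -v OFS='\t' '
    {
        g = $13; s = $5 + 0; e = $6 + 0; n = $9 + 0   # force numeric comparison
        if (!(g in max_exons)) { start[g] = s; end[g] = e; max_exons[g] = n }
        if (s < start[g]) start[g] = s                # leftmost transcript start
        if (e > end[g])   end[g] = e                  # rightmost transcript end
        if (n > max_exons[g]) max_exons[g] = n        # largest exon count seen
    }
    END {
        for (g in max_exons)
            # txStart is 0-based and txEnd is exclusive, so end - start is the span length
            print g, end[g] - start[g], max_exons[g], max_exons[g] - 1
    }' | sort -k1,1
}
```

It could then be used as `gene_summary < wgEncodeGencodeBasicV19.txt | head`, or fed from `gunzip -c` on the compressed download.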

ADD REPLY • link 2.2 years ago by cpad0112 21k