Question: Intron and exon coordinates of all genes from Hg19
0
gravatar for yliueagle
11 months ago by
yliueagle220
France
yliueagle220 wrote:

Is there a way to get Intron and exon coordinates of all genes from Hg19? The exons can be got from this post: Exon Coordinates Of Hg19 Genome Download, but how can I also get intron coordinates? I would keep the corresponding gene names in the result

Thanks!

exon intron cordinates gene • 1.0k views
ADD COMMENTlink modified 11 months ago by vkkodali2.2k • written 11 months ago by yliueagle220

If you have an annotation file:

A: how to get intronic and intergenic sequences based on gff file?

ADD REPLYlink written 11 months ago by ATpoint41k
0
gravatar for RamRS
11 months ago by
RamRS30k
Baylor College of Medicine, Houston, TX
RamRS30k wrote:

You can use R to add 1 to the 5th field and subtract 1 from the 4th field in the next line (as long as gene identifier matches) in the curl solution from that post. You may need a custom function.

Or you can create a bed file from the exon coordinates, another bed file with one line per chromosome, with start = min(all exon start coordinates for that chr) and end=max(all exon end coordinates for that chr). You can then use bedtools or bedops to subtract the exon bed from the other bed to get all introns. I'd personally go with this solution as it would be easier to implement.

ADD COMMENTlink modified 11 months ago • written 11 months ago by RamRS30k
0
gravatar for Pierre Lindenbaum
11 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:
$ curl  -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" | gunzip -c | head | awk '{n=int($8); split($9,S,/,/);split($10,E,/,/); for(i=1;i<=n;++i) {printf("%s,%s,%s,%s,%s,EXON\n",$1,$2,$3,S[i],E[i]); if(i+1<=n) printf("%s,%s,%s,%s,%s,INTRON\n",$1,$2,$3,int(E[i]),int(S[i+1]));  }}' 
uc001aaa.3,chr1,+,11873,12227,EXON
uc001aaa.3,chr1,+,12227,12612,INTRON
uc001aaa.3,chr1,+,12612,12721,EXON
uc001aaa.3,chr1,+,12721,13220,INTRON
uc001aaa.3,chr1,+,13220,14409,EXON
uc010nxr.1,chr1,+,11873,12227,EXON
uc010nxr.1,chr1,+,12227,12645,INTRON
uc010nxr.1,chr1,+,12645,12697,EXON
uc010nxr.1,chr1,+,12697,13220,INTRON
uc010nxr.1,chr1,+,13220,14409,EXON
uc010nxq.1,chr1,+,11873,12227,EXON
uc010nxq.1,chr1,+,12227,12594,INTRON
uc010nxq.1,chr1,+,12594,12721,EXON
uc010nxq.1,chr1,+,12721,13402,INTRON
uc010nxq.1,chr1,+,13402,14409,EXON
uc009vis.3,chr1,-,14361,14829,EXON
uc009vis.3,chr1,-,14829,14969,INTRON
uc009vis.3,chr1,-,14969,15038,EXON
uc009vis.3,chr1,-,15038,15795,INTRON
uc009vis.3,chr1,-,15795,15942,EXON
uc009vis.3,chr1,-,15942,16606,INTRON
uc009vis.3,chr1,-,16606,16765,EXON
uc009vit.3,chr1,-,14361,14829,EXON
uc009vit.3,chr1,-,14829,14969,INTRON
(...)
ADD COMMENTlink written 11 months ago by Pierre Lindenbaum131k
0
gravatar for vkkodali
11 months ago by
vkkodali2.2k
United States
vkkodali2.2k wrote:

Here's a way to extract exon and intron features from GFF3 files in BED format. I am using the RefSeq's latest GRCh37 (hg19) annotation but I'd expect the following code to work (may be with minimal changes) with any other GFF3 file.

I make a distinction between 'all' and 'unique' features as follows. Let's consider the following gene:

GENE            ######################################################
ISOFORM-1       #########--------#######-------####----------#########
                E1.1             E1.2          E1.3          E1.4
                         I1.1           I1.2       I1.3
ISOFORM-2       #########--------###########---------###-----#########
                E2.1             E2.2                E2.3    E2.4
                         I2.1               I2.2        I2.3

In my 'all' output files, each and every exon and intron will be returned. In my 'uniq' output files, identical features are omitted. So, the 'uniq' exon file will include only one of E1.1/E2.1 and E1.4/E2.4, and the uniq intron file will include only one of I1.1/I2.1.

## download GFF3 file
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz

## extract all exons
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id=/)' \
  | cut -f1,4,5,7,9 \
  | perl -pe 's/ID.*gene=([^;]*).*transcript_id=([^;]*).*/\1|\2/g' \
  | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2-1,$3,$5,"1000",$4}' \
  | sort -k1,1 -k2,2n > hg19_exons.all.bed 

## extract unique exons  
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id=/)' \
  | cut -f1,4,5,7,9 \
  | perl -pe 's/ID.*gene=([^;]*).*/\1/g' \
  | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2-1,$3,$5,"1000",$4}' \
  | sort -u | sort -k1,1 -k2,2n > hg19_exons.uniq.bed 

## extract all introns 
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id/)' \
  | perl -pe 's/ID.*gene=([^;]*).*transcript_id=([^;]*).*/\1|\2/g' \
  | cut -f1,4,5,7,9 | sort -k1,1 -k5,5 -k2,2n -k3,3n \
  | awk 'BEGIN{FS="\t";OFS="\t";chr=0;d=0;rs="NA"}{if(chr==$1 && rs==$5){print $1,d,$2-1,rs,0,$4; d=$3}else{chr=$1;rs=$5;d=$3}}' \
  | sort -u | sort -k1,1 -k2,2n > hg19_introns.all.bed 

## extract unique introns 
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id/)' \
  | perl -pe 's/ID.*gene=([^;]*).*transcript_id=([^;]*).*/\1|\2/g' \
  | cut -f1,4,5,7,9 | sort -k1,1 -k5,5 -k2,2n -k3,3n \
  | awk 'BEGIN{FS="\t";OFS="\t";chr=0;d=0;rs="NA"}{if(chr==$1 && rs==$5){print $1,d,$2-1,rs,0,$4; d=$3}else{chr=$1;rs=$5;d=$3}}' \
  | awk 'BEGIN{FS="\t";OFS="\t"}{split($4,a,"|"); $4=a[1]; print $0}' \
  | sort -u | sort -k1,1 -k2,2n > hg19_introns.uniq.bed
ADD COMMENTlink modified 11 months ago • written 11 months ago by vkkodali2.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1904 users visited in the last hour