Question: Intron and exon coordinates of all genes from Hg19
0
gravatar for yliueagle
5 weeks ago by
yliueagle220
France
yliueagle220 wrote:

Is there a way to get Intron and exon coordinates of all genes from Hg19? The exons can be got from this post: Exon Coordinates Of Hg19 Genome Download, but how can I also get intron coordinates? I would keep the corresponding gene names in the result

Thanks!

exon intron cordinates gene • 157 views
ADD COMMENTlink modified 5 weeks ago by vkkodali1.6k • written 5 weeks ago by yliueagle220

If you have an annotation file:

A: how to get intronic and intergenic sequences based on gff file?

ADD REPLYlink written 5 weeks ago by ATpoint28k
0
gravatar for RamRS
5 weeks ago by
RamRS25k
Houston, TX
RamRS25k wrote:

You can use R to add 1 to the 5th field and subtract 1 from the 4th field in the next line (as long as gene identifier matches) in the curl solution from that post. You may need a custom function.

Or you can create a bed file from the exon coordinates, another bed file with one line per chromosome, with start = min(all exon start coordinates for that chr) and end=max(all exon end coordinates for that chr). You can then use bedtools or bedops to subtract the exon bed from the other bed to get all introns. I'd personally go with this solution as it would be easier to implement.

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by RamRS25k
0
gravatar for Pierre Lindenbaum
5 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum125k wrote:
$ curl  -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" | gunzip -c | head | awk '{n=int($8); split($9,S,/,/);split($10,E,/,/); for(i=1;i<=n;++i) {printf("%s,%s,%s,%s,%s,EXON\n",$1,$2,$3,S[i],E[i]); if(i+1<=n) printf("%s,%s,%s,%s,%s,INTRON\n",$1,$2,$3,int(E[i]),int(S[i+1]));  }}' 
uc001aaa.3,chr1,+,11873,12227,EXON
uc001aaa.3,chr1,+,12227,12612,INTRON
uc001aaa.3,chr1,+,12612,12721,EXON
uc001aaa.3,chr1,+,12721,13220,INTRON
uc001aaa.3,chr1,+,13220,14409,EXON
uc010nxr.1,chr1,+,11873,12227,EXON
uc010nxr.1,chr1,+,12227,12645,INTRON
uc010nxr.1,chr1,+,12645,12697,EXON
uc010nxr.1,chr1,+,12697,13220,INTRON
uc010nxr.1,chr1,+,13220,14409,EXON
uc010nxq.1,chr1,+,11873,12227,EXON
uc010nxq.1,chr1,+,12227,12594,INTRON
uc010nxq.1,chr1,+,12594,12721,EXON
uc010nxq.1,chr1,+,12721,13402,INTRON
uc010nxq.1,chr1,+,13402,14409,EXON
uc009vis.3,chr1,-,14361,14829,EXON
uc009vis.3,chr1,-,14829,14969,INTRON
uc009vis.3,chr1,-,14969,15038,EXON
uc009vis.3,chr1,-,15038,15795,INTRON
uc009vis.3,chr1,-,15795,15942,EXON
uc009vis.3,chr1,-,15942,16606,INTRON
uc009vis.3,chr1,-,16606,16765,EXON
uc009vit.3,chr1,-,14361,14829,EXON
uc009vit.3,chr1,-,14829,14969,INTRON
(...)
ADD COMMENTlink written 5 weeks ago by Pierre Lindenbaum125k
0
gravatar for vkkodali
5 weeks ago by
vkkodali1.6k
United States
vkkodali1.6k wrote:

Here's a way to extract exon and intron features from GFF3 files in BED format. I am using the RefSeq's latest GRCh37 (hg19) annotation but I'd expect the following code to work (may be with minimal changes) with any other GFF3 file.

I make a distinction between 'all' and 'unique' features as follows. Let's consider the following gene:

GENE            ######################################################
ISOFORM-1       #########--------#######-------####----------#########
                E1.1             E1.2          E1.3          E1.4
                         I1.1           I1.2       I1.3
ISOFORM-2       #########--------###########---------###-----#########
                E2.1             E2.2                E2.3    E2.4
                         I2.1               I2.2        I2.3

In my 'all' output files, each and every exon and intron will be returned. In my 'uniq' output files, identical features are omitted. So, the 'uniq' exon file will include only one of E1.1/E2.1 and E1.4/E2.4, and the uniq intron file will include only one of I1.1/I2.1.

## download GFF3 file
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz

## extract all exons
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id=/)' \
  | cut -f1,4,5,7,9 \
  | perl -pe 's/ID.*gene=([^;]*).*transcript_id=([^;]*).*/\1|\2/g' \
  | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2-1,$3,$5,"1000",$4}' \
  | sort -k1,1 -k2,2n > hg19_exons.all.bed 

## extract unique exons  
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id=/)' \
  | cut -f1,4,5,7,9 \
  | perl -pe 's/ID.*gene=([^;]*).*/\1/g' \
  | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2-1,$3,$5,"1000",$4}' \
  | sort -u | sort -k1,1 -k2,2n > hg19_exons.uniq.bed 

## extract all introns 
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id/)' \
  | perl -pe 's/ID.*gene=([^;]*).*transcript_id=([^;]*).*/\1|\2/g' \
  | cut -f1,4,5,7,9 | sort -k1,1 -k5,5 -k2,2n -k3,3n \
  | awk 'BEGIN{FS="\t";OFS="\t";chr=0;d=0;rs="NA"}{if(chr==$1 && rs==$5){print $1,d,$2-1,rs,0,$4; d=$3}else{chr=$1;rs=$5;d=$3}}' \
  | sort -u | sort -k1,1 -k2,2n > hg19_introns.all.bed 

## extract unique introns 
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id/)' \
  | perl -pe 's/ID.*gene=([^;]*).*transcript_id=([^;]*).*/\1|\2/g' \
  | cut -f1,4,5,7,9 | sort -k1,1 -k5,5 -k2,2n -k3,3n \
  | awk 'BEGIN{FS="\t";OFS="\t";chr=0;d=0;rs="NA"}{if(chr==$1 && rs==$5){print $1,d,$2-1,rs,0,$4; d=$3}else{chr=$1;rs=$5;d=$3}}' \
  | awk 'BEGIN{FS="\t";OFS="\t"}{split($4,a,"|"); $4=a[1]; print $0}' \
  | sort -u | sort -k1,1 -k2,2n > hg19_introns.uniq.bed
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by vkkodali1.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1260 users visited in the last hour