Extract gene coordinates from GTF
3
3
7.4 years ago
int11ap1 ▴ 440

Hello,

I have a GTF file that contains only exon features. Is there a way to extract the gene coordinates from it, or should I write a script?

  • INPUT: GTF file.
  • OUTPUT: the gene coordinates, whatever the format is.

Thanks.

coordinates gtf • 10k views
ADD COMMENT
0

Please make it clearer by showing your input file and desired output.

ADD REPLY
0

Done, I cannot be more clear.

ADD REPLY
6
7.4 years ago

The question becomes exactly what you want in terms of coordinates for a gene. I'm guessing that you just want the 5'-most and 3'-most positions along with the strand and chromosome, but perhaps you have something else in mind.

Presuming you do want what I mentioned, you could easily do this in R with GenomicFeatures.

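library(GenomicFeatures)
# Older GenomicFeatures releases call this function makeTranscriptDbFromGFF
txdb <- makeTxDbFromGFF("some_file.gtf", format="gtf")
genes <- genes(txdb)
# The write.table argument is col.names, not colnames
write.table(as.data.frame(genes)[,-4], file="Just_genes.txt", col.names=F, sep="\t")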

The -4 just removes the width column.
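For anyone who would rather not load R, the same idea (lowest exon start and highest exon end per gene, plus chromosome and strand) can be sketched in plain Python. This is only a minimal sketch: it assumes a tab-separated GTF whose attribute column carries `gene_id "..."`, and the function name `gene_spans` and the example records are made up for illustration.

```python
import re

# Assumption: every exon record has a gene_id "..." attribute in column 9.
GENE_ID = re.compile(r'gene_id "([^"]+)"')

def gene_spans(gtf_lines):
    """Collapse exon records into one (start, end) span per (chrom, strand, gene)."""
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features contribute to the span
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        start, end = int(fields[3]), int(fields[4])
        key = (fields[0], fields[6], match.group(1))
        if key in spans:
            low, high = spans[key]
            spans[key] = (min(low, start), max(high, end))
        else:
            spans[key] = (start, end)
    return spans

# Made-up records, for illustration only.
example_gtf = [
    'chr1\ttest\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";',
    'chr1\ttest\texon\t300\t450\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";',
    'chr2\ttest\texon\t50\t80\t.\t-\t.\tgene_id "geneB"; transcript_id "tx2";',
]

for (chrom, strand, gene), (start, end) in gene_spans(example_gtf).items():
    print(chrom, start, end, strand, gene, sep="\t")
```

Keying on chromosome and strand as well as gene ID keeps same-named genes on different chromosomes or strands from being merged into one span.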

ADD COMMENT
3

Years later, here is a small update that may save someone two minutes: in recent versions of the GenomicFeatures package, the function makeTranscriptDbFromGFF has been renamed makeTxDbFromGFF.

ADD REPLY
0

I'm getting this error:

Error in write.table(as.data.frame(genes)[, -4], file = "Just_genes.txt",  :
  unused argument (colnames = F)

ADD REPLY
1

The argument is spelled col.names, not colnames. Try again with:

write.table(as.data.frame(genes)[,-4], file="Just_genes.txt", col.names=F, sep="\t")
ADD REPLY
0

Hi Devon, how would you get the gene description along with the gene coordinates, using a script similar to the one you presented? Thanks,

Dave

ADD REPLY
0

Hi @Devon Ryan, I have a GFF file in this format:

chr11   Gnomon  gene    24482947    24484914    .   -   .   ID=gene26171;Name=LOC103966979;Name=gene26171
chr11   Gnomon  mRNA    24482947    24484914    .   -   .   Parent=gene26171;ID=rna33198
chr11   Gnomon  five_prime_UTR  24484810    24484914    .   -   .   ID=five_prime_UTR:rna33198:1;Parent=rna33198
chr11   Gnomon  start_codon 24484807    24484809    .   -   0   ID=start_codon:rna33198:1;Parent=rna33198
chr11   Gnomon  exon    24484587    24484914    .   -   .   ID=exon:rna33198:1;Parent=rna33198
chr11   Gnomon  exon    24484138    24484445    .   -   .   ID=exon:rna33198:2;Parent=rna33198
chr11   Gnomon  exon    24482947    24483988    .   -   .   ID=exon:rna33198:3;Parent=rna33198
chr11   Gnomon  CDS 24484587    24484809    .   -   0   Parent=rna33198;ID=CDS:rna33198:1
chr11   Gnomon  CDS 24484138    24484445    .   -   2   Parent=rna33198;ID=CDS:rna33198:2
chr11   Gnomon  CDS 24483413    24483988    .   -   0   Parent=rna33198;ID=CDS:rna33198:3
chr11   Gnomon  gene    21571688    21575140    .   -   .   Name=LOC103939934;ID=gene39438;Name=gene39438
chr11   Gnomon  mRNA    21571688    21575140    .   -   .   ID=rna49862;Parent=gene39438
chr11   Gnomon  five_prime_UTR  21575032    21575140    .   -   .   ID=five_prime_UTR:rna49862:1;Parent=rna49862
chr11   Gnomon  five_prime_UTR  21574449    21574449    .   -   .   Parent=rna49862;ID=five_prime_UTR:rna49862:2
chr11   Gnomon  exon    21575032    21575140    .   -   .   ID=exon:rna49862:1;Parent=rna49862
chr11   Gnomon  exon    21574389    21574449    .   -   .   Parent=rna49862;ID=exon:rna49862:2
chr11   Gnomon  exon    21572908    21572989    .   -   .   ID=exon:rna49862:3;Parent=rna49862
chr11   Gnomon  exon    21572290    21572417    .   -   .   ID=exon:rna49862:4;Parent=rna49862
chr11   Gnomon  exon    21571688    21572198    .   -   .   ID=exon:rna49862:5;Parent=rna49862
chr11   Gnomon  start_codon 21574446    21574448    .   -   0   Parent=rna49862;ID=start_codon:rna49862:1
chr11   Gnomon  CDS 21574389    21574448    .   -   0   ID=CDS:rna49862:1;Parent=rna49862
chr11   Gnomon  CDS 21572908    21572989    .   -   0   Parent=rna49862;ID=CDS:rna49862:2
chr11   Gnomon  CDS 21572290    21572417    .   -   2   Parent=rna49862;ID=CDS:rna49862:3
chr11   Gnomon  CDS 21571866    21572198    .   -   0   Parent=rna49862;ID=CDS:rna49862:4

and I have these gene IDs:

LOC103966979
LOC103939934

and I want to extract their transcript info in this format:

chr11   Gnomon  mRNA    24482947    24484914    .   -   .   ID=LOC103966979
chr11   Gnomon  five_prime_UTR  24484810    24484914    .   -   .   ID=five_prime_UTR:rna33198:1;Parent=LOC103966979
chr11   Gnomon  start_codon 24484807    24484809    .   -   0   ID=start_codon:rna33198:1;Parent=LOC103966979
chr11   Gnomon  CDS 24484587    24484809    .   -   0   ID=CDS:rna33198:1;Parent=LOC103966979
chr11   Gnomon  CDS 24484138    24484445    .   -   2   ID=CDS:rna33198:2;Parent=LOC103966979
chr11   Gnomon  CDS 24483413    24483988    .   -   0   ID=CDS:rna33198:3;Parent=LOC103966979
chr11   Gnomon  mRNA    21571688    21575140    .   -   .   ID=LOC103939934
chr11   Gnomon  five_prime_UTR  21575032    21575140    .   -   .   ID=five_prime_UTR:rna49862:1;Parent=LOC103939934
chr11   Gnomon  five_prime_UTR  21574449    21574449    .   -   .   ID=five_prime_UTR:rna49862:2;Parent=LOC103939934
chr11   Gnomon  start_codon 21574446    21574448    .   -   0   ID=start_codon:rna49862:1;Parent=LOC103939934
chr11   Gnomon  CDS 21574389    21574448    .   -   0   ID=CDS:rna49862:1;Parent=LOC103939934
chr11   Gnomon  CDS 21572908    21572989    .   -   0   ID=CDS:rna49862:2;Parent=LOC103939934
chr11   Gnomon  CDS 21572290    21572417    .   -   2   ID=CDS:rna49862:3;Parent=LOC103939934
chr11   Gnomon  CDS 21571866    21572198    .   -   0   ID=CDS:rna49862:4;Parent=LOC103939934

Thanks for any advice.

ADD REPLY
0

You'd need to script something.
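A rough Python sketch of such a script follows. It assumes tab-separated GFF3 in which gene records carry the requested name in a `Name` attribute and mRNA records point at their gene through `Parent`, as in the example above. Dropping `gene` and `exon` lines is inferred from the desired output rather than stated in the question, and the function names are made up for illustration.

```python
def parse_attributes(column9):
    """Return the GFF3 attribute column as a list of (key, value) pairs."""
    pairs = []
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            pairs.append((key, value))
    return pairs

def relabel_transcripts(gff_lines, wanted_names, skip_types=("gene", "exon")):
    """Keep mRNAs of the wanted genes plus their children, relabelled by gene name."""
    records = []
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) != 9:
            continue
        records.append((fields, parse_attributes(fields[8])))

    # gene ID -> requested name (a gene line may carry several Name values)
    gene_name = {}
    for fields, attrs in records:
        if fields[2] == "gene":
            names = [v for k, v in attrs if k == "Name" and v in wanted_names]
            ids = [v for k, v in attrs if k == "ID"]
            if names and ids:
                gene_name[ids[0]] = names[0]

    # mRNA ID -> requested gene name, via the mRNA's Parent attribute
    rna_name = {}
    for fields, attrs in records:
        a = dict(attrs)
        if fields[2] == "mRNA" and "ID" in a and a.get("Parent") in gene_name:
            rna_name[a["ID"]] = gene_name[a["Parent"]]

    out = []
    for fields, attrs in records:
        a = dict(attrs)
        if fields[2] == "mRNA" and a.get("ID") in rna_name:
            new_attrs = "ID=" + rna_name[a["ID"]]
        elif fields[2] not in skip_types and a.get("Parent") in rna_name:
            new_attrs = "ID={};Parent={}".format(a.get("ID", "."), rna_name[a["Parent"]])
        else:
            continue
        out.append("\t".join(fields[:8] + [new_attrs]))
    return out

# A few records from the question, rebuilt with tab separators.
example_gff = [
    "chr11\tGnomon\tgene\t24482947\t24484914\t.\t-\t.\tID=gene26171;Name=LOC103966979;Name=gene26171",
    "chr11\tGnomon\tmRNA\t24482947\t24484914\t.\t-\t.\tParent=gene26171;ID=rna33198",
    "chr11\tGnomon\texon\t24484587\t24484914\t.\t-\t.\tID=exon:rna33198:1;Parent=rna33198",
    "chr11\tGnomon\tCDS\t24484587\t24484809\t.\t-\t0\tParent=rna33198;ID=CDS:rna33198:1",
]

for line in relabel_transcripts(example_gff, {"LOC103966979"}):
    print(line)
```

The sketch reads the whole file into memory so that parents can be resolved regardless of record order; for a very large annotation a streaming approach would be needed instead.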

ADD REPLY
5
7.4 years ago

Using awk and sqlite (take the minimum exon start and maximum exon end per chromosome and gene):

curl -sL "https://rseqflow.googlecode.com/files/mouse_refseq_anno.gtf"   |\
awk -F '\t' 'BEGIN {printf("create temp table T(chrom,start,end,gene); begin transaction;\n");} $3=="exon" {n=split($9,a,/[ ;]+/);for(i=1;i+1< n;i++) if(a[i]=="gene_id") printf("insert into T(chrom,start,end,gene) values (\"%s\",%s,%s,%s);\n",$1,$4,$5,a[i+1]);} END {printf("commit; select chrom,gene,min(start),max(end) from T group by chrom,gene;\n");}' |\
sqlite3 tmp.db
(...)
chrY|Rbm31y|12688110|17402718
chrY|Rbmy1a1|2830680|3783271
chrY|Sly|55213720|75222053
ADD COMMENT
5
6.7 years ago

Using gtf2bed (part of BEDOPS):

$ gtf2bed < foo.gtf | cut -f1-3 > foo_coords.bed3

If you want strand information:

$ gtf2bed < foo.gtf | cut -f1-6 > foo_coords.bed6
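One thing to keep in mind when moving from GTF to BED is the coordinate convention: GTF positions are 1-based and inclusive, while BED positions are 0-based and half-open, so the start shifts down by one and the end stays the same. The Python sketch below only illustrates that shift for a single record. It is not how gtf2bed is implemented, and the function name and the choice of `gene_id` as the BED name column are assumptions for the example.

```python
import re

def gtf_line_to_bed6(line):
    """Convert one tab-separated GTF record to a BED6 line (illustrative only)."""
    fields = line.rstrip("\n").split("\t")
    chrom, score, strand = fields[0], fields[5], fields[6]
    start, end = int(fields[3]), int(fields[4])
    # Assumption: use gene_id as the BED name, or "." when it is absent.
    match = re.search(r'gene_id "([^"]+)"', fields[8])
    name = match.group(1) if match else "."
    # 1-based inclusive [start, end] becomes 0-based half-open [start - 1, end).
    return "\t".join([chrom, str(start - 1), str(end), name, score, strand])

# Made-up record, for illustration only.
record = 'chr1\ttest\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";'
print(gtf_line_to_bed6(record))
```

The interval length is unchanged by the conversion: 200 - 100 + 1 in GTF terms equals 200 - 99 in BED terms.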
ADD COMMENT
