Hello,
I have a GTF file with only exon features. Is there a way to extract the gene coordinates, or should I write a script?
- INPUT: GTF file.
- OUTPUT: the gene coordinates, whatever the format is.
Thanks.
The question becomes exactly what you want in terms of coordinates for a gene. I'm guessing that you just want the 5'-most and 3'-most positions along with the strand and chromosome, but perhaps you have something else in mind.
Presuming you do want what I mentioned, you could easily do this in R with GenomicFeatures.
library(GenomicFeatures)
# Older GenomicFeatures releases called this function makeTranscriptDbFromGFF()
txdb <- makeTxDbFromGFF("some_file.gtf", format="gtf")
genes <- genes(txdb)
write.table(as.data.frame(genes)[,-4], file="Just_genes.txt", col.names=FALSE, sep="\t")
The -4 just removes the width column.
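If R is not an option, the same idea (smallest exon start and largest exon end per gene, plus chromosome and strand) can be sketched in plain awk. This is only a minimal sketch, not part of the answer above: the function name gene_spans is made up for illustration, and it assumes a tab-separated GTF whose ninth column contains gene_id "..." and whose genes each sit on a single chromosome and strand.

```shell
gene_spans() {
  # $1 = GTF file with exon features (tab-separated, 1-based coordinates)
  awk -F '\t' -v OFS='\t' '
    $3 == "exon" && match($9, /gene_id "[^"]+"/) {
      # gene_id "X" -> X : skip the 9-character prefix and the closing quote
      gene = substr($9, RSTART + 9, RLENGTH - 10)
      key = $1 SUBSEP $7 SUBSEP gene
      # track the smallest start and the largest end seen for this gene
      if (!(key in lo) || $4 + 0 < lo[key]) lo[key] = $4 + 0
      if (!(key in hi) || $5 + 0 > hi[key]) hi[key] = $5 + 0
    }
    END {
      for (key in lo) {
        split(key, k, SUBSEP)
        # columns: chromosome, start, end, strand, gene_id
        print k[1], lo[key], hi[key], k[2], k[3]
      }
    }
  ' "$1" | sort -k1,1 -k2,2n
}
```

Calling gene_spans some_file.gtf > Just_genes.txt would write one line per gene. The coordinates stay 1-based as in the GTF, so subtract 1 from the start if you need BED.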
Hi @Devon Ryan, I have a GFF file in this format:
chr11   Gnomon  gene    24482947    24484914    .   -   .   ID=gene26171;Name=LOC103966979;Name=gene26171
chr11   Gnomon  mRNA    24482947    24484914    .   -   .   Parent=gene26171;ID=rna33198
chr11   Gnomon  five_prime_UTR  24484810    24484914    .   -   .   ID=five_prime_UTR:rna33198:1;Parent=rna33198
chr11   Gnomon  start_codon 24484807    24484809    .   -   0   ID=start_codon:rna33198:1;Parent=rna33198
chr11   Gnomon  exon    24484587    24484914    .   -   .   ID=exon:rna33198:1;Parent=rna33198
chr11   Gnomon  exon    24484138    24484445    .   -   .   ID=exon:rna33198:2;Parent=rna33198
chr11   Gnomon  exon    24482947    24483988    .   -   .   ID=exon:rna33198:3;Parent=rna33198
chr11   Gnomon  CDS 24484587    24484809    .   -   0   Parent=rna33198;ID=CDS:rna33198:1
chr11   Gnomon  CDS 24484138    24484445    .   -   2   Parent=rna33198;ID=CDS:rna33198:2
chr11   Gnomon  CDS 24483413    24483988    .   -   0   Parent=rna33198;ID=CDS:rna33198:3
chr11   Gnomon  gene    21571688    21575140    .   -   .   Name=LOC103939934;ID=gene39438;Name=gene39438
chr11   Gnomon  mRNA    21571688    21575140    .   -   .   ID=rna49862;Parent=gene39438
chr11   Gnomon  five_prime_UTR  21575032    21575140    .   -   .   ID=five_prime_UTR:rna49862:1;Parent=rna49862
chr11   Gnomon  five_prime_UTR  21574449    21574449    .   -   .   Parent=rna49862;ID=five_prime_UTR:rna49862:2
chr11   Gnomon  exon    21575032    21575140    .   -   .   ID=exon:rna49862:1;Parent=rna49862
chr11   Gnomon  exon    21574389    21574449    .   -   .   Parent=rna49862;ID=exon:rna49862:2
chr11   Gnomon  exon    21572908    21572989    .   -   .   ID=exon:rna49862:3;Parent=rna49862
chr11   Gnomon  exon    21572290    21572417    .   -   .   ID=exon:rna49862:4;Parent=rna49862
chr11   Gnomon  exon    21571688    21572198    .   -   .   ID=exon:rna49862:5;Parent=rna49862
chr11   Gnomon  start_codon 21574446    21574448    .   -   0   Parent=rna49862;ID=start_codon:rna49862:1
chr11   Gnomon  CDS 21574389    21574448    .   -   0   ID=CDS:rna49862:1;Parent=rna49862
chr11   Gnomon  CDS 21572908    21572989    .   -   0   Parent=rna49862;ID=CDS:rna49862:2
chr11   Gnomon  CDS 21572290    21572417    .   -   2   Parent=rna49862;ID=CDS:rna49862:3
chr11   Gnomon  CDS 21571866    21572198    .   -   0   Parent=rna49862;ID=CDS:rna49862:4
and I have these gene IDs:
LOC103966979
LOC103939934
and I want to extract their transcript info in this format:
chr11   Gnomon  mRNA    24482947    24484914    .   -   .   ID=LOC103966979
chr11   Gnomon  five_prime_UTR  24484810    24484914    .   -   .   ID=five_prime_UTR:rna33198:1;Parent=LOC103966979
chr11   Gnomon  start_codon 24484807    24484809    .   -   0   ID=start_codon:rna33198:1;Parent=LOC103966979
chr11   Gnomon  CDS 24484587    24484809    .   -   0   ID=CDS:rna33198:1;Parent=LOC103966979
chr11   Gnomon  CDS 24484138    24484445    .   -   2   ID=CDS:rna33198:2;Parent=LOC103966979
chr11   Gnomon  CDS 24483413    24483988    .   -   0   ID=CDS:rna33198:3;Parent=LOC103966979
chr11   Gnomon  mRNA    21571688    21575140    .   -   .   ID=LOC103939934
chr11   Gnomon  five_prime_UTR  21575032    21575140    .   -   .   ID=five_prime_UTR:rna49862:1;Parent=LOC103939934
chr11   Gnomon  five_prime_UTR  21574449    21574449    .   -   .   ID=five_prime_UTR:rna49862:2;Parent=LOC103939934
chr11   Gnomon  start_codon 21574446    21574448    .   -   0   ID=start_codon:rna49862:1;Parent=LOC103939934
chr11   Gnomon  CDS 21574389    21574448    .   -   0   ID=CDS:rna49862:1;Parent=LOC103939934
chr11   Gnomon  CDS 21572908    21572989    .   -   0   ID=CDS:rna49862:2;Parent=LOC103939934
chr11   Gnomon  CDS 21572290    21572417    .   -   2   ID=CDS:rna49862:3;Parent=LOC103939934
chr11   Gnomon  CDS 21571866    21572198    .   -   0   ID=CDS:rna49862:4;Parent=LOC103939934
Thanks for the advice.
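One possible way to get that output, sketched in awk. This is a rough sketch under assumptions, not a tested tool: the function name extract_transcripts and the file names ids.txt and in.gff are made up for illustration. It assumes the GFF is tab-separated, that each gene line comes before its mRNA line and each mRNA line before its child features, and that exon lines should be dropped because the desired output above does not contain them.

```shell
extract_transcripts() {
  # $1 = file with one gene name per line, $2 = tab-separated GFF file
  awk -F '\t' -v OFS='\t' '
    # Return the value of attribute "key" from a GFF column-9 string.
    function attr(s, key,    parts, n, i) {
      n = split(s, parts, ";")
      for (i = 1; i <= n; i++)
        if (index(parts[i], key "=") == 1)
          return substr(parts[i], length(key) + 2)
      return ""
    }
    NR == FNR { wanted[$1] = 1; next }   # first file: wanted gene names
    /^#/ { next }                        # skip GFF header and comment lines
    $3 == "gene" {
      # a gene line may carry several Name= attributes; check each one
      id = attr($9, "ID")
      n = split($9, parts, ";")
      for (i = 1; i <= n; i++)
        if (index(parts[i], "Name=") == 1 && (substr(parts[i], 6) in wanted))
          gene2name[id] = substr(parts[i], 6)
      next
    }
    $3 == "mRNA" {
      p = attr($9, "Parent")
      if (p in gene2name) {
        rna2name[attr($9, "ID")] = gene2name[p]
        $9 = "ID=" gene2name[p]          # mRNA takes the gene name as its ID
        print
      }
      next
    }
    $3 != "exon" {
      p = attr($9, "Parent")
      if (p in rna2name) {
        # keep the feature ID, point Parent at the gene name
        $9 = "ID=" attr($9, "ID") ";Parent=" rna2name[p]
        print
      }
    }
  ' "$1" "$2"
}
```

With the two names saved in ids.txt, extract_transcripts ids.txt in.gff > selected.gff would write the selected features. If a gene has more than one mRNA, every mRNA gets the same ID, which may or may not be what you want.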
Using gtf2bed:
$ gtf2bed < foo.gtf | cut -f1-3 > foo_coords.bed3
If you want strand information:
$ gtf2bed < foo.gtf | cut -f1-6 > foo_coords.bed6
Using awk and sqlite:
curl -sL "https://rseqflow.googlecode.com/files/mouse_refseq_anno.gtf"   |\
awk -F '\t' 'BEGIN {printf("create temp table T(chrom,start,end,gene); begin transaction;\n");} $3=="exon" {n=split($9,a,/[ ;]+/);for(i=1;i+1< n;i++) if(a[i]=="gene_id") printf("insert into T(chrom,start,end,gene) values (\"%s\",%s,%s,%s);\n",$1,$4,$5,a[i+1]);} END {printf("commit; select chrom,gene,min(start),max(end) from T group by chrom,gene;\n");}' |\
sqlite3 tmp.db
(...)
chrY|Rbm31y|12688110|17402718
chrY|Rbmy1a1|2830680|3783271
chrY|Sly|55213720|75222053
Please make it clearer by showing your input file and desired output.
Done, I cannot be more clear.