Question

Extract transcript ID and gene ID from ITAG4.1_gene_models.gff

11 months ago

Tian • 0

Hello all, I was hoping to extract the transcript ID and corresponding gene ID from ITAG4.1_gene_models.gff (downloaded from https://solgenomics.net/ftp/genomes/Solanum_lycopersicum/annotation/ITAG4.1_release/) using R. I have tried different methods:

First method:

List <- tr2g_gff3(file = directory, write_tr2g = FALSE, get_transcriptome = FALSE, save_filtered_gff = FALSE) 

Error in check_tag_present(c(transcript_id, gene_id), tags, error = TRUE) : 
  Tags transcript_id, gene_id are absent from the attribute field.

Second method:

gr <- read_gff("ITAG4.1_gene_models.gff") %>% select(gene_id, gene_name, transcript_id)

Error in select_rng(.data, .drop_ranges, ...) : 
  Cannot select/rename the following columns: seqnames, start, end, width, strand

Can anyone please help? Thank you very much! Tian

Genome Tomato • 627 views

updated 11 months ago by GenoMax 141k • written 11 months ago by Tian • 0

score 2 · Accepted Answer · 2023-05-19

GFF files do not have gene_id or gene_name columns that you can subset with dplyr::select(); always check your data format first. In your case, the GFF file looks like this:

 head(ITAG4.1)
      seqid source type start   end score strand phase                                                                                                                                                                                                attributes
1 SL4.0ch00  maker gene 83863 84177    NA      +  <NA>                                                                                                                                                            ID=gene:Solyc00g160260.1;Name=Solyc00g160260.1
2 SL4.0ch00  maker mRNA 83863 84177    NA      +  <NA> ID=mRNA:Solyc00g160260.1.1;Parent=gene:Solyc00g160260.1;Name=Solyc00g160260.1.1;_aed=0.30;_eaed=0.59;_qi=0|0|0|1|0|0|2|0|100;Note=Homeobox leucine-zipper protein (AHRD V3.11 *-* tr|Q8H963|Q8H963_ZINVI)
3 SL4.0ch00  maker exon 83863 84043    NA      +  <NA>                                                                                                                                               ID=exon:Solyc00g160260.1.1.1;Parent=mRNA:Solyc00g160260.1.1
4 SL4.0ch00  maker  CDS 83863 84043    NA      +     0                                                                                                                                                ID=CDS:Solyc00g160260.1.1.1;Parent=mRNA:Solyc00g160260.1.1
5 SL4.0ch00  maker exon 84056 84177    NA      +  <NA>                                                                                                                                               ID=exon:Solyc00g160260.1.1.2;Parent=mRNA:Solyc00g160260.1.1
6 SL4.0ch00  maker  CDS 84056 84177    NA      +     2                                                                                                                                                ID=CDS:Solyc00g160260.1.1.2;Parent=mRNA:Solyc00g160260.1.1

So both the gene and the transcript ID information are stored in the last column, "attributes". In the mRNA rows above, ID holds the transcript ID and Parent holds the ID of the gene it belongs to. I am not sure what you mean by "extract" in this case: do you just want a list of gene names or transcript names, for example? If so, first filter the rows for genes or transcripts (using the "type" column), then use a regular expression to pull the attribute you are interested in out of the "attributes" string.
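The filter-then-regex approach could look roughly like the base R sketch below. It is only an illustration under some assumptions: the helper get_attr() and the output name tx2gene are made up for this example, the sample rows are shortened copies of the listing above, and stripping the "mRNA:" and "gene:" prefixes is a guess at the ID format you want.

```r
# Sample rows copied (and shortened) from the listing above; GFF3 is tab-separated.
# For the real file, use: gff_lines <- readLines("ITAG4.1_gene_models.gff")
gff_lines <- c(
  "##gff-version 3",
  "SL4.0ch00\tmaker\tgene\t83863\t84177\t.\t+\t.\tID=gene:Solyc00g160260.1;Name=Solyc00g160260.1",
  "SL4.0ch00\tmaker\tmRNA\t83863\t84177\t.\t+\t.\tID=mRNA:Solyc00g160260.1.1;Parent=gene:Solyc00g160260.1;Name=Solyc00g160260.1.1",
  "SL4.0ch00\tmaker\texon\t83863\t84043\t.\t+\t.\tID=exon:Solyc00g160260.1.1.1;Parent=mRNA:Solyc00g160260.1.1"
)

# Drop header/comment lines, then parse the nine standard GFF3 columns.
gff_lines <- gff_lines[!startsWith(gff_lines, "#")]
gff <- read.delim(
  text = gff_lines, header = FALSE, sep = "\t", quote = "",
  comment.char = "", stringsAsFactors = FALSE,
  col.names = c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")
)

# Hypothetical helper: return the value of one key from "key=value;key=value".
# Gives NA when the key is absent from a row.
get_attr <- function(attributes, key) {
  pattern <- paste0("(^|;)", key, "=([^;]*)")
  matches <- regmatches(attributes, regexec(pattern, attributes))
  vapply(matches,
         function(m) if (length(m) == 3) m[3] else NA_character_,
         character(1))
}

# Step 1: keep only the transcript rows.
mrna <- gff[gff$type == "mRNA", ]

# Step 2: ID is the transcript, Parent is its gene; strip the type prefixes
# (an assumption about the desired ID format).
tx2gene <- data.frame(
  transcript_id = sub("^mRNA:", "", get_attr(mrna$attributes, "ID")),
  gene_id       = sub("^gene:", "", get_attr(mrna$attributes, "Parent")),
  stringsAsFactors = FALSE
)

print(tx2gene)
```

On the sample rows this gives one row pairing Solyc00g160260.1.1 with Solyc00g160260.1. If you prefer to keep the IDs exactly as they appear in the file, drop the two sub() calls.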