Extract transcript ID and gene ID from ITAG4.1_gene_models.gff
11 months ago
Tian • 0

Hello all, I was hoping to extract the transcript ID and corresponding gene ID from ITAG4.1_gene_models.gff (downloaded from https://solgenomics.net/ftp/genomes/Solanum_lycopersicum/annotation/ITAG4.1_release/) using R. I have tried different methods:

First method:

List <- tr2g_gff3(file = directory, write_tr2g = FALSE, get_transcriptome = FALSE, save_filtered_gff = FALSE) 

Error in check_tag_present(c(transcript_id, gene_id), tags, error = TRUE) : 
  Tags transcript_id, gene_id are absent from the attribute field.

Second method:

gr <- read_gff("ITAG4.1_gene_models.gff") %>% select(gene_id, gene_name, transcript_id)

Error in select_rng(.data, .drop_ranges, ...) : 
  Cannot select/rename the following columns: seqnames, start, end, width, strand

Can anyone please help? Thank you very much! Tian

Genome Tomato • 627 views
11 months ago
Meisam ▴ 230

GFF files do not have gene_id or gene_name columns that you can subset with dplyr::select(), so always check your data format first. In your particular case, the GFF file looks like this:

 head(ITAG4.1)
      seqid source type start   end score strand phase                                                                                                                                                                                                attributes
1 SL4.0ch00  maker gene 83863 84177    NA      +  <NA>                                                                                                                                                            ID=gene:Solyc00g160260.1;Name=Solyc00g160260.1
2 SL4.0ch00  maker mRNA 83863 84177    NA      +  <NA> ID=mRNA:Solyc00g160260.1.1;Parent=gene:Solyc00g160260.1;Name=Solyc00g160260.1.1;_aed=0.30;_eaed=0.59;_qi=0|0|0|1|0|0|2|0|100;Note=Homeobox leucine-zipper protein (AHRD V3.11 *-* tr|Q8H963|Q8H963_ZINVI)
3 SL4.0ch00  maker exon 83863 84043    NA      +  <NA>                                                                                                                                               ID=exon:Solyc00g160260.1.1.1;Parent=mRNA:Solyc00g160260.1.1
4 SL4.0ch00  maker  CDS 83863 84043    NA      +     0                                                                                                                                                ID=CDS:Solyc00g160260.1.1.1;Parent=mRNA:Solyc00g160260.1.1
5 SL4.0ch00  maker exon 84056 84177    NA      +  <NA>                                                                                                                                               ID=exon:Solyc00g160260.1.1.2;Parent=mRNA:Solyc00g160260.1.1
6 SL4.0ch00  maker  CDS 84056 84177    NA      +     2                                                                                                                                                ID=CDS:Solyc00g160260.1.1.2;Parent=mRNA:Solyc00g160260.1.1

So both your gene and transcript ID information is stored in the last column, "attributes": on an mRNA row, the transcript ID is the ID= value and the gene ID is the Parent= value. I am not sure what you mean by "extract" in this case. Do you just want a list of gene names or transcript names, for example? If so, first filter the rows for genes or transcripts (using the "type" column), then use a regular expression to pull the attribute you are interested in out of the "attributes" string.
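The filter-then-regex approach described above could look roughly like the sketch below, in base R only. It assumes the file follows the layout shown in the head() output (nine tab-separated columns, mRNA rows carrying ID=mRNA:... and Parent=gene:... in the attributes). The function name tx2gene_from_gff is a hypothetical helper made up for illustration, not part of any package.

```r
# Hypothetical helper (not from any package): build a transcript-to-gene
# table from a GFF3 file laid out like the ITAG4.1 rows shown above.
tx2gene_from_gff <- function(path) {
  lines <- readLines(path)
  # Drop header/directive lines such as "##gff-version 3"
  lines <- lines[!startsWith(lines, "#")]

  # Split each record on tabs; a valid GFF3 feature line has 9 fields
  fields <- strsplit(lines, "\t", fixed = TRUE)

  # Keep only the transcript rows (column 3 is the feature type)
  is_mrna <- vapply(fields,
                    function(f) length(f) == 9 && f[3] == "mRNA",
                    logical(1))

  # Column 9 holds the attributes; prepend ";" so every key is preceded
  # by ";" regardless of its position in the string
  attrs <- paste0(";", vapply(fields[is_mrna], `[`, character(1), 9))

  # Assumption: the "mRNA:" and "gene:" prefixes are present, as in the
  # rows above. If a pattern does not match, sub() returns the input
  # string unchanged, so spot-check the result.
  data.frame(
    transcript_id = sub(".*;ID=mRNA:([^;]+).*", "\\1", attrs),
    gene_id       = sub(".*;Parent=gene:([^;]+).*", "\\1", attrs),
    stringsAsFactors = FALSE
  )
}
```

It would be called as tx2gene <- tx2gene_from_gff("ITAG4.1_gene_models.gff"), which should give one row per mRNA feature. Splitting the lines by hand rather than using read.delim() is a deliberate choice here, since the Note= values contain characters such as | and * that can trip up quote or comment handling.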
