Question

Intersect gene IDs list and GFF3 to get the annotation.

0

Entering edit mode

6 weeks ago

Lissa Cruz Saavedra • 0

Hi

I need to get the annotation for a list of IDs (Just one column). It looks like this:

GeneID
Tt.00g000720
Tt.00g000730
Tt.00g000780
Tt.00g000850
Tt.00g000950
Tt.00g001260
Tt.00g001550

I have a gff for those genes IDs like this:

contig_101  GenSAS_65ea36c7c3784-publish    gene    319 1590    .   +   .   ID=Tt.00g000010;Name=Tt.00g000010;original_ID=75803-Tt.00g000010;Alias=75803-Tt.00g000010;original_name=75803-Tt.00g000010;Notes=XP_028877667.1 [BLAST protein vs protein (blastp) 2.12.0],XP_009314224.1 [DIAMOND Functional 2.0.11],L-fucokinase (IPR012887) [InterProScan 5.53-87.0],PF07959.7 [Pfam 1.6]
contig_101  GenSAS_65ea36c7c3784-publish    mRNA    319 1590    .   +   .   ID=Tt.00g000010.m01;Name=Tt.00g000010.m01;Parent=Tt.00g000010;original_ID=75803-Tt.00g000010.m01;Alias=75803-Tt.00g000010.m01;original_name=75803-Tt.00g000010
contig_101  GenSAS_65ea36c7c3784-publish    exon    319 1590    .   +   .   ID=Tt.00g000010.m01.exon01;Name=Tt.00g000010.m01.exon01;Parent=Tt.00g000010.m01;original_ID=75803-Tt.00g000010.m01.exon1;Alias=75803-Tt.00g000010.m01.exon1
contig_101  GenSAS_65ea36c7c3784-publish    CDS 319 1590    .   +   0   ID=Tt.00g000010.m01.CDS01;Name=Tt.00g000010.m01.CDS01;Parent=Tt.00g000010.m01;original_ID=cds.75803-Tt.00g000010.m01;Alias=cds.75803-Tt.00g000010.m01
contig_101  GenSAS_65ea36c7c3784-publish    gene    1726    3468    .   +   .   ID=Tt.00g000020;Name=Tt.00g000020;original_ID=75803-Tt.00g000020;Alias=75803-Tt.00g000020;original_name=75803-Tt.00g000020;Notes=XP_028877667.1 [BLAST protein vs protein (blastp) 2.12.0],XP_028877667.1 [DIAMOND Functional 2.0.11],Galactokinase/homoserine kinase (IPR001174) [InterProScan 5.53-87.0],PF00288.21 [Pfam 1.6]
contig_101  GenSAS_65ea36c7c3784-publish    mRNA    1726    3468    .   +   .   ID=Tt.00g000020.m01;Name=Tt.00g000020.m01;Parent=Tt.00g000020;original_ID=75803-Tt.00g000020.m01;Alias=75803-Tt.00g000020.m01;original_name=75803-Tt.00g000020
contig_101  GenSAS_65ea36c7c3784-publish    exon    1726    3468    .   +   .   ID=Tt.00g000020.m01.exon01;Name=Tt.00g000020.m01.exon01;Parent=Tt.00g000020.m01;original_ID=75803-Tt.00g000020.m01.exon1;Alias=75803-Tt.00g000020.m01.exon1
contig_101  GenSAS_65ea36c7c3784-publish    CDS 1726    3468    .   +   0   ID=Tt.00g000020.m01.CDS01;Name=Tt.00g000020.m01.CDS01;Parent=Tt.00g000020.m01;original_ID=cds.75803-Tt.00g000020.m01;Alias=cds.75803-Tt.00g000020.m01
contig_101  GenSAS_65ea36c7c3784-publish    gene    4054    4560    .   +   .   ID=Tt.00g000030;Name=Tt.00g000030;original_ID=75803-Tt.00g000030;Alias=75803-Tt.00g000030;original_name=75803-Tt.00g000030;Notes=XP_803186.1 [BLAST protein vs protein (blastp) 2.12.0],XP_803186.1 [DIAMOND Functional 2.0.11]
contig_101  GenSAS_65ea36c7c3784-publish    mRNA    4054    4560    .   +   .   ID=Tt.00g000030.m01;Name=Tt.00g000030.m01;Parent=Tt.00g000030;original_ID=75803-Tt.00g000030.m01;Alias=75803-Tt.00g000030.m01;original_name=75803-Tt.00g000030
contig_101  GenSAS_65ea36c7c3784-publish    exon    4054    4560    .   +   .   ID=Tt.00g000030.m01.exon01;Name=Tt.00g000030.m01.exon01;Parent=Tt.00g000030.m01;original_ID=75803-Tt.00g000030.m01.exon1;Alias=75803-Tt.00g000030.m01.exon1
contig_101  GenSAS_65ea36c7c3784-publish    CDS 4054    4560    .   +   0   ID=Tt.00g000030.m01.CDS01;Name=Tt.00g000030.m01.CDS01;Parent=Tt.00g000030.m01;original_ID=cds.75803-Tt.00g000030.m01;Alias=cds.75803-Tt.00g000030.m01
contig_101  GenSAS_65ea36c7c3784-publish    gene    5050    6858    .   +   .   ID=Tt.00g000040;Name=Tt.00g000040;original_ID=75803-Tt.00g000040;Alias=75803-Tt.00g000040;original_name=75803-Tt.00g000040;Notes=XP_803185.1 [BLAST protein vs protein (blastp) 2.12.0],XP_807877.1 [DIAMOND Functional 2.0.11],Pyridoxal phosphate-dependent decarboxylase (IPR002129) [InterProScan 5.53-87.0],PF00282.14 [Pfam 1.6]
contig_101  GenSAS_65ea36c7c3784-publish    mRNA    5050    6858    .   +   .   ID=Tt.00g000040.m01;Name=Tt.00g000040.m01;Parent=Tt.00g000040;original_ID=75803-Tt.00g000040.m01;Alias=75803-Tt.00g000040.m01;original_name=75803-Tt.00g000040
contig_101  GenSAS_65ea36c7c3784-publish    exon    5050    6858    .   +   .   ID=Tt.00g000040.m01.exon01;Name=Tt.00g000040.m01.exon01;Parent=Tt.00g000040.m01;original_ID=75803-Tt.00g000040.m01.exon1;Alias=75803-Tt.00g000040.m01.exon1
contig_101  GenSAS_65ea36c7c3784-publish    CDS 5050    6858    .   +   0   ID=Tt.00g000040.m01.CDS01;Name=Tt.00g000040.m01.CDS01;Parent=Tt.00g000040.m01;original_ID=cds.75803-Tt.00g000040.m01;Alias=cds.75803-Tt.00g000040.m01

How can I intersect the IDs with the GFF file to get the annotation for those IDs?. I have tried it using bedtools intersect but it didn't work because I don't have the annotation for those IDs.

Thank you very much for you help!

annotation GFF • 742 views

ADD COMMENT • link 5 weeks ago by Lissa Cruz Saavedra • 0

1

Entering edit mode

6 weeks ago

Juke34 8.5k

With AGAT you can use agat_sp_filter_feature_from_keep_list.pl you can filter the records related to those IDs:

agat_sp_filter_feature_from_keep_list.pl --gff infile.gff --keep_list file.txt  [ --output outfile ]

Where your file.txt looks like that:

Tt.00g000720
Tt.00g000730
Tt.00g000780
Tt.00g000850
Tt.00g000950
Tt.00g001260
Tt.00g001550

Then I guess agat_convert_sp_gff2tsv.pl should give you a way to explore the annotations

ADD COMMENT • link 6 weeks ago by Juke34 8.5k

0

Entering edit mode

6 weeks ago

b.contreras.moreira ▴ 170

You can also try gffread, the usage instructions below suggest you need parameter --ids :

gffread v0.12.7. Usage:
gffread [-g <genomic_seqs_fasta> | <dir>] [-s <seq_info.fsize>] 
 [-o <outfile>] [-t <trackname>] [-r [<strand>]<chr>:<start>-<end> [-R]]
 [--jmatch <chr>:<start>-<end>] [--no-pseudo] 
 [-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]
 [-j ][--ids <IDs.lst> | --nids <IDs.lst>] [--attrs <attr-list>] [-i <maxintron>]
 [--stream] [--bed | --gtf | --tlf] [--table <attrlist>] [--sort-by <ref.lst>]
 [<input_gff>] 

 Filter, convert or cluster GFF/GTF/BED records, extract the sequence of
 transcripts (exon or CDS) and more.
 By default (i.e. without -O) only transcripts are processed, discarding any
 other non-transcript features. Default output is a simplified GFF3 with only
 the basic attributes.

Options:
 --ids discard records/transcripts if their IDs are not listed in <IDs.lst>
 --nids discard records/transcripts if their IDs are listed in <IDs.lst>
 ...

ADD COMMENT • link 6 weeks ago by b.contreras.moreira ▴ 170

score 2 · Accepted Answer · 2024-03-13

2

Entering edit mode

6 weeks ago

Ram 43k

One simple way would be to use grep with a tiny bit of input modification:

grep -Ff <(tail -n +2 id_list.txt | sed -Ee 's/(.*)/ID=\1;/') file.gff

ADD COMMENT • link 6 weeks ago by Ram 43k

0

Entering edit mode

Hey

Thanks for your answer. I tried it but I got an empty output. Could I make something wrong ?

ADD REPLY • link 5 weeks ago by Lissa Cruz Saavedra • 0

0

Entering edit mode

Does your ID list overlap the ID= part of column 9 or a different part? Please provide a better example as there is no overlap between your ID list and any GFF entry.

What is the output to:

grep -m10 Tt.00g000720 file.gff

ADD REPLY • link 5 weeks ago by Ram 43k

0

Entering edit mode

Hi, This is the result This is the results

ADD REPLY • link 5 weeks ago by Lissa Cruz Saavedra • 0

0

Entering edit mode

You should not get empty output. Are you sure you're running the command right? What is the output to:

head id_list.txt | od -c

ADD REPLY • link 5 weeks ago by Ram 43k

0

Entering edit mode

It seems like this

enter image description here

ADD REPLY • link 5 weeks ago by Lissa Cruz Saavedra • 0

0

Entering edit mode

That file was created on Windows, correct? Please run dos2unix on it to make it a proper Linux file, then try the grep from the answer.

ADD REPLY • link 5 weeks ago by Ram 43k

0

Entering edit mode

You were totally right, I change the format of txt created in windows and It worked. Thank you very much for your help!

ADD REPLY • link 5 weeks ago by Lissa Cruz Saavedra • 0