Question: Annotation lifting to a different organism
19 days ago by Ric (Australia)
Ric wrote:

Hi, I tried to lift over the TAIR10 annotation below with flo:

> head TAIR10_GFF3_genes.gff
Chr1    TAIR10  chromosome  1   30427671    .   .   .   ID=Chr1;Name=Chr1
Chr1    TAIR10  gene    3631    5899    .   +   .   ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10  mRNA    3631    5899    .   +   .   ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10  protein 3760    5630    .   +   .   ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR10  exon    3631    3913    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  five_prime_UTR  3631    3759    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3760    3913    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    3996    4276    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3996    4276    .   +   2   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    4486    4605    .   +   .   Parent=AT1G01010.1

Next, I removed the chromosome features:

> gff_remove_feats.rb chromosome TAIR10_GFF3_genes.gff > TAIR10_GFF3_genes-fix1.gff |head
Chr1    TAIR10  gene    3631    5899    .   +   .   ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10  mRNA    3631    5899    .   +   .   ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10  protein 3760    5630    .   +   .   ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR10  exon    3631    3913    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  five_prime_UTR  3631    3759    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3760    3913    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    3996    4276    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3996    4276    .   +   2   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    4486    4605    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 4486    4605    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;

While running flo I got:

> mkdir run/TAIR10_GFF3_genes-fix1
liftOver -gff /QRISdata/Q0231/flo/tair10/TAIR10_GFF3_genes-fix1.gff run/liftover.chn run/TAIR10_GFF3_genes-fix1/lifted.gff3 run/TAIR10_GFF3_genes-fix1/unlifted.gff3
Reading liftover chains
Mapping coordinates
WARNING: -gff is not recommended.
Use 'ldHgGene -out=<file.gp>' and then 'liftOver -genePred <file.gp>'
/QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1/lifted.gff3 2> unprocessed.gff | gt gff3 -tidy -sort -addids -retainids - > run/TAIR10_GFF3_genes-fix1/lifted_cleaned.gff
warning: line 1 in file "-" does not begin with "##gff-version" or "##gvf-version", create "##gff-version 3" line automatically
gt gff3: error: Parent "AT1G64130.1-Protein" on line 3 in file "-" was not defined (via "ID=")
rake aborted!

Additionally, I tried:

> /QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1/lifted.gff3 | head
NbV1Ch05    TAIR10  tRNA    45037019    45037091    .   +   .   ID=AT1G01890.1;Parent=AT1G01890;Name=AT1G01890.1;Index=1
NbV1Ch11    TAIR10  tRNA    93127111    93127183    .   -   .   ID=AT1G02480.1;Parent=AT1G02480;Name=AT1G02480.1;Index=1
NbV1Ch02    TAIR10  tRNA    81869336    81869407    .   +   .   ID=AT1G02600.1;Parent=AT1G02600;Name=AT1G02600.1;Index=1
NbV1Ch05    TAIR10  tRNA    97695952    97696024    .   +   .   ID=AT1G02760.1;Parent=AT1G02760;Name=AT1G02760.1;Index=1
NbV1Ch05    TAIR10  tRNA    9913146 9913217 .   -   .   ID=AT1G03515.1;Parent=AT1G03515;Name=AT1G03515.1;Index=1
NbV1Ch13    TAIR10  tRNA    170955340   170955411   .   -   .   ID=AT1G03570.1;Parent=AT1G03570;Name=AT1G03570.1;Index=1
NbV1Ch15    TAIR10  tRNA    91988482    91988554    .   +   .   ID=AT1G03640.1;Parent=AT1G03640;Name=AT1G03640.1;Index=1
NbV1Ch19    TAIR10  tRNA    17742781    17742849    .   +   .   ID=AT1G04320.1;Parent=AT1G04320;Name=AT1G04320.1;Index=1
NbV1Ch15    TAIR10  tRNA    50880103    50880176    .   +   .   ID=AT1G06480.1;Parent=AT1G06480;Name=AT1G06480.1;Index=1
NbV1Ch19    TAIR10  tRNA    5563896 5563968 .   +   .   ID=AT1G06610.1;Parent=AT1G06610;Name=AT1G06610.1;Index=1
...
NbV1Ch05    TAIR10  tRNA    48760991    48761061    .   +   .   ID=AT5G66817.1;Parent=AT5G66817;Name=AT5G66817.1;Index=1
NbV1Ch13    TAIR10  tRNA    39812638    39812709    .   +   .   ID=AT5G67455.1;Parent=AT5G67455;Name=AT5G67455.1;Index=1
NbV1Ch06    TAIR10  gene    24097055    24097111    .   +   .   ID=AT1G64130.1
NbV1Ch06    TAIR10  exon    24097055    24097111    .   +   .   Parent=AT1G64130.1
NbV1Ch06    TAIR10  CDS 24097055    24097111    .   +   0   Parent=AT1G64130.1,AT1G64130.1-Protein
NbV1Ch17    TAIR10  gene    18625243    18625301    .   -   .   ID=AT2G07768.1
NbV1Ch17    TAIR10  exon    18625243    18625301    .   -   .   Parent=AT2G07768.1
NbV1Ch17    TAIR10  CDS 18625243    18625301    .   -   0   Parent=AT2G07768.1,AT2G07768.1-Protein
NbV1Ch17    TAIR10  gene    70101151    70101187    .   -   .   ID=AT5G20570.1
NbV1Ch17    TAIR10  CDS 70101151    70101187    .   -   2   Parent=AT5G20570.1,AT5G20570.1-Protein
NbV1Ch17    TAIR10  gene    70101151    70101187    .   -   .   ID=AT5G20570.2
NbV1Ch17    TAIR10  CDS 70101151    70101187    .   -   2   Parent=AT5G20570.2,AT5G20570.2-Protein

What did I miss?

Thank you in advance,


What does the output of

/QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1/lifted.gff3 | head

say?

written 19 days ago by Philipp Bayer

I've just added the output to my question.

written 15 days ago by Ric

One problem is that TAIR10_GFF3_genes.gff does not seem to be a valid GFF3 file, as it is missing the "##gff-version 3" header line; see https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md . I'm not sure what to say about the error gt gff3: error: Parent "AT1G64130.1-Protein" on line 3 in file "-" was not defined (via "ID="). Where did TAIR10_GFF3_genes.gff come from?

written 15 days ago by jean.elbers
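
For what it's worth, the gt error can be reproduced outside of flo. In the gff_recover.rb output pasted in the question, the CDS line for AT1G64130.1 lists AT1G64130.1-Protein as a second parent, but no feature with that ID appears among the lines shown. Below is a minimal Python 3 sketch that reports such dangling Parent values. The function name undefined_parents is made up for this illustration, and the sketch assumes tab-separated GFF3 with nine columns.

```python
def undefined_parents(lines):
    """Return (line_number, parent_id) pairs for Parent values with no matching ID."""
    defined = set()
    references = []
    for lineno, line in enumerate(lines, start=1):
        # Skip header, comment and blank lines.
        if line.startswith('#') or not line.strip():
            continue
        columns = line.rstrip('\n').split('\t')
        if len(columns) != 9:
            continue  # not a GFF3 feature line
        for attribute in columns[8].split(';'):
            key, _, value = attribute.partition('=')
            if key == 'ID':
                defined.add(value)
            elif key == 'Parent':
                # A feature may list several parents separated by commas.
                references.extend((lineno, p) for p in value.split(',') if p)
    # IDs are collected over the whole input first, so feature order does not matter.
    return [(lineno, p) for lineno, p in references if p not in defined]


# Three lines taken from the gff_recover.rb output in the question,
# with the columns separated by tabs.
example = (
    "NbV1Ch06\tTAIR10\tgene\t24097055\t24097111\t.\t+\t.\tID=AT1G64130.1\n"
    "NbV1Ch06\tTAIR10\texon\t24097055\t24097111\t.\t+\t.\tParent=AT1G64130.1\n"
    "NbV1Ch06\tTAIR10\tCDS\t24097055\t24097111\t.\t+\t0\t"
    "Parent=AT1G64130.1,AT1G64130.1-Protein\n"
)

for lineno, parent in undefined_parents(example.splitlines(keepends=True)):
    print(f"line {lineno}: Parent {parent} has no matching ID")
```

On these three lines it flags AT1G64130.1-Protein on line 3, which is the same parent and line that gt complains about.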
15 days ago by Philipp Bayer (Australia/Perth/UWA)
Philipp Bayer wrote:

I may have a solution. I recently had a similar error with an annotation that contained many non-gene entries (snoRNAs, rRNAs, tRNAs, etc.), and those kept causing problems in flo. After removing them all with this simple Python 3 script, it worked fine (YMMV):

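# Feature types (GFF3 column 3) to drop, together with their direct children.
bad_ones = {'ncRNA', 'snoRNA', 'snRNA', 'pre_miRNA', 'lnc_RNA', 'tRNA', 'rRNA',
            'RNase_MRP', 'SRP_RNA', 'RNase_MRP_RNA', 'ncRNA_gene'}

bad_ids = set()  # IDs of removed features, so that their children can be skipped
with open('Transcripts.gff') as gff:
    for line in gff:
        # Pass header, comment and blank lines through unchanged.
        if line.startswith('#') or not line.strip():
            print(line, end='')
            continue
        # GFF3 columns are tab-separated; the attributes are the last column.
        ll = line.rstrip('\n').split('\t')
        thistype = ll[2]
        names = ll[-1].split(';')
        if thistype in bad_ones:
            # Strip a prefix such as "transcript:" from the ID, if present.
            badid = names[0].replace('ID=', '').split(':')[-1]
            bad_ids.add(badid)
            continue
        # Skip children (e.g. exons) whose parent was removed above.
        if 'Parent=' in names[0]:
            thisparent = names[0].replace('Parent=', '').split(':')[-1]
            if thisparent in bad_ids:
                continue
        print(line, end='')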
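
The script above only checks the first attribute for Parent=, so it relies on children listing their parent first. A possible variant is sketched below. It parses the whole attribute column, so Parent can appear anywhere and can hold several comma-separated values. The names parse_attributes and drop_features and the IDs in the example are made up for illustration. The sketch assumes tab-separated GFF3 in which parents appear before their children, and it drops a feature if any of its parents was dropped.

```python
def parse_attributes(column):
    """Parse a GFF3 attribute column such as 'ID=x;Parent=y,z' into a dict."""
    attributes = {}
    for item in column.strip().split(';'):
        key, sep, value = item.partition('=')
        if sep:
            attributes[key] = value
    return attributes


def drop_features(lines, bad_types):
    """Yield lines, skipping features of the given types and their descendants."""
    bad_ids = set()
    for line in lines:
        # Pass header, comment and blank lines through unchanged.
        if line.startswith('#') or not line.strip():
            yield line
            continue
        columns = line.rstrip('\n').split('\t')
        attributes = parse_attributes(columns[8])
        parents = attributes.get('Parent', '').split(',')
        if columns[2] in bad_types or any(p in bad_ids for p in parents):
            # Remember the ID so that descendants of this feature are dropped too.
            if 'ID' in attributes:
                bad_ids.add(attributes['ID'])
            continue
        yield line


# Hypothetical input: g1 carries a tRNA with one exon, g2 is an unrelated gene.
example = [
    "##gff-version 3\n",
    "Chr1\tTAIR10\tgene\t1\t100\t.\t+\t.\tID=g1;Name=g1\n",
    "Chr1\tTAIR10\ttRNA\t1\t100\t.\t+\t.\tID=g1.1;Parent=g1;Name=g1.1\n",
    "Chr1\tTAIR10\texon\t1\t100\t.\t+\t.\tParent=g1.1\n",
    "Chr1\tTAIR10\tgene\t200\t300\t.\t+\t.\tID=g2\n",
]
kept = list(drop_features(example, {'tRNA', 'rRNA'}))
print(''.join(kept), end='')
```

As with the original script, the gene feature that the tRNA hangs off (g1 here) is kept, since its own type is gene. Whether that is what flo needs is untested.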
