Annotation lifting to a different organism
2
0
Entering edit mode
4.7 years ago
Ric ▴ 430

Hi, I tried to lift the below TAIR10 annotation with flo:

> head TAIR10_GFF3_genes.gff
Chr1    TAIR10  chromosome  1   30427671    .   .   .   ID=Chr1;Name=Chr1
Chr1    TAIR10  gene    3631    5899    .   +   .   ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10  mRNA    3631    5899    .   +   .   ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10  protein 3760    5630    .   +   .   ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR10  exon    3631    3913    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  five_prime_UTR  3631    3759    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3760    3913    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    3996    4276    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3996    4276    .   +   2   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    4486    4605    .   +   .   Parent=AT1G01010.1

Next, I did

> gff_remove_feats.rb chromosome TAIR10_GFF3_genes.gff > TAIR10_GFF3_genes-fix1.gff |head
Chr1    TAIR10  gene    3631    5899    .   +   .   ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10  mRNA    3631    5899    .   +   .   ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10  protein 3760    5630    .   +   .   ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR10  exon    3631    3913    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  five_prime_UTR  3631    3759    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3760    3913    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    3996    4276    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 3996    4276    .   +   2   Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR10  exon    4486    4605    .   +   .   Parent=AT1G01010.1
Chr1    TAIR10  CDS 4486    4605    .   +   0   Parent=AT1G01010.1,AT1G01010.1-Protein;

While running flo I got:

> mkdir run/TAIR10_GFF3_genes-fix1
liftOver -gff /QRISdata/Q0231/flo/tair10/TAIR10_GFF3_genes-fix1.gff run/liftover.chn run/TAIR10_GFF3_genes-fix1/lifted.gff3 run/TAIR10_GFF3_genes-fix1/unlifted.gff3
Reading liftover chains
Mapping coordinates
WARNING: -gff is not recommended.
Use 'ldHgGene -out=<file.gp>' and then 'liftOver -genePred <file.gp>'
/QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1/lifted.gff3 2> unprocessed.gff | gt gff3 -tidy -sort -addids -retainids - > run/TAIR10_GFF3_genes-fix1/lifted_cleaned.gff
warning: line 1 in file "-" does not begin with "##gff-version" or "##gvf-version", create "##gff-version 3" line automatically
gt gff3: error: Parent "AT1G64130.1-Protein" on line 3 in file "-" was not defined (via "ID=")
rake aborted!

Additionally, I tried:

> /QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1/lifted.gff3 | head
NbV1Ch05    TAIR10  tRNA    45037019    45037091    .   +   .   ID=AT1G01890.1;Parent=AT1G01890;Name=AT1G01890.1;Index=1
NbV1Ch11    TAIR10  tRNA    93127111    93127183    .   -   .   ID=AT1G02480.1;Parent=AT1G02480;Name=AT1G02480.1;Index=1
NbV1Ch02    TAIR10  tRNA    81869336    81869407    .   +   .   ID=AT1G02600.1;Parent=AT1G02600;Name=AT1G02600.1;Index=1
NbV1Ch05    TAIR10  tRNA    97695952    97696024    .   +   .   ID=AT1G02760.1;Parent=AT1G02760;Name=AT1G02760.1;Index=1
NbV1Ch05    TAIR10  tRNA    9913146 9913217 .   -   .   ID=AT1G03515.1;Parent=AT1G03515;Name=AT1G03515.1;Index=1
NbV1Ch13    TAIR10  tRNA    170955340   170955411   .   -   .   ID=AT1G03570.1;Parent=AT1G03570;Name=AT1G03570.1;Index=1
NbV1Ch15    TAIR10  tRNA    91988482    91988554    .   +   .   ID=AT1G03640.1;Parent=AT1G03640;Name=AT1G03640.1;Index=1
NbV1Ch19    TAIR10  tRNA    17742781    17742849    .   +   .   ID=AT1G04320.1;Parent=AT1G04320;Name=AT1G04320.1;Index=1
NbV1Ch15    TAIR10  tRNA    50880103    50880176    .   +   .   ID=AT1G06480.1;Parent=AT1G06480;Name=AT1G06480.1;Index=1
NbV1Ch19    TAIR10  tRNA    5563896 5563968 .   +   .   ID=AT1G06610.1;Parent=AT1G06610;Name=AT1G06610.1;Index=1
...
NbV1Ch05    TAIR10  tRNA    48760991    48761061    .   +   .   ID=AT5G66817.1;Parent=AT5G66817;Name=AT5G66817.1;Index=1
NbV1Ch13    TAIR10  tRNA    39812638    39812709    .   +   .   ID=AT5G67455.1;Parent=AT5G67455;Name=AT5G67455.1;Index=1
NbV1Ch06    TAIR10  gene    24097055    24097111    .   +   .   ID=AT1G64130.1
NbV1Ch06    TAIR10  exon    24097055    24097111    .   +   .   Parent=AT1G64130.1
NbV1Ch06    TAIR10  CDS 24097055    24097111    .   +   0   Parent=AT1G64130.1,AT1G64130.1-Protein
NbV1Ch17    TAIR10  gene    18625243    18625301    .   -   .   ID=AT2G07768.1
NbV1Ch17    TAIR10  exon    18625243    18625301    .   -   .   Parent=AT2G07768.1
NbV1Ch17    TAIR10  CDS 18625243    18625301    .   -   0   Parent=AT2G07768.1,AT2G07768.1-Protein
NbV1Ch17    TAIR10  gene    70101151    70101187    .   -   .   ID=AT5G20570.1
NbV1Ch17    TAIR10  CDS 70101151    70101187    .   -   2   Parent=AT5G20570.1,AT5G20570.1-Protein
NbV1Ch17    TAIR10  gene    70101151    70101187    .   -   .   ID=AT5G20570.2
NbV1Ch17    TAIR10  CDS 70101151    70101187    .   -   2   Parent=AT5G20570.2,AT5G20570.2-Protein

UPDATE I used this Python script provided by Philipp Bayer below:

python3 rm_rubbish.py > TAIR10_GFF3_genes-fix1.rm_rubbish.gff

and rerun flo again but I got this error:

warning: line 1 in file "-" does not begin with "##gff-version" or "##gvf-version", create "##gff-version 3" line automatically
gt gff3: error: Parent "AT1G64130.1-Protein" on line 3 in file "-" was not defined (via "ID=")
rake aborted!
Command failed with status (1): [/QRISdata/Q0231/apps/flo/gff_recover.rb ru...]
/QRISdata/Q0231/apps/flo/Rakefile:60:in `block (2 levels) in <top (required)>'
/QRISdata/Q0231/apps/flo/Rakefile:40:in `each'
/QRISdata/Q0231/apps/flo/Rakefile:40:in `block in <top (required)>'
Tasks: TOP => default
(See full trace by running task with --trace)

Additionally, I did:

/QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1.rm_rubbish/lifted.gff3 | head
NbV1Ch03    TAIR10  three_prime_UTR 74669255    74669261    .   -   .   Parent=AT3G06895.1
NbV1Ch02    TAIR10  five_prime_UTR  136562516   136562579   .   -   .   Parent=AT3G13857.1
NbV1Ch06    TAIR10  gene    24097055    24097111    .   +   .   ID=AT1G64130.1
NbV1Ch06    TAIR10  exon    24097055    24097111    .   +   .   Parent=AT1G64130.1
NbV1Ch06    TAIR10  CDS 24097055    24097111    .   +   0   Parent=AT1G64130.1,AT1G64130.1-Protein
NbV1Ch17    TAIR10  gene    18625243    18625301    .   -   .   ID=AT2G07768.1
NbV1Ch17    TAIR10  exon    18625243    18625301    .   -   .   Parent=AT2G07768.1
NbV1Ch17    TAIR10  CDS 18625243    18625301    .   -   0   Parent=AT2G07768.1,AT2G07768.1-Protein
NbV1Ch17    TAIR10  gene    70101151    70101187    .   -   .   ID=AT5G20570.1
NbV1Ch17    TAIR10  CDS 70101151    70101187    .   -   2   Parent=AT5G20570.1,AT5G20570.1-Protein
NbV1Ch17    TAIR10  gene    70101151    70101187    .   -   .   ID=AT5G20570.2
NbV1Ch17    TAIR10  CDS 70101151    70101187    .   -   2   Parent=AT5G20570.2,AT5G20570.2-Protein

What did I miss?

Thank you in advance,

gene assembly • 2.3k views
ADD COMMENT
0
Entering edit mode

What does the output of

/QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1/lifted.gff3 | head

say?

ADD REPLY
0
Entering edit mode

I've just added the output into my question.

ADD REPLY
1
Entering edit mode

One error is that TAIR10_GFF3_genes.gff does not seem to be a valid GFF3 file as it is missing the header, see https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md . Not sure what to say about the error gt gff3: error: Parent "AT1G64130.1-Protein" on line 3 in file "-" was not defined (via "ID="). Where did TAIR10_GFF3_genes.gff come from?

ADD REPLY
0
Entering edit mode

The header missing is not an error here, it's just a warning, the tool add one automatically when it is missing.

ADD REPLY
1
Entering edit mode
4.7 years ago

I may have a solution - I've had a similar error recently with an annotation that had many non-gene entries, i.e., snoRNAs, rRNAs, tRNAs etc. and those kept on causing problems in flo. After removing them all with this simple Python 3 script it worked fine (YMMV)

bad_ones = set(['nCRNA','snoRNA','snRNA','pre_miRNA','lnc_RNA','tRNA','rRNA','RNase_MRP','SRP_RNA', 'RNase_MRP_RNA','ncRNA_gene'])

bad_ids = set()
for line in open('Transcripts.gff'):
    if line.startswith('#'):
        print(line, end='')
        continue
    ll = line.split()
    thistype = ll[2]
    names = ll[-1].split(';')
    if thistype in bad_ones:
        badid = names[0].replace('ID=','').split(':')[-1]
        bad_ids.add(badid)
        continue
    if 'Parent=' in names[0]:
        thisparent = names[0].replace('Parent=','').split(':')[-1]
        if thisparent in bad_ids:
            continue
    print(line, end='')

ADD COMMENT
0
Entering edit mode

Thank you but now I got this error gt gff3: error: Parent "AT1G64130.1-Protein" on line 3 in file "-" was not defined (via "ID="). I updated my question above. How to fix it?

ADD REPLY
0
Entering edit mode

The TAIR annotation has the unusual feature that it has "gene" lines with a "Parent" being a protein which is not defined in the file. I recommend filtering Parent= attributes from column 9 for the "gene" lines

ADD REPLY
0
Entering edit mode

How is it possible to filter it?

ADD REPLY
1
Entering edit mode
4.7 years ago
Juke34 8.5k

The error seems coming from the feature protein because it does not contain any Parent attribute. Indeed it contains instead the attribute Derives_from that the tool cannot deals with. I guess you don't need these feature anyway, because uou can retrieve your proteins from CDS. Just remove those features and I guess you will not have anymore the error:

awk '{if($3 != "Protein") print $0}' file.gff > file_clean.gff

ADD COMMENT
0
Entering edit mode

Thank you. However, I still got gt gff3: error: Parent "AT1G64130.1-Protein" on line 3 in file "-". Next, I didawk '{if($3 != "CDS") print $0}' file_clean.gff > file_clean2.gffwas not defined (via "ID="). Unfortunately, I got this error:

gt gff3: error: line 1 in file "run/TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein_rm_CDS/input.gff" does not begin with "##gff-version" or "##gff-version"
gt extractfeat: error: GFF3 file "-" is empty
file_utils.rb:66:in `block in create_shell_runner': Command failed with status (1): [gt gff3 -sort -retainids run/TAIR10_GFF3_g...] (RuntimeError)
ADD REPLY
0
Entering edit mode

Wihtout the CDS I'm not sure your liftover will succeed. Anyway, the problem after having removed the Protein feature is that the CDS contain in the Parent attribute the link to the ID of the removed Protein feature. So you should fine a way to remove those with e.g a sed command:

From:

 NbV1Ch17    TAIR10  CDS 18625243    18625301    .   -   0   Parent=AT2G07768.1,AT2G07768.1-Protein

you need to have:

NbV1Ch17    TAIR10  CDS 18625243    18625301    .   -   0   Parent=AT2G07768.1
ADD REPLY
0
Entering edit mode

I used the following additionally steps:

> sed 's|,AT.*-Protein;||g' TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein.gff > TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein_rm_id.gff
> echo "##gff-version 3" > TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein_rm_id_rm_index.gff
> sed 's|;Index=1||g' TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein_rm_id.gff >> TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein_rm_id_rm_index.gff
> awk '{if($1 != "ChrC" && $1 != "ChrM") print $0}' TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein_rm_id_rm_index.gff > TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein_rm_id_rm_index_rm_ChrC_rm_ChrM.gff

> mkdir split-fa
> faSplit byname TAIR10_genome.fas split-fa/
> cat split-fa/Chr*.fa > split-fa/TAIR10_genome-chr1-5.fas

However, I only got 4 genes lifted.

##gff-version   3
##sequence-region   NbV1Ch06 24097055 24097111
##sequence-region   NbV1Ch17 18625243 70101187
NbV1Ch06        TAIR10  gene    24097055        24097111        .       +       .       ID=AT1G64130.1
NbV1Ch06        TAIR10  exon    24097055        24097111        .       +       .       Parent=AT1G64130.1
NbV1Ch06        TAIR10  CDS     24097055        24097111        .       +       0       Parent=AT1G64130.1
###
NbV1Ch17        TAIR10  gene    18625243        18625301        .       -       .       ID=AT2G07768.1
NbV1Ch17        TAIR10  exon    18625243        18625301        .       -       .       Parent=AT2G07768.1
NbV1Ch17        TAIR10  CDS     18625243        18625301        .       -       0       Parent=AT2G07768.1
###
NbV1Ch17        TAIR10  gene    70101151        70101187        .       -       .       ID=AT5G20570.1
NbV1Ch17        TAIR10  CDS     70101151        70101187        .       -       2       Parent=AT5G20570.1
###
NbV1Ch17        TAIR10  gene    70101151        70101187        .       -       .       ID=AT5G20570.2
NbV1Ch17        TAIR10  CDS     70101151        70101187        .       -       2       Parent=AT5G20570.2
###
NbV1Ch17        TAIR10  gene    70101151        70101187        .       -       .       ID=AT5G20570.3
NbV1Ch17        TAIR10  CDS     70101151        70101187        .       -       2       Parent=AT5G20570.3
###

Did I miss anything?

Thank you in advance

ADD REPLY
0
Entering edit mode

You should maybe consider another approach. Have a look here enter link description here table 5.

ADD REPLY

Login before adding your answer.

Traffic: 1745 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6