intersecting two GFF3 missing data from the second file
1
0
Entering edit mode
4.5 years ago
Ric ▴ 430

Hi, I have both StringTie (stringtie_merged.gtf.fasta.transdecoder.genome.gff3) and BRAKER2 (augustus.hints_utr.gff3) GFF3 files which I would like to intersect with bedtools. Whenever StringTie covers the genome region and BRAKER2 intersect them then I would like to remove BRAKER2 annotation. However, if BRAKER2 is the only in the region then I would like to keep it. Unfortunately, the following commands did not include the last condition as shown in this screenshot e.g. for g83538.t1.

bedtools intersect -wa -a stringtie_merged.gtf.fasta.transdecoder.genome.gff3 -b augustus.hints_utr.gff3 > bedtools.gff3
perl gff3sort.pl --precise   bedtools.gff3 > bedtools.gff3sort.gff

I tried also gff3_sp_complement_annotations.pl from the GAAS package but it kept the BRAKER2 annotation despite overlapping with StringTie e.g. g83533.t1 as shown in this screenshot. How it possible to remove BRAKER2 annotation when it is overlapping with StringTie?

What did I miss or is there a better tools to intersect GFF3 files?

Thank you in advance,

annotation bedtools genome • 1.2k views
ADD COMMENT
0
Entering edit mode

The screenshot is difficult to read. I don't know what is really the problem... I think we have to define more in details what you consider as overlapping.

I don't know for bedtools intersect but agat_sp_complement_annotations.pl considesr two features to be overlapping if:
- they are on the same strand.
- they are of the same type (e.g. mRNA can overlap a tRNA).
- they overlap at CDS level (for mRNA) or at exon level (for other type of feature).

=> Thus two mRNA can overlap (kept in the output) if they are not in the same strand because they are seen as two different locus.
=> Thus two mRNA can overlap (kept in the output) if they overlap in their UTR (because UTR are rarely well defined).
=> One mRNA can overlap another mRNA if their CDS is not overlapping (one can be included in the intron of the other).
=> One mRNA can overlap a tRNA.

So the output reflects these rules. Which rule do you not agree with? We can adapt the script and add extra paramters.

P.S: I see you have run GFF3sort after agat_sp_complement_annotations.plit is not needed. All script from GAAS with the prefix agat_sp provide the same sorting output.

ADD REPLY
0
Entering edit mode
4.5 years ago
Juke34 8.5k

You can use gff3_sp_complement_annotations.pl from the GAAS repository.

gff3_sp_complement_annotations.pl --ref stringtie_merged.gtf.fasta.transdecoder.genome.gff3 --add augustus.hints_utr.gff3

You can use agat_sp_complement_annotations.pl from the AGAT repository.

agat_sp_complement_annotations.pl --ref stringtie_merged.gtf.fasta.transdecoder.genome.gff3 --add augustus.hints_utr.gff3
ADD COMMENT
0
Entering edit mode

Thank you for your explanation. I think the output from your script is correct if mRNA has an opposite strand. However, do you think that mRNA with g83526.t1 might be affected by the UTR rule screenshot.

> grep "g83526.t1" augustus.hints_utr.gff3
NbV1Ch03    AUGUSTUS    mRNA    68949416    68950211    0.13    +   .   ID=g83526.t1;Parent=g83526
NbV1Ch03    AUGUSTUS    transcription_start_site    68949416    68949416    .   +   .   Parent=g83526.t1;
NbV1Ch03    AUGUSTUS    five_prime_utr  68949416    68949853    0.29    +   .   ID=g83526.t1.5UTR1;Parent=g83526.t1
NbV1Ch03    AUGUSTUS    exon    68949416    68949940    .   +   .   ID=g83526.t1.exon1;Parent=g83526.t1;
NbV1Ch03    AUGUSTUS    start_codon 68949854    68949856    .   +   0   Parent=g83526.t1;
NbV1Ch03    AUGUSTUS    CDS 68949854    68949940    0.68    +   0   ID=g83526.t1.CDS1;Parent=g83526.t1
NbV1Ch03    AUGUSTUS    intron  68949941    68950038    0.99    +   .   Parent=g83526.t1;
NbV1Ch03    AUGUSTUS    CDS 68950039    68950164    1   +   0   ID=g83526.t1.CDS2;Parent=g83526.t1
NbV1Ch03    AUGUSTUS    exon    68950039    68950211    .   +   .   ID=g83526.t1.exon2;Parent=g83526.t1;
NbV1Ch03    AUGUSTUS    stop_codon  68950162    68950164    .   +   0   Parent=g83526.t1;
NbV1Ch03    AUGUSTUS    three_prime_utr 68950165    68950211    0.4 +   .   ID=g83526.t1.3UTR1;Parent=g83526.t1
NbV1Ch03    AUGUSTUS    transcription_end_site  68950211    68950211    .   +   .   Parent=g83526.t1; 

> grep "MSTRG.4460.2.p1" stringtie_merged.gtf.fasta.transdecoder.genome.gff3
NbV1Ch03    transdecoder    mRNA    68936050    68950199    .   +   .   ID=MSTRG.4460.2.p1;Parent=MSTRG.4460;Name=ORF%20type%3Acomplete%20len%3A237%20%28%2B%29%2Cscore%3D56.52%2CXP_019243421.1%7C98.3%7C4.0e-133%2CSpermine_synth%7CPF01564.17%7C6.2e-62%2CMethyltransf_25%7CPF13649.6%7C1.1e-05%2CMethyltransf_30%7CPF05430.11%7C0.0009%2CMTS%7CPF05175.14%7C0.004%2CMethyltransf_24%7CPF13578.6%7C0.025%2CMethyltransf_11%7CPF08241.12%7C0.033%2CMethyltransf_12%7CPF08242.12%7C0.13%2CMethyltransf_23%7CPF13489.6%7C0.25%2CMethyltransf_4%7CPF02390.17%7C0.29
NbV1Ch03    transdecoder    five_prime_UTR  68936050    68936280    .   +   .   ID=MSTRG.4460.2.p1.utr5p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    five_prime_UTR  68936748    68936856    .   +   .   ID=MSTRG.4460.2.p1.utr5p2;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    five_prime_UTR  68937553    68937599    .   +   .   ID=MSTRG.4460.2.p1.utr5p3;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68936050    68936280    .   +   .   ID=MSTRG.4460.2.p1.exon1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68936748    68936856    .   +   .   ID=MSTRG.4460.2.p1.exon2;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68937553    68937807    .   +   .   ID=MSTRG.4460.2.p1.exon3;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 68937600    68937807    .   +   0   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68937906    68937950    .   +   .   ID=MSTRG.4460.2.p1.exon4;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 68937906    68937950    .   +   2   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68938785    68938855    .   +   .   ID=MSTRG.4460.2.p1.exon5;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 68938785    68938855    .   +   2   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68938965    68939087    .   +   .   ID=MSTRG.4460.2.p1.exon6;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 68938965    68939087    .   +   0   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68941190    68941294    .   +   .   ID=MSTRG.4460.2.p1.exon7;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 68941190    68941294    .   +   0   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68941380    68941452    .   +   .   ID=MSTRG.4460.2.p1.exon8;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 68941380    68941452    .   +   0   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68941799    68941870    .   +   .   ID=MSTRG.4460.2.p1.exon9;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 68941799    68941870    .   +   2   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68942256    68942448    .   +   .   ID=MSTRG.4460.2.p1.exon10;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 68942256    68942448    .   +   2   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68942546    68942675    .   +   .   ID=MSTRG.4460.2.p1.exon11;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 68942546    68942675    .   +   1   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68943249    68943509    .   +   .   ID=MSTRG.4460.2.p1.exon12;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 68943249    68943341    .   +   0   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    exon    68950039    68950199    .   +   .   ID=MSTRG.4460.2.p1.exon13;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    three_prime_UTR 68943342    68943509    .   +   .   ID=MSTRG.4460.2.p1.utr3p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    three_prime_UTR 68950039    68950199    .   +   .   ID=MSTRG.4460.2.p1.utr3p2;Parent=MSTRG.4460.2.p1

Thank you in advance,

ADD REPLY
0
Entering edit mode

I don't see your screenshot.
But from the case you show here, they are not overlapping in their CDS, so both are kept and saved as different locus.
Here your data without superfluous information (it makes easy to see that CDS do not overlap):

augustus:

NbV1Ch03    AUGUSTUS    CDS 49854    49940    0.68    +   0   ID=g83526.t1.CDS1;Parent=g83526.t1
NbV1Ch03    AUGUSTUS    CDS 50039    50164    1   +   0   ID=g83526.t1.CDS2;Parent=g83526.t1

stringtie:

NbV1Ch03    transdecoder    CDS 37600    37807    .   +   0   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 37906    37950    .   +   2   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 38785    38855    .   +   2   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 38965    39087    .   +   0   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 41190    41294    .   +   0   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 41380    41452    .   +   0   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 41799    41870    .   +   2   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 42256    42448    .   +   2   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 42546    42675    .   +   1   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
NbV1Ch03    transdecoder    CDS 43249    43341    .   +   0   ID=cds.MSTRG.4460.2.p1;Parent=MSTRG.4460.2.p1
ADD REPLY
0
Entering edit mode

Thank you for your explanation and for developing this tool.

ADD REPLY

Login before adding your answer.

Traffic: 2389 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6