[Title Edited] How to update the GFF3 wrong file? [was:] featureCounts may be ignore chloroplast sequences in a plant genome when use -t exon and -g ID in a GFF file?
1
0
Entering edit mode
11 months ago
marcelolaia ▴ 10

I run featureCounts like this:

featureCounts -p -F GFF -a ./NCBI/GCF_000612305.1_Egrandis1_0_genomic.gff -t exon -g ID -o ./featureCounts/A1.counts.txt -f --extraAttributes gene ./raw_data/A1.sam.bam


And I extracted exons sequences from whole fasta genome by AGAT like this:

agat_sp_extract_sequences.pl -f ./NCBI/GCF_000612305.1_Egrandis1_0_genomic.fna -gff ./NCBI/GCF_000612305.1_Egrandis1_0_genomic.gff -t exon --split -o ./NCBI/GCF_000612305.1_Egrandis1_0_genomic.exons.fna


AGAT returns a few exons sequence that featureCounts have been ignored.

Please, see a list of them (not considered by featureCounts):

NC_014570.1 RefSeq  gene    539 1600    .   -   .   ID=gene-EucgrC_p001;Dbxref=GeneID:9829619;Name=psbA;gbkey=Gene;gene=psbA;gene_biotype=protein_coding;locus_tag=EucgrC_p001
NC_014570.1 RefSeq  CDS 539 1600    .   -   0   ID=cds-YP_003933945.1;Parent=gene-EucgrC_p001;Dbxref=Genbank:YP_003933945.1,GeneID:9829619;Name=YP_003933945.1;gbkey=CDS;gene=psbA;locus_tag=EucgrC_p001;product=photosystem II protein D1;protein_id=YP_003933945.1;transl_table=11
NC_014570.1 RefSeq  gene    151110  157952  .   -   .   ID=gene-EucgrC_p083;Dbxref=GeneID:9829735;Name=ycf2;gbkey=Gene;gene=ycf2;gene_biotype=protein_coding;locus_tag=EucgrC_p083
NC_014570.1 RefSeq  CDS 151110  157952  .   -   0   ID=cds-YP_003934017.1;Parent=gene-EucgrC_p083;Dbxref=Genbank:YP_003934017.1,GeneID:9829735;Name=YP_003934017.1;gbkey=CDS;gene=ycf2;locus_tag=EucgrC_p083;product=hypothetical chloroplast RF21;protein_id=YP_003934017.1;transl_table=11


However, this ones is considered by featureCounts:

NC_014570.1 RefSeq  gene    12775   14090   .   -   .   ID=gene-EucgrC_p006;Dbxref=GeneID:9829628;Name=atpF;gbkey=Gene;gene=atpF;gene_biotype=protein_coding;locus_tag=EucgrC_p006
NC_014570.1 RefSeq  CDS 13946   14090   .   -   0   ID=cds-YP_003933950.1;Parent=gene-EucgrC_p006;Dbxref=Genbank:YP_003933950.1,GeneID:9829628;Name=YP_003933950.1;gbkey=CDS;gene=atpF;locus_tag=EucgrC_p006;product=ATP synthase CF0 subunit I;protein_id=YP_003933950.1;transl_table=11
NC_014570.1 RefSeq  CDS 12775   13184   .   -   2   ID=cds-YP_003933950.1;Parent=gene-EucgrC_p006;Dbxref=Genbank:YP_003933950.1,GeneID:9829628;Name=YP_003933950.1;gbkey=CDS;gene=atpF;locus_tag=EucgrC_p006;product=ATP synthase CF0 subunit I;protein_id=YP_003933950.1;transl_table=11
NC_014570.1 RefSeq  exon    12778   13245   .   -   .   ID=id-EucgrC_p006-2;Parent=gene-EucgrC_p006;Dbxref=GeneID:9829628;exon_number=2;gbkey=exon;gene=atpF;locus_tag=EucgrC_p006;number=2
NC_014570.1 RefSeq  exon    13932   14090   .   -   .   ID=id-EucgrC_p006-1;Parent=gene-EucgrC_p006;Dbxref=GeneID:9829628;exon_number=1;gbkey=exon;gene=atpF;locus_tag=EucgrC_p006;number=1
NC_014570.1 RefSeq  gene    74329   74442   .   +   .   ID=gene-EucgrC_p044;Dbxref=GeneID:9829682;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=EucgrC_p044;part=1/2
NC_014570.1 RefSeq  gene    145633  146435  .   +   .   ID=gene-EucgrC_p044;Dbxref=GeneID:9829682;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=EucgrC_p044;part=2/2
NC_014570.1 RefSeq  CDS 74329   74442   .   +   0   ID=cds-YP_003933983.1;Parent=gene-EucgrC_p044;Dbxref=Genbank:YP_003933983.1,GeneID:9829682;Name=YP_003933983.1;exception=trans-splicing;gbkey=CDS;gene=rps12;locus_tag=EucgrC_p044;product=ribosomal protein S12;protein_id=YP_003933983.1;transl_table=11
NC_014570.1 RefSeq  CDS 145633  145863  .   +   0   ID=cds-YP_003933983.1;Parent=gene-EucgrC_p044;Dbxref=Genbank:YP_003933983.1,GeneID:9829682;Name=YP_003933983.1;exception=trans-splicing;gbkey=CDS;gene=rps12;locus_tag=EucgrC_p044;product=ribosomal protein S12;protein_id=YP_003933983.1;transl_table=11
NC_014570.1 RefSeq  CDS 146409  146435  .   +   0   ID=cds-YP_003933983.1;Parent=gene-EucgrC_p044;Dbxref=Genbank:YP_003933983.1,GeneID:9829682;Name=YP_003933983.1;exception=trans-splicing;gbkey=CDS;gene=rps12;locus_tag=EucgrC_p044;product=ribosomal protein S12;protein_id=YP_003933983.1;transl_table=11
NC_014570.1 RefSeq  exon    145633  145863  .   +   .   ID=id-EucgrC_p044-2;Parent=gene-EucgrC_p044;Dbxref=GeneID:9829682;exon_number=2;gbkey=exon;gene=rps12;locus_tag=EucgrC_p044;number=2


I could see that those that isn't considered by featureCounts haven't exon in the third column. This is the reason? Is there a way to by pass this? Editing the GFF file?

Thank you!

RNA-Seq featurecounts subread Forum • 433 views
0
Entering edit mode

Is there a way to by pass this? Editing the GFF file?

It is certainly worth a try. The -t exon option specifies the GTF feature type in column 3 of the GTF/GFF3 that featureCounts uses. I suggest you do a small experiment by changing the CDS term in the third column to exon for a couple of genes and see whether it works. Please do post your findings here for the benefit of the community.

4
Entering edit mode
11 months ago
Juke34 ★ 5.3k

It is because agat_sp_extract_sequences.pl was creating missing exons on the fly before to extract any sequence. If you just want to add missing exons and output a gff you can do it with agat_sp_gxf_to_gff3 .pl agat_convert_sp_gxf2gxf.pl.

0
Entering edit mode

Hi, agat_sp_gxf_to_gff3.pl do the trick! However, is it a possible to see all entries it have been updated? For example, it output in console:

Primary tag values (3rd column) not expected => region inverted_repeat sequence_feature cdna_match repeat_region                    -
-                                            Those primary tag are not yet taken into account by the parser!                                             -
-     If you wish to use it/them, pleast update the parameter feature json files accordingly (features_level1, features_level2 or features_level3).      -
-                                                                       To resume:                                                                       -
-                                                   * it must be a level1 feature if it has no parent.                                                   -
-                                    * it must be a level2 feature if it has a parent and this parent is from level1.                                    -
-                                  * it must be a level3 feature if it has a parent and this parent has also a parent.                                   -
-                                                                                                                                                        -
-                Currently the tool just ignore them, So if they where Level1,level2, a gene or RNA feature will be created accordingly.


How I could see how is that incorrect tags? And entries?

Primary tag values (3rd column) not expected => lnc_rna                                                 -
-                     In theory these values are not compatible with gff3 format because they are not part of the Sequence Ontology.                     -
-                                     If you want to follow rigourously the gff3 format, please visit this website:                                      -
-                                      https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md                                       -
-                                                      They provide tools to check the gff3 format.                                                      -
-                            Even if you have this warning, you should be able to use the gff3 output in most of gff3 tools.


Same here.

**********************************
*   Check1: _check_sequential    *
**********************************
We found 2 cases where part of the locus data where defined earlier in the file.
done in 0 seconds


How I could see how is these cases? And entries? To correct it. There are an output resume file?

**********************************
**********************************
We fixed 3160 cases where L2 and L1 features were missing
done in 351 seconds


same here.

**********************************
*   Check4: _remove_orphan_l1    *
**********************************
We removed 10 level1 features that had no subfeature linked to it.
done in 0 seconds


Same here.

**********************************
*      Check5: _check_exons      *
**********************************
_check_exons We modified the L2 RIGHT extremity !
_check_exons We modified the L2 RIGHT extremity !
We create 73 exons that were missing
We modified 7 exons positions that were wrong
done in 19 seconds


Here too.

**********************************
* Check7: _check_all_level2_posi *
*             tions              *
**********************************
We fixed 3 wrong level2 position cases
done in 13 seconds


Here too.

**********************************
* Check8: _check_all_level1_posi *
*             tions              *
**********************************
We fixed 5 wrong level1 position cases
done in 1 seconds


Too.

Thank you!

2
Entering edit mode

Right I have never thought to make a log file resuming all the check results. I should do that. What you can do is increasing the verbosity -v 2 in order to get more information. In newer version you can look at the log file generated, or use the debug mode.

The two first frame are particular.

• The first one tell you what is not yet implemented. You can add those feature types by appending the json files you get when executing agat_sp_gxf_to_gff3 .pl --expose agat_convert_sp_gxf2gxf.pl --expose. But I guess you don't really need them for your analysis anyway.

• The second one is just to inform you about feature types handle by the parser that are not officially accepted by the GFF3 format (i.e part of the ontology either defined in the header of the gff file if any, or the one provided within AGAT.).

I don't know how has been done this gff file but there is plenty of weirdness.