Question

Different length returned by featureCounts and seqtk comp and AGAT

0

Entering edit mode

4.0 years ago

marcelolaia ▴ 10

AGAT, seqtk comp and featureCounts return different length for same fragment.

Here is the command line I used:

agat_sp_extract_sequences.pl -f ./NCBI/GCF_000612305.1_Egrandis1_0_genomic.fna -gff ./NCBI/GCF_000612305.1_Egrandis1_0_genomic.gff -t exon --split -o ./NCBI/GCF_000612305.1_Egrandis1_0_genomic.exons.fna
seqtk comp ./NCBI/GCF_000612305.1_Egrandis1_0_genomic.exons.fna > ./NCBI/GCF_000612305.1_Egrandis1_0_genomic.exons.GC.fna
featureCounts -p -F GFF -a ./NCBI/GCF_000612305.1_Egrandis1_0_genomic.gff -t exon -g ID -o ./featureCounts/A1.counts.txt -f --extraAttributes gene ./raw_data/A1.sam.bam

Here is the relevant part of files I have been used.

NCBI/GCF_000612305.1_Egrandis1_0_genomic.fna

Whole file.

GCF_000612305.1_Egrandis1_0_genomic.gff

NC_014570.1 RefSeq  exon    12778   13245   .   -   .   ID=id-EucgrC_p006-2;Parent=gene-EucgrC_p006;Dbxref=GeneID:9829628;exon_number=2;gbkey=exon;gene=atpF;locus_tag=EucgrC_p006;number=2
NC_014570.1 RefSeq  exon    47867   47974   .   -   .   ID=id-EucgrC_p021-1;Parent=gene-EucgrC_p021;Dbxref=GeneID:9829651;exon_number=1;gbkey=exon;gene=ycf3;locus_tag=EucgrC_p021;number=1
NC_014570.1 RefSeq  exon    46882   47109   .   -   .   ID=id-EucgrC_p021-2;Parent=gene-EucgrC_p021;Dbxref=GeneID:9829651;exon_number=2;gbkey=exon;gene=ycf3;locus_tag=EucgrC_p021;number=2
NC_014570.1 RefSeq  exon    45999   46148   .   -   .   ID=id-EucgrC_p021-3;Parent=gene-EucgrC_p021;Dbxref=GeneID:9829651;exon_number=3;gbkey=exon;gene=ycf3;locus_tag=EucgrC_p021;number=3
NC_014570.1 RefSeq  exon    74612   74836   .   -   .   ID=id-EucgrC_p046-3;Parent=gene-EucgrC_p046;Dbxref=GeneID:9829683;exon_number=3;gbkey=exon;gene=clpP;locus_tag=EucgrC_p046;number=3
NC_014570.1 RefSeq  exon    99515   100267  .   -   .   ID=id-EucgrC_p066-2;Parent=gene-EucgrC_p066;Dbxref=GeneID:9829705;exon_number=2;gbkey=exon;gene=ndhB;locus_tag=EucgrC_p066;number=2
NC_014570.1 RefSeq  exon    148743  149495  .   +   .   ID=id-EucgrC_p081-2;Parent=gene-EucgrC_p081;Dbxref=GeneID:9829732;exon_number=2;gbkey=exon;gene=ndhB;locus_tag=EucgrC_p081;number=2

GCF_000612305.1_Egrandis1_0_genomic.exons.fna

>id-EucgrC_p006-2 transcript=gene-EucgrC_p006 gene=nbis-gene-8 name=atpF seq_id=NC_014570.1 type=exon
GTTTGGTTCGGGAAGGGATTATGGAAGTTTTGAAATGAATGGAAAGATAATCTACTTTCA
TTAAGCGATTTATTAGATAATCGAAAACAGAGGATCTTGAATACTATTCGAAATTCAGAA
GAATTACGCGGCGGGGCCATTGAACAGCTGGAAAAAGCCCGGGCCCGTTTACGGAAAGTG
GAAATGGAAGCGGAGCAGTTTCGAGTGAATGGATATTCTGAGATAGAACAAGAAAAGTTG
AATCTGATTAATTCAACTTATAAGACCTTGGAACAATTAGAAAATTACAAAAACGAAACT
ATTCATTTTGAACAGCAAAGAGCGATTAATCAAGTCCGACAACGGGTTTTCCAACAAGCC
TTACAAGGAGCTCTAGGAACTCTGAATAGTTGTTTGAACAACGAGTTACATTTACGTACT
ATCAGTGCTAATATTGGCATGTTTGGGGCGATGAAAGAAATAACTGATTAG

>id-EucgrC_p021-1 transcript=gene-EucgrC_p021 gene=nbis-gene-21 name=ycf3 seq_id=NC_014570.1 type=exon
ATGCCTAGATCGAGGATAAATGGAAATTTTATTGATAAGACCTTTTCAATTGTAGCCAAT
ATCTTATTACGAATAATTCCGACAACTTCAGGAGAAAAAGAGGCATTTACCTATTACAGA
GATGGT

>id-EucgrC_p021-2 transcript=gene-EucgrC_p021 gene=nbis-gene-21 name=ycf3 seq_id=NC_014570.1 type=exon
GGATGTCGGCTCAATCCGAAGGAAATTATGCGGAAGCTTTACAGAATTATTATGAAGCTA
TGCGACTAGAAATTGATCCCTATGATCGAAGCTATATACTCTATAACATAGGCCTTATCC
ATACAAGTAACGGAGAACATACGAAAGCTTTGGAATATTATTTTCGGGCACTAGAACGAA
ACCCGTTTTTACCACAAGCTTTTAATAATATGGCTGTGATCTGTCATTAC

>id-EucgrC_p021-3 transcript=gene-EucgrC_p021 gene=nbis-gene-21 name=ycf3 seq_id=NC_014570.1 type=exon
CGGGGAGAGCAGGCCATTCGACAGGGGGATTCTGAAATTGCGGAGGCTTGGTTCGATCAA
GCCGCTGAGTATTGGAAACAAGCTATAGCGCTTACTCCTGGTAATTATATTGAAGCGCAG
AATTGGTTGAAGATCACGAGACGTTTCGAATAA

>id-EucgrC_p046-3 transcript=gene-EucgrC_p046 gene=nbis-gene-43 name=clpP seq_id=NC_014570.1 type=exon
AGGGTAATGATACATCAACCTGCGAGTTCTTTTTATGAGGCACAAACGGGAGAATTTATC
CTGGAAGCAGAAGAACTGCTGAAACTGCGCGAAACCATCACAAGAGTTTATGTACAAAGA
ACGGGCAAACCCCTATGGGTTGTATCCGAAGATATGGAAAGGGATGTTTTTATGTCAGCA
ACAGAAGCCCAAGCTCATGGAATTGTTGATCTTGTAGCGATTGAATAA

>id-EucgrC_p066-2 transcript=gene-EucgrC_p066 gene=nbis-gene-61 name=ndhB seq_id=NC_014570.1 type=exon
TCTCCCACTCCAGTCGTTGCTTTTCTTTCTGTTACTTCGAAAGTAGCTGCTTCAGCTTCA
GCCACTCGAATTTTCGATATTCCTTTTTATTTCTCATCAAACGAATGGCATCTTCTTCTG
GAAATCCTAGCTATTCTTAGCATGATATTGGGGAATCTCATTGCTATTACTCAAACAAGC
ATGAAACGTATGCTTGCATATTCGTCCATAGGTCAAATCGGATATGTAATTATTGGAATA
ATTGTTGGAGACTCAAATGGTGGATATGCGAGCATGATAACTTATATGCTGTTCTATATC
TCCATGAATCTAGGAACTTTTGCTTGCATTGTATTATTTGGTCTACGTACCGGAACTGAT
AACATTCGAGATTATGCAGGATTATACACGAAAGATCCTTTTTTGGCTCTCTCTTTAGCC
CTATGTCTCTTATCCCTAGGAGGTCTTCCTCCACTAGCAGGTTTTTTCGGAAAACTCCAT
TTATTCTGGTGTGGATGGCAGGCAGGCCTATACTTCTTGGTTTCAATAGGACTCCTTACG
AGCGTTATTTCTATCTACTATTATCTAAAAATAATCAAGTTATTAATGACTGGACGAAAC
CAAGAAATAACACCTCACGTGCGAAATTATAGAAGATCCCCTTTAAGATCAAACAATTCC
ATCGAATTGAGTATGATTGTATGTGTGATAGCATCTACTATACCAGGAATATCAATGAAC
CCGATTATTGCAATTGCTCAGGATACCCTTTTTTAG

>id-EucgrC_p081-2 transcript=gene-EucgrC_p081 gene=nbis-gene-74 name=ndhB seq_id=NC_014570.1 type=exon
TCTCCCACTCCAGTCGTTGCTTTTCTTTCTGTTACTTCGAAAGTAGCTGCTTCAGCTTCA
GCCACTCGAATTTTCGATATTCCTTTTTATTTCTCATCAAACGAATGGCATCTTCTTCTG
GAAATCCTAGCTATTCTTAGCATGATATTGGGGAATCTCATTGCTATTACTCAAACAAGC
ATGAAACGTATGCTTGCATATTCGTCCATAGGTCAAATCGGATATGTAATTATTGGAATA
ATTGTTGGAGACTCAAATGGTGGATATGCGAGCATGATAACTTATATGCTGTTCTATATC
TCCATGAATCTAGGAACTTTTGCTTGCATTGTATTATTTGGTCTACGTACCGGAACTGAT
AACATTCGAGATTATGCAGGATTATACACGAAAGATCCTTTTTTGGCTCTCTCTTTAGCC
CTATGTCTCTTATCCCTAGGAGGTCTTCCTCCACTAGCAGGTTTTTTCGGAAAACTCCAT
TTATTCTGGTGTGGATGGCAGGCAGGCCTATACTTCTTGGTTTCAATAGGACTCCTTACG
AGCGTTATTTCTATCTACTATTATCTAAAAATAATCAAGTTATTAATGACTGGACGAAAC
CAAGAAATAACACCTCACGTGCGAAATTATAGAAGATCCCCTTTAAGATCAAACAATTCC
ATCGAATTGAGTATGATTGTATGTGTGATAGCATCTACTATACCAGGAATATCAATGAAC
CCGATTATTGCAATTGCTCAGGATACCCTTTTTTAG

GCF_000612305.1_Egrandis1_0_genomic.exons.GC.fna

chr length  #A  #C  #G  #T  #4  #CpG
id-EucgrC_p006-2    471 169 68  110 124 0   38
id-EucgrC_p021-1    126 46  19  23  38  0   6
id-EucgrC_p021-2    230 76  43  44  67  0   20
id-EucgrC_p021-3    153 43  27  46  37  0   20
id-EucgrC_p046-3    228 76  40  56  56  0   14
id-EucgrC_p066-2    756 208 154 132 262 0   42
id-EucgrC_p081-2    756 208 154 132 262 0   42

A1.sam.bam

Geneid  Chr Start   End Strand  Length  gene    ./raw_data/A1.sam.bam
id-EucgrC_p006-2    NC_014570.1 12778   13245   -   468 atpF    15
id-EucgrC_p021-1    NC_014570.1 47867   47974   -   108 ycf3    1
id-EucgrC_p021-2    NC_014570.1 46882   47109   -   228 ycf3    1
id-EucgrC_p021-3    NC_014570.1 45999   46148   -   150 ycf3    1
id-EucgrC_p046-3    NC_014570.1 74612   74836   -   225 clpP    4
id-EucgrC_p066-2    NC_014570.1 99515   100267  -   753 ndhB    0
id-EucgrC_p081-2    NC_014570.1 148743  149495  +   753 ndhB    0

All other 390684 features have their length equals.

seqtk featurecounts subread Rsubread agat • 1.2k views

ADD COMMENT • link updated 4.0 years ago by Juke34 8.5k • written 4.0 years ago by marcelolaia ▴ 10

1

Entering edit mode

This is a Question-type post, not a Forum discussion. I've made the necessary change. To learn more, please read posts under the how-to tag.
Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

ADD REPLY • link 4.0 years ago by Ram 43k

0

Entering edit mode

Thank you so much! I will do the changes in my next post. Sorry for that!

ADD REPLY • link 4.0 years ago by marcelolaia ▴ 10

score 2 · Accepted Answer · 2020-05-06

2

Entering edit mode

4.0 years ago

Juke34 8.5k

It sounds that the stop codon is removed in the last file. That would explain why it is shorter by 3 nucleotides. Maybe sequence with size that do no change do not have stop codon. Could you check

ADD COMMENT • link 4.0 years ago by Juke34 8.5k

0

Entering edit mode

I do a comparison. Here it is.

Gene EucgrC_p006, exon 2
https://www.ncbi.nlm.nih.gov/nuccore/NC_014570.1?report=fasta&from=12778&to=13245
NCBI - 468
featureCounts - 468
AGAT - 471

AGAT return the TAG codon, that (may be) is ignored by featureCounts.

EucgrC_p021, exon 1
https://www.ncbi.nlm.nih.gov/nuccore/NC_014570.1?report=fasta&from=47867&to=47974
NCBI - 108
featureCounts - 108
AGAT - 126

AGAT have included the sequence ATGCCTAGATCGAGGATA

EucgrC_p021, exon 2
https://www.ncbi.nlm.nih.gov/nuccore/NC_014570.1?report=fasta&from=46882&to=47109
NCBI - 228
featureCounts - 228
AGAT - 230

AGAT have included the nucleotides GG at the beginner

EucgrC_p021, exon 3
https://www.ncbi.nlm.nih.gov/nuccore/NC_014570.1?report=fasta&from=45999&to=46148
NCBI - 150
featureCounts - 150
AGAT - 153

AGAT have included the sequence TAA at the end

EucgrC_p046, exon 3
https://www.ncbi.nlm.nih.gov/nuccore/NC_014570.1?report=fasta&from=74612&to=74836
NCBI - 225
featureCounts - 225
AGAT - 228

AGAT have included the sequence TAA at the end

EucgrC_p066, exon 2
https://www.ncbi.nlm.nih.gov/nuccore/NC_014570.1?report=fasta&from=99515&to=100267
NCBI - 753
featureCounts - 753
AGAT - 756

AGAT have included the sequence TAG at the end

EucgrC_p081, exon 2
https://www.ncbi.nlm.nih.gov/nuccore/NC_014570.1?report=fasta&from=148743&to=149495
NCBI - 753
featureCounts - 753
AGAT - 756

AGAT have included the sequence TAG at the end

May be there is some mistake in the GFF file. I do not have knowledge to investigate after this. So sorry.

ADD REPLY • link 4.0 years ago by marcelolaia ▴ 10

1

Entering edit mode

What you see in the GCF_000612305.1_Egrandis1_0_genomic.exons.GC.fna and A1.sam.bam files are based on the CDS features while with AGAT you have run a command to extracted the exon features. You can see it looking at start and stop position of CDS and exon. It is exactly the difference you see between AGAT result and the other files.

e.g for EucgrC_p006:

NC_014570.1 RefSeq  gene    12775   14090   .   -   .   ID=gene-EucgrC_p006;Dbxref=GeneID:9829628;Name=atpF;gbkey=Gene;gene=atpF;gene_biotype=protein_coding;locus_tag=EucgrC_p006
NC_014570.1 RefSeq  CDS 13946   14090   .   -   0   ID=cds-YP_003933950.1;Parent=gene-EucgrC_p006;Dbxref=Genbank:YP_003933950.1,GeneID:9829628;Name=YP_003933950.1;gbkey=CDS;gene=atpF;locus_tag=EucgrC_p006;product=ATP synthase CF0 subunit I;protein_id=YP_003933950.1;transl_table=11
NC_014570.1 RefSeq  CDS 12775   13184   .   -   2   ID=cds-YP_003933950.1;Parent=gene-EucgrC_p006;Dbxref=Genbank:YP_003933950.1,GeneID:9829628;Name=YP_003933950.1;gbkey=CDS;gene=atpF;locus_tag=EucgrC_p006;product=ATP synthase CF0 subunit I;protein_id=YP_003933950.1;transl_table=11
NC_014570.1 RefSeq  exon    12778   13245   .   -   .   ID=id-EucgrC_p006-2;Parent=gene-EucgrC_p006;Dbxref=GeneID:9829628;exon_number=2;gbkey=exon;gene=atpF;locus_tag=EucgrC_p006;number=2
NC_014570.1 RefSeq  exon    13932   14090   .   -   .   ID=id-EucgrC_p006-1;Parent=gene-EucgrC_p006;Dbxref=GeneID:9829628;exon_number=1;gbkey=exon;gene=atpF;locus_tag=EucgrC_p006;number=1

e.g for EucgrC_p021:

NC_014570.1 RefSeq  gene    45996   47992   .   -   .   ID=gene-EucgrC_p021;Dbxref=GeneID:9829651;Name=ycf3;gbkey=Gene;gene=ycf3;gene_biotype=protein_coding;locus_tag=EucgrC_p021
NC_014570.1 RefSeq  CDS 47869   47992   .   -   0   ID=cds-YP_003933963.1;Parent=gene-EucgrC_p021;Dbxref=Genbank:YP_003933963.1,GeneID:9829651;Name=YP_003933963.1;gbkey=CDS;gene=ycf3;locus_tag=EucgrC_p021;product=hypothetical chloroplast RF34;protein_id=YP_003933963.1;transl_table=11
NC_014570.1 RefSeq  CDS 46882   47111   .   -   2   ID=cds-YP_003933963.1;Parent=gene-EucgrC_p021;Dbxref=Genbank:YP_003933963.1,GeneID:9829651;Name=YP_003933963.1;gbkey=CDS;gene=ycf3;locus_tag=EucgrC_p021;product=hypothetical chloroplast RF34;protein_id=YP_003933963.1;transl_table=11
NC_014570.1 RefSeq  CDS 45996   46148   .   -   0   ID=cds-YP_003933963.1;Parent=gene-EucgrC_p021;Dbxref=Genbank:YP_003933963.1,GeneID:9829651;Name=YP_003933963.1;gbkey=CDS;gene=ycf3;locus_tag=EucgrC_p021;product=hypothetical chloroplast RF34;protein_id=YP_003933963.1;transl_table=11
NC_014570.1 RefSeq  exon    45999   46148   .   -   .   ID=id-EucgrC_p021-3;Parent=gene-EucgrC_p021;Dbxref=GeneID:9829651;exon_number=3;gbkey=exon;gene=ycf3;locus_tag=EucgrC_p021;number=3
NC_014570.1 RefSeq  exon    46882   47109   .   -   .   ID=id-EucgrC_p021-2;Parent=gene-EucgrC_p021;Dbxref=GeneID:9829651;exon_number=2;gbkey=exon;gene=ycf3;locus_tag=EucgrC_p021;number=2
NC_014570.1 RefSeq  exon    47867   47974   .   -   .   ID=id-EucgrC_p021-1;Parent=gene-EucgrC_p021;Dbxref=GeneID:9829651;exon_number=1;gbkey=exon;gene=ycf3;locus_tag=EucgrC_p021;number=1

This last case is really interesting. The extra bases in the exon compared to the CDS in exon1 can be UTR but, the extra nucleotide of the exon 2 is obviously an error in the file. Either the CDS should be two nucleotides longer either the exon 2 nucleotides shorter. This file is really fun and original. Some genes have only CDS defined, some others CDS and exons. Usually you don't do a mix of two approaches in a same file.

If you are insterested in CDS, change exon by CDS in your AGAT's command.

BTW the cases you show without stop codon in the CDS is weird too. In GFF stop codons are part of the CDS. I don't know for the other CDS but it sounds to be an error in those CDS.

ADD REPLY • link 3.9 years ago by Juke34 8.5k

0

Entering edit mode

Hi @Juke-34. Thank you. I don't will do anything here. Sorry. I will leave it as is. I do that because I am looking for a solution for this. I need to modify the GFF file to include exon where it isn't there. I'm looking at agat_sp_gxf_to_gff3.pl.

ADD REPLY • link 4.0 years ago by marcelolaia ▴ 10

0

Entering edit mode

Yes the script you mention will add the missing exons and output a gff.

Actually agat_sp_extract_sequences.pl was also creating the missing exons on the fly before to extract their fasta sequences.

ADD REPLY • link 4.0 years ago by Juke34 8.5k