Question

Error when using featurecounts

1

Entering edit mode

2.4 years ago

m.radz ▴ 10

I am doing some RNA analysis and am having issues trying to generate count data. I mapped my reads to a reference genome fasta file (genbank fasta file from ncbi) using bbmap and .sam files as the output.

I am now trying to use featurecounts to generate count data but am running into issues with my gtf file. The gtf file is also from ncbi and corresponds to the alignment fasta file used for mapping. This is it's structure:

#gtf-version 2.2
#!genome-build ASM14559v1
#!genome-build-accession NCBI_Assembly:GCA_000145595.1
CP002120.1      Genbank gene    1       1362    .       +       .       gene_id "SAA6008_00001"; transcript_id ""; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "SAA6008_00001";
CP002120.1      Genbank CDS     1       1359    .       +       0       gene_id "SAA6008_00001"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "dnaA"; locus_tag "SAA6008_00001"; product "chromosomal replication initiator protein DnaA"; protein_id "ADL64089.1"; transl_table "11"; exon_number "1";
CP002120.1      Genbank start_codon     1       3       .       +       0       gene_id "SAA6008_00001"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "dnaA"; locus_tag "SAA6008_00001"; product "chromosomal replication initiator protein DnaA"; protein_id "ADL64089.1"; transl_table "11"; exon_number "1";
CP002120.1      Genbank stop_codon      1360    1362    .       +       0       gene_id "SAA6008_00001"; transcript_id "unassigned_transcript_1"; gbkey "CDS"; gene "dnaA"; locus_tag "SAA6008_00001"; product "chromosomal replication initiator protein DnaA"; protein_id "ADL64089.1"; transl_table "11"; exon_number "1";
CP002120.1      Genbank gene    1640    2773    .       +       .       gene_id "SAA6008_00002"; transcript_id ""; gbkey "Gene"; gene "dnaN"; gene_biotype "protein_coding"; locus_tag "SAA6008_00002";
CP002120.1      Genbank CDS     1640    2770    .       +       0       gene_id "SAA6008_00002"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; gene "dnaN"; locus_tag "SAA6008_00002"; product "DNA-directed DNA polymerase beta subunit"; protein_id "ADL64090.1"; transl_table "11"; exon_number "1";
CP002120.1      Genbank start_codon     1640    1642    .       +       0       gene_id "SAA6008_00002"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; gene "dnaN"; locus_tag "SAA6008_00002"; product "DNA-directed DNA polymerase beta subunit"; protein_id "ADL64090.1"; transl_table "11"; exon_number "1";
CP002120.1      Genbank stop_codon      2771    2773    .       +       0       gene_id "SAA6008_00002"; transcript_id "unassigned_transcript_2"; gbkey "CDS"; gene "dnaN"; locus_tag "SAA6008_00002"; product "DNA-directed DNA polymerase beta subunit"; protein_id "ADL64090.1"; transl_table "11"; exon_number "1";
CP002120.1      Genbank gene    3163    3399    .       +       .       gene_id "SAA6008_00003"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "SAA6008_00003";
CP002120.1      Genbank CDS     3163    3396    .       +       0       gene_id "SAA6008_00003"; transcript_id "unassigned_transcript_3"; gbkey "CDS"; locus_tag "SAA6008_00003"; product "conserved hypothetical protein"; protein_id "ADL64091.1"; transl_table "11"; exon_number "1";
CP002120.1      Genbank start_codon     3163    3165    .       +       0       gene_id "SAA6008_00003"; transcript_id "unassigned_transcript_3"; gbkey "CDS"; locus_tag "SAA6008_00003"; product "conserved hypothetical protein"; protein_id "ADL64091.1"; transl_table "11"; exon_number "1";
CP002120.1      Genbank stop_codon      3397    3399    .       +       0       gene_id "SAA6008_00003"; transcript_id "unassigned_transcript_3"; gbkey "CDS"; locus_tag "SAA6008_00003"; product "conserved hypothetical protein"; protein_id "ADL64091.1"; transl_table "11"; exon_number "1";
CP002120.1      Genbank gene    3396    4508    .       +       .       gene_id "SAA6008_00004"; transcript_id ""; gbkey "Gene"; gene "recF"; gene_biotype "protein_coding"; locus_tag "SAA6008_00004";
CP002120.1      Genbank CDS     3396    4505    .       +       0       gene_id "SAA6008_00004"; transcript_id "unassigned_transcript_4"; gbkey "CDS"; gene "recF"; locus_tag "SAA6008_00004"; note "Required for DNA replication; binds preferentially to single-stranded, linear DNA"; product "DNA repair and genetic recombination protein"; protein_id "ADL64092.1"; t$
CP002120.1      Genbank start_codon     3396    3398    .       +       0       gene_id "SAA6008_00004"; transcript_id "unassigned_transcript_4"; gbkey "CDS"; gene "recF"; locus_tag "SAA6008_00004"; note "Required for DNA replication; binds preferentially to single-stranded, linear DNA"; product "DNA repair and genetic recombination protein"; protein_id "ADL640$
CP002120.1      Genbank stop_codon      4506    4508    .       +       0       gene_id "SAA6008_00004"; transcript_id "unassigned_transcript_4"; gbkey "CDS"; gene "recF"; locus_tag "SAA6008_00004"; note "Required for DNA replication; binds preferentially to single-stranded, linear DNA"; product "DNA repair and genetic recombination protein"; protein_id "ADL640$
CP002120.1      Genbank gene    4518    6452    .       +       .       gene_id "SAA6008_00005"; transcript_id ""; gbkey "Gene"; gene "gyrB"; gene_biotype "protein_coding"; locus_tag "SAA6008_00005";
CP002120.1      Genbank CDS     4518    6449    .       +       0       gene_id "SAA6008_00005"; transcript_id "unassigned_transcript_5"; gbkey "CDS"; gene "gyrB"; locus_tag "SAA6008_00005"; note "TIGRFAM: DNA gyrase, B subunit PFAM: DNA gyrase, subunit B domain protein; ATP-binding region, ATPase domain protein domain protein; TOPRIM

When I try to run featurecounts in R using this code:

featureCounts("s1_mapped.sam", annot.ext = "GCA_000145595.1_ASM14559v1_genomic.gtf", isGTFAnnotationFile = T)

I get the following error:

The chromosome name contains unexpected characters: "CP002120.1 Staphylococcus aureus subsp. aureus str. JKD6008, complete genome" (77 chars)
featureCounts has to stop running

FATAL Error: The program has to terminate and no counting file is generated.

Error in featureCounts("s1_mapped.sam", annot.ext = "GCA_000145595.1_ASM14559v1_genomic.gtf",  : 
  No counts were generated.

Do I need to format my gtf file in some way? I have also tried the code with GTF.featureType = "gene" but I get the same error.

featurecounts R rna • 2.0k views

ADD COMMENT • link updated 2.4 years ago by GenoMax 141k • written 2.4 years ago by m.radz ▴ 10

score 2 · Accepted Answer · 2021-11-30

2

Entering edit mode

2.4 years ago

GenoMax 141k

It looks like your SAM alignment file has the full fasta header name from your reference where as the annotation file has just the accession number (CP002120.1).

Two options:

A: You can create a new index after removing everything after CP002120.1 Staphylococcus aureus subsp. aureus str. JKD6008, complete genome from your reference file leaving just CP002120 and realign the data to the new index.

B: You could also try to truncate the reference names in your existing alignment file by using reformat.sh from BBMap suite like this:

reformat.sh -Xmx2g in=your.sam out=edited.sam trimrname=t

Then use edited.sam for featureCounts.

ADD COMMENT • link 2.4 years ago by GenoMax 141k

0

Entering edit mode

You are correct the headers in the fasta file are:

>CP002120.1 Staphylococcus aureus subsp. aureus str. JKD6008, complete genome

I will try the reformat.sh option, if that doesn't work how would I go about doing option 1?

ADD REPLY • link 2.4 years ago by m.radz ▴ 10

0

Entering edit mode

What aligner did you use? Looks like it passed the entire fasta header from your reference file along (not many aligners do that, they generally truncate the headers after first white space).

You will need to edit the header in reference leaving just the accession. Save that file and re-build a new index, followed by re-alignment of data.

ADD REPLY • link 2.4 years ago by GenoMax 141k

0

Entering edit mode

I used bbmap. Ok I will continue with option 2, if that doesn't work I will delete the unwanted characters and re-do the mapping.

ADD REPLY • link 2.4 years ago by m.radz ▴ 10

3

Entering edit mode

Aha, if you used BBMap then option 2 will work for sure since you have BBMap installed.

In future remember to use option trd=t when you align the data using bbmap.sh. This will truncate reference names after first space. BBMap is one of the rare aligners that passes along the entire fasta header to index and then on to the alignments, unless you truncate the names using the option trd=t.

ADD REPLY • link 2.4 years ago by GenoMax 141k

0

Entering edit mode

Awesome thank you for your help!

ADD REPLY • link 2.4 years ago by m.radz ▴ 10

0

Entering edit mode

Ok so remapping after removing the extra string worked and featurecounts now works so thank you. The only thing I don't like is it is using the gene id (example SAA6008_00001) rather than the gene name (dnaA). Is there a way to run featurecounts so that it uses the gene name instead? I tried using GTF.attrType = "gene" but I get this error:

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene' 
An example of attributes included in your GTF annotation is 'gene_id "SAA6008_00003"; transcript_id ""; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "SAA6008_00003"; ' 
The program has to terminate.

Which I don't understand, it can find all the other columns but not gene

ADD REPLY • link 2.4 years ago by m.radz ▴ 10

0

Entering edit mode

Are you adding GTF.featureType = "CDS"? That would be needed for gene to work.

ADD REPLY • link 2.4 years ago by GenoMax 141k