Question: featureCounts - annotation file issue
0
gravatar for gokberk
8 days ago by
gokberk20
gokberk20 wrote:

Hi all,

I've been using featureCounts to generate count tables out of my bam files. Previously, it worked fine with bam files which I generated with Subread.

Now, I'm using featureCounts with the bam files I generated with HiSAT2. However, non of the alignments were assigned to any genes, since the chromosome names in my gtf file and bam files were different. I changed the chromosome names in my bam files following the instructions in this post. However, the bam file I generate following this method turns out to be corrupted somehow. samtools view mybam.bam | head command does not give any output and when I run featureCounts, I receive "GZIP ERROR: -5" and still non of the alignments gets assigned to a gene. So I wonder how I can fix this discrepancy between my bam files and gtf file.

I used awk to format the header file and changed all chromosome names accordingly, but it didn't fix the issue. However, when I change chromosome names, blanks between columns change as well for some reason, meaning if there was a tab, it turns into a single space. So, I wonder if there is another way of solving this issue. I would be more than happy if you could help me out.

Here is how my gtf, header and old bam files look right now:

gtf:

NC_005943.1 RefSeq  CDS 3259    4213    .   +   0   transcript_id "gene34201"; gene_id "gene34201"; gene_name "ND1";
NC_005943.1 RefSeq  CDS 4421    5462    .   +   0   transcript_id "gene34202"; gene_id "gene34202"; gene_name "ND2";
NC_005943.1 RefSeq  CDS 5850    7418    .   +   0   transcript_id "gene34203"; gene_id "gene34203"; gene_name "COX1";
NC_005943.1 RefSeq  CDS 7532    8215    .   +   0   transcript_id "gene34204"; gene_id "gene34204"; gene_name "COX2";
NC_005943.1 RefSeq  CDS 8357    8563    .   +   0   transcript_id "gene34205"; gene_id "gene34205"; gene_name "ATP8";
NC_005943.1 RefSeq  CDS 8518    9198    .   +   0   transcript_id "gene34206"; gene_id "gene34206"; gene_name "ATP6";

old header file:

@HD VN:1.0  SO:coordinate
@SQ SN:chr1 LN:225584828
@SQ SN:chr10    LN:92844088
@SQ SN:chr11    LN:133663169
@SQ SN:chr12    LN:125506784
@SQ SN:chr13    LN:108979918
@SQ SN:chr14    LN:127894412
@SQ SN:chr15    LN:111343173
@SQ SN:chr16    LN:77216781
@SQ SN:chr17    LN:95684472
@SQ SN:chr18    LN:70235451
@SQ SN:chr19    LN:53671032
@SQ SN:chr2 LN:204787373
@SQ SN:chr20    LN:74971481
@SQ SN:chr3 LN:185818997
@SQ SN:chr4 LN:172585720
@SQ SN:chr5 LN:190429646
@SQ SN:chr6 LN:180051392
@SQ SN:chr7 LN:169600520
@SQ SN:chr8 LN:144306982
@SQ SN:chr9 LN:129882849
@SQ SN:chrM LN:16564
@SQ SN:chrUn_NW_014806053v1 LN:1653

fixed header file:

@HD    VN:1.0  SO:coordinate
@SQ SN:NC_041754.1  LN:225584828
@SQ SN:NC_041763.1  LN:92844088
@SQ SN:NC_041764.1  LN:133663169
@SQ SN:NC_041765.1  LN:125506784
@SQ SN:NC_041766.1  LN:108979918
@SQ SN:NC_041767.1  LN:127894412
@SQ SN:NC_041768.1  LN:111343173
@SQ SN:NC_041769.1  LN:77216781
@SQ SN:NC_041770.1  LN:95684472
@SQ SN:NC_041771.1  LN:70235451
@SQ SN:NC_041772.1  LN:53671032
@SQ SN:NC_041755.1  LN:204787373
@SQ SN:NC_041773.1  LN:74971481
@SQ SN:NC_041756.1  LN:185818997
@SQ SN:NC_041757.1  LN:172585720
@SQ SN:NC_041758.1  LN:190429646
@SQ SN:NC_041759.1  LN:180051392
@SQ SN:NC_041760.1  LN:169600520
@SQ SN:NC_041761.1  LN:144306982
@SQ SN:NC_041762.1  LN:129882849
@SQ SN:NC_005943.1  LN:16564
@SQ SN:NW_014806053.1   LN:1653
@SQ SN:NW_014806054.1   LN:997
@SQ SN:NW_014806055.1   LN:843

old bam:

SRR5991098.18949202 0   chr1    42294   60  40M *   0   0   GCCCTGCCTCCCACAGCTTTATTTCTTTTGGTTTTGGATG    HHHJJJJJJJJJJJJJJJJJIJJJIJJJJJIGIJJJIIJJ    AS:i:0  XN:i:0  XM:i:0XO:i:0    XG:i:0  NM:i:0  MD:Z:40 YT:Z:UU NH:i:1
SRR5991098.2109594  0   chr1    42295   60  40M *   0   0   CCCTGCCTCCCACAGCTTTATTTCTTTTGGTTTTGGATGC    HHGJJJJJJJJIJCHIIJJGIIJIJJJJIIFHHIGICHFH    AS:i:0  XN:i:0  XM:i:0XO:i:0    XG:i:0  NM:i:0  MD:Z:40 YT:Z:UU NH:i:1
SRR5991098.16345197 0   chr1    42301   60  40M *   0   0   CTCCCACAGCTTTATTTCTTTTGGTTTTGGATGCAAAACA    D:ADEDA;;=+AEEFFEECBEDI<D;>1B<>DD9DDEDAD    AS:i:0  XN:i:0  XM:i:0XO:i:0    XG:i:0  NM:i:0  MD:Z:40 YT:Z:UU NH:i:1
SRR5991098.31884339 0   chr1    42303   60  40M *   0   0   CCCACAGCTTTATTTCTTTTGGTTTTGGATGCAAAACAAA    HHHJJJJJJJJJJJJIJJJJJJHIIJIJIIJJJJJJJJJJ    AS:i:0  XN:i:0  XM:i:

Cheers!

rna-seq • 86 views
ADD COMMENTlink modified 8 days ago • written 8 days ago by gokberk20
4
gravatar for geek_y
8 days ago by
geek_y9.6k
Barcelona/CRG/London/Imperial
geek_y9.6k wrote:

I would change chromosome names in GTF which is also computationally efficient.

ADD COMMENTlink written 8 days ago by geek_y9.6k

Thanks for the advice geek_y! While I was trying to do what you suggested, I realized that the chromosome names in my gtf file and the chromosome names that are given at NCBI's website that I downloaded this gtf file do not match. So, I found the correct chromosome name from the gft file itself and it fixed my problem. Thanks again!

ADD REPLYlink written 8 days ago by gokberk20

Instead of closing the question, please mark the answer as accepted to indicate that it solved your problem. That will help others in the future.

enter image description here

ADD REPLYlink written 8 days ago by ATpoint16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1385 users visited in the last hour