I'm uncertain about what parameters I'm supposed to feed to FeatureCount. I have a BAM-file and a GFF file (from Prokka). The GFF-file looks like this:
003__g__Streptococcus_no_23 Prodigal:2.6 CDS 6098 6310 . - 0 ID=PLMABFGE_03080;inference=ab initio prediction:Prodigal:2.6;locus_tag=PLMABFGE_03080;product=hypothetical protein 003__g__Streptococcus_no_23 Prodigal:2.6 CDS 6340 6504 . - 0 ID=PLMABFGE_03081;inference=ab initio prediction:Prodigal:2.6;locus_tag=PLMABFGE_03081;product=hypothetical protein 003__g__Streptococcus_no_23 Prodigal:2.6 CDS 6641 7483 . + 0 ID=PLMABFGE_03082;Name=ykuT;db_xref=COG:COG0668;gene=ykuT;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:O34897;locus_tag=PLMABFGE_03082;product=putative MscS family protein YkuT 003__g__Streptococcus_no_23 Prodigal:2.6 CDS 7514 8083 . - 0 ID=PLMABFGE_03083;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P39157;locus_tag=PLMABFGE_03083;note=UPF0340 protein YwlG;product=hypothetical protein 003__g__Streptococcus_no_23 Prodigal:2.6 CDS 8115 8582 . - 0 ID=PLMABFGE_03084;Name=hmpT_2;db_xref=COG:COG4720;gene=hmpT_2;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:A2RIH3;locus_tag=PLMABFGE_03084;product=Thiamine precursor transporter HmpT
The BAM-file is an alignment of my raw FASTQ-files (from a metagenomic sample) mapped against the genes predicted from my assembled bins. The purpose is to figure out, how many reads are mapped to each gene (the depth/abundance of each gene) and lastly, cluster them together in COGs or some sort of functional categorization.
What should I set my -t and -g values to?
Did you try
-t CDS -g ID?
Out of curiosity (you tagged you question with
metagenomics), do you have RNAseq or DNAseq mappings?
Yeah, unfortunately that gives me "Unassigned_NoFeatures" for all of them. I'm working with DNAseq (Illumina).
Are the chromosome names identical?
Does the alignment contain the "Prodigal..." in the names?
Since you did DNA-Seq, I'd try to get a samtools idxstat report first, to get an idea where the alignments are located.
When you say chromosome names, where in the output do you refer to? I suppose I should mention, that the purpose is to figure out the abundance of the genes. The genes comes from a metagenomic sample. This is my output from samtools idxstats (The IDs follow the same patterns as the IDs in the GFF-file, this is just the start of the file instead of a some thousand lines down):
The 3rd column is mapped reads right? So there should be some. The reason there are no unmapped, is because I filtered it with samtools earlier on.
This is from the SUMMARY-file btw.
These are a couple of lines from the corresponding SAM-file (not sorted or filtered).
Your SEQ-ID in the example gff lines is 003__g__Streptococcus_no_23 Prodigal:2.6; that I wanted to know. Using featureCounts, the names and choordinates from your gff and the alignment have to be identical.
Here it seems that your alignment is on transcriptome level, while your gff is on genomic level (
ID=PLMABFGE_03080). In this case, the gff is obsolete as well as featureCounts. Find a way to count the read-pairs effectively on each transcript (e.g. using bedtools intersect, or bedops, or samtools faidx with prior filtering ).
Hi Michael, Thanks for your response. I see. One question though: How can my alignment be on transcriptome level, when I'm dealing with DNA data? And how can the alignment differ from the gff? The alignment is made with paired-end FASTQ files mapped to the ffn-files outputted from Prokka (the same run, as the gff file was outputted). Sorry if I'm misunderstanding something here.
featureCounts needs the GFF contig identifiers (in your case, 003__g__Streptococcus_no_23) to match the contig names from the bam file (in your case, PLMABFGE_05378).
So I have to alter one of the files, so they match (Either to 003__g__Streptococcus_no_23 or PLMABFGE_05378)? 003__g__Streptococcus_no_23 is a contig name, and since multiple genes have been annotated from that contig, multiple IDs (PLMABFGE_05378 for instance) have been generated. So how can I make them match? PLMABFGE_05378 is what Prokka asigned of ID to that specific gene, when it annotated. So there can be multiple IDs for the same contig.
Wouldnt it be the same, to just use samtools idxstats, and then make a small script to extract the read count? Instead of using Featurecounts.