Using featureCount with a GFF file
1
2
Entering edit mode
23 months ago

Hi

I'm uncertain about what parameters I'm supposed to feed to FeatureCount. I have a BAM-file and a GFF file (from Prokka). The GFF-file looks like this:

003__g__Streptococcus_no_23 Prodigal:2.6    CDS 6098    6310    .   -   0   ID=PLMABFGE_03080;inference=ab initio prediction:Prodigal:2.6;locus_tag=PLMABFGE_03080;product=hypothetical protein
003__g__Streptococcus_no_23 Prodigal:2.6    CDS 6340    6504    .   -   0   ID=PLMABFGE_03081;inference=ab initio prediction:Prodigal:2.6;locus_tag=PLMABFGE_03081;product=hypothetical protein
003__g__Streptococcus_no_23 Prodigal:2.6    CDS 6641    7483    .   +   0   ID=PLMABFGE_03082;Name=ykuT;db_xref=COG:COG0668;gene=ykuT;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:O34897;locus_tag=PLMABFGE_03082;product=putative MscS family protein YkuT
003__g__Streptococcus_no_23 Prodigal:2.6    CDS 7514    8083    .   -   0   ID=PLMABFGE_03083;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P39157;locus_tag=PLMABFGE_03083;note=UPF0340 protein YwlG;product=hypothetical protein
003__g__Streptococcus_no_23 Prodigal:2.6    CDS 8115    8582    .   -   0   ID=PLMABFGE_03084;Name=hmpT_2;db_xref=COG:COG4720;gene=hmpT_2;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:A2RIH3;locus_tag=PLMABFGE_03084;product=Thiamine precursor transporter HmpT


The BAM-file is an alignment of my raw FASTQ-files (from a metagenomic sample) mapped against the genes predicted from my assembled bins. The purpose is to figure out, how many reads are mapped to each gene (the depth/abundance of each gene) and lastly, cluster them together in COGs or some sort of functional categorization.

What should I set my -t and -g values to?

Featurecount gff alignment metagenomics • 2.2k views
2
Entering edit mode

Did you try -t CDS -g ID?

Out of curiosity (you tagged you question with metagenomics), do you have RNAseq or DNAseq mappings?

0
Entering edit mode

Yeah, unfortunately that gives me "Unassigned_NoFeatures" for all of them. I'm working with DNAseq (Illumina).

1
Entering edit mode

Are the chromosome names identical?

Does the alignment contain the "Prodigal..." in the names?

Since you did DNA-Seq, I'd try to get a samtools idxstat report first, to get an idea where the alignments are located.

0
Entering edit mode

When you say chromosome names, where in the output do you refer to? I suppose I should mention, that the purpose is to figure out the abundance of the genes. The genes comes from a metagenomic sample. This is my output from samtools idxstats (The IDs follow the same patterns as the IDs in the GFF-file, this is just the start of the file instead of a some thousand lines down):

PLMABFGE_00001  741 13  0
PLMABFGE_00002  465 6   0
PLMABFGE_00003  1104    11  0
PLMABFGE_00004  1059    6   0
PLMABFGE_00005  1176    12  0
PLMABFGE_00006  780 6   0
PLMABFGE_00007  936 10  0
PLMABFGE_00008  1155    9   0
PLMABFGE_00009  711 10  0
PLMABFGE_00010  1311    18  0
PLMABFGE_00011  1356    12  0
PLMABFGE_00012  1005    20  0


The 3rd column is mapped reads right? So there should be some. The reason there are no unmapped, is because I filtered it with samtools earlier on.

This is from the SUMMARY-file btw.

Status  mapping/90-20-09-2018/90-20-09-2018.bam
Assigned    0
Unassigned_Unmapped 0

Unassigned_MappingQuality   0

Unassigned_Chimera  0

Unassigned_FragmentLength   0

Unassigned_Duplicate    0

Unassigned_MultiMapping 0

Unassigned_Secondary    0

Unassigned_NonSplit 0

Unassigned_NoFeatures   71731

Unassigned_Overlapping_Length   0

Unassigned_Ambiguity    0


These are a couple of lines from the corresponding SAM-file (not sorted or filtered).

A00489:125:HCGNNDRXX:1:2216:9236:5243   99  PLMABFGE_05378  108 0   40S19M92S   =   108 19  TCCCGATTGAAGTTAGTGCATTCCATGCAGCATCACTGAATGCAGTGGATTCCATGCAGCATGGTAAAATGCATCCCGATTGAACGCAGTGTATTCCATGAAGCATGGCGAAATGGATCCCAATTGAATGCAGTGGATTCCATGCAGCGTG FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF,:FFFFFFF:FFFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFF NM:i:0  MD:Z:19 MC:Z:13S19M119S AS:i:19 XS:i:0
A00489:125:HCGNNDRXX:1:2216:9236:5243   147 PLMABFGE_05378  108 0   13S19M119S  =   108 -19 CAGCATCACTGAATGCAGTGGATTCCATGCAGCATGGTAAAATGCATCCCGATTGAACGCAGTGTATTCCATGAAGCATGGCGAAATGGATCCCAATTGAATGCAGTGGATTCCATGCAGCGTGGCCCAATGGATCTCGATTGAATGCAGT FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF

2
Entering edit mode

Your SEQ-ID in the example gff lines is 003__g__Streptococcus_no_23 Prodigal:2.6; that I wanted to know. Using featureCounts, the names and choordinates from your gff and the alignment have to be identical.

Here it seems that your alignment is on transcriptome level, while your gff is on genomic level (ID=PLMABFGE_03080). In this case, the gff is obsolete as well as featureCounts. Find a way to count the read-pairs effectively on each transcript (e.g. using bedtools intersect, or bedops, or samtools faidx with prior filtering ).

0
Entering edit mode

Hi Michael, Thanks for your response. I see. One question though: How can my alignment be on transcriptome level, when I'm dealing with DNA data? And how can the alignment differ from the gff? The alignment is made with paired-end FASTQ files mapped to the ffn-files outputted from Prokka (the same run, as the gff file was outputted). Sorry if I'm misunderstanding something here.

1
Entering edit mode

featureCounts needs the GFF contig identifiers (in your case, 003__g__Streptococcus_no_23) to match the contig names from the bam file (in your case, PLMABFGE_05378).

0
Entering edit mode

So I have to alter one of the files, so they match (Either to 003__g__Streptococcus_no_23 or PLMABFGE_05378)? 003__g__Streptococcus_no_23 is a contig name, and since multiple genes have been annotated from that contig, multiple IDs (PLMABFGE_05378 for instance) have been generated. So how can I make them match? PLMABFGE_05378 is what Prokka asigned of ID to that specific gene, when it annotated. So there can be multiple IDs for the same contig.

0
Entering edit mode

Wouldnt it be the same, to just use samtools idxstats, and then make a small script to extract the read count? Instead of using Featurecounts.

2
Entering edit mode
23 months ago

I found the issue. I was using an alignment of the genes rather than the chromosomes, hence the confusion. I made a new alignment, and everything looks good now!