Hello,
I have my DNA sequences for pure culture bacteria. I did sickle, megahit and metabat to obtain the bin. The completeness of abtained bin is 98.04% (through CheckM). I annotated the bin using Prokka and I got the output as gff file.
The gff file has following format:
k141_1006 Prodigal:002006 CDS 352 882 . + 0 ID=bin1_00001;eC_number=3.4.25.2;Name=hslV;db_xref=COG:COG5405;gene=hslV;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A7B8;locus_tag=bin1_00001;product=ATP-dependent protease subunit HslV
k141_1006 Prodigal:002006 CDS 920 2257 . + 0 ID=bin1_00002;Name=hslU;db_xref=COG:COG1220;gene=hslU;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A6H5;locus_tag=bin1_00002;product=ATP-dependent protease ATPase subunit HslU
k141_5512 Prodigal:002006 CDS 154 1296 . + 0 ID=bin1_00003;eC_number=1.1.1.290;Name=pdxB_1;db_xref=COG:COG0111;gene=pdxB_1;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q9I3W9;locus_tag=bin1_00003;product=Erythronate-4-phosphate dehydrogenase
k141_5512 Prodigal:002006 CDS 1350 1523 . - 0 ID=bin1_00004;inference=ab initio prediction:Prodigal:002006;locus_tag=bin1_00004;product=hypothetical protein
k141_5512 Prodigal:002006 CDS 1769 3526 . + 0 ID=bin1_00005;eC_number=7.5.2.6;Name=msbA_1;db_xref=COG:COG1132;gene=msbA_1;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q9KQW9;locus_tag=bin1_00005;product=Lipid A export ATP-binding/permease protein MsbA
k141_5512 Prodigal:002006 CDS 3590 4270 . - 0 ID=bin1_00006;inference=ab initio prediction:Prodigal:002006;locus_tag=bin1_00006;product=hypothetical protein
I am unsure how to get different gene ID so that I can use them for feature count. I can see for some contig it given the eC number but for other it didnt give any. Can someone please help.
Thank you in advance
Thank you for your reply I tried to used feature count after using ID = bin1_00001, bin1_00002...so on but its not generating any counts The output is
I can see that my sorted Bam file also has same ID. I am able to pull some mid lines of bam file.
The head of my gff file is
I am not sure where I am going wrong. I tried many times. Any help highly appreciated.
looking at this it seems that your input fasta file and input gff file are not 'in sync'. it looks like the sequence IDs used in the GFF file do not correspond to the one present in the fasta file.
Those
@SQ SN:**bin1_05320** LN:293
should be the same ID as in**k141_1006** 920 2257 + bin1_00002
, which seems to be clearly different. SO you need to make sure you're working with the correct files.Moreover looking at the length of the sequences in your BAM file , thye seem very short to be genomic sequences. Can it be you are using the transcriptome fasta file to map against? if so, you light be better of to directly use something like for instance Salmon on your transcriptome file.
Alternatively you can also use samtools stat option to simply count the number of reads aligning to a given transcript.
The FeatureCount approach requires you map against the genome and then use the GFF file to link those mappings to genes (as they are indicated in the GFF file)
In bam it starts as "@SQ SN:bin1_00001 LN:125" and in gff it starts as "1 k141_1006 920 2257 + bin1_00001". Do you mean I should make 2nd column of bam file as bin1_00001 instead of SN:bin1_00001 for all the rows. In this way it will match to geneID column of gff file. Please correct me If i am wrong.
Bam File
Yes the bam file I got it from my transcriptomes. I removed rRNA and contaminants from my RNA fastq files and then mapped my non_rRNA fastq files against the fasta file (obtained from my privately annotated genome).
I want to quantify the expression of different genes from transcriptomics data. Will salmon and samtools stat (instead of using feature count) help me to get number of reads assigned to each gene. As I studied salmon requires target transcriptome to build index. I have only my transcriptomes fastq files (after trimming and removing rRNA reads) to map to. Do I need to convert my sorted bam file back to fastq file and gtf file to fasta file (transcript.fa) and then run salmon. Please help me to understand If I am wrong.
By this "Feature count requires you to map against the genome": Do you mean mapping the transcriptome file with fasta file (my annotated genome) using feature count. I actually mapped using bowtie2. I am using feature count directly to count reads (after bowtie mapping process). Am I doing something wrong?
no, no. You need to make sure that the IDs from the fasta file/GFF file are the same. so the part behind
SN:
in the bam file should be the first thing in the fasta file after each>
in the fasta header.You never map to a fastq file (those are the reads, which you want to count), but always to a fasta file (either genome or transcriptome).
To work with the FeatureCount approach you will need to map your fastq files to the genome fasta file (that you assembled/annotated). the resulting bam file you can then use as input, together with the correct GFF file, for FeatureCount.
Is the first column in your GFF file really 1 2 3 4 ... or is that a copy/paste issue? That first column in GFF file should match the SN: ID. If that is not the case, don't start editing the BAM file but rerun the analysis with the correct files.
But still, looking at the length of those sequences in the BAM file (last column is the length) they do not look, length-wise, as being genomic sequences (or your assembly is very poor and thus very fragmented). They look more like lengths of genes/CDSs/transcripts ...