I am very new to this kind of analysis, so please be kind, and I apologise if I have left out any information.
I am interested in calculating the abundance of carbohydrate-active enzyme sequences in each of my samples, using a co-assembly.
I have co-assembled my samples (MEGAHIT) and mapped the reads of each sample to the co-assembly (Bowtie). I have also used dbCAN to annotate the co-assembly against the carbohydrate-active enzyme database. I then used htseq-count to count the number of reads mapped to each gene in each sample. I therefore have the raw counts for each sample, but I am confused about how to normalise them. I also have a GTF file with all the gene calls for the co-assembly, which looks like this:
argelvor_000000000001 PROKKA CDS 2 304 . + . gene_id 1_1
argelvor_000000000002 PROKKA CDS 1 168 . - . gene_id 2_1
argelvor_000000000003 PROKKA CDS 1 384 . + . gene_id 3_1
argelvor_000000000004 PROKKA CDS 1 321 . + . gene_id 4_1
argelvor_000000000005 PROKKA CDS 30 530 . - . gene_id 5_1
argelvor_000000000006 PROKKA CDS 1 96 . + . gene_id 6_1
argelvor_000000000007 PROKKA CDS 1 558 . + . gene_id 7_1
argelvor_000000000008 PROKKA CDS 2 484 . - . gene_id 8_1
argelvor_000000000009 PROKKA CDS 2 142 . + . gene_id 9_1
argelvor_000000000009 PROKKA CDS 191 343 . + . gene_id 9_2
I also have a standard count matrix in which gene IDs are rows and samples are columns. Is it possible to calculate FPKM from this information, ideally in an automated way? I am most comfortable in R but would welcome any suggestions.
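For concreteness, FPKM is usually defined as count × 10^9 / (gene length in bp × total fragments in the sample), with gene length taken as end − start + 1 from the GTF. Below is a minimal sketch of that arithmetic in plain Python, not a definitive implementation. It assumes that the GTF fields are whitespace- or tab-separated with gene_id in the last column, that the library size is the per-sample sum of gene-assigned counts (the total number of mapped fragments is another common choice), and that the htseq-count summary rows beginning with "__" are dropped. The function names and the counts are made up for illustration; only the coordinates come from the GTF excerpt above.

```python
def gene_lengths_from_gtf(lines):
    """Return {gene_id: length in bp}, summing CDS intervals per gene_id."""
    lengths = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split into the 9 GTF columns; the attribute column is kept whole.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        start, end = int(fields[3]), int(fields[4])
        # Tolerate both 'gene_id 1_1' and 'gene_id "1_1";' styles.
        attrs = fields[8].replace('"', "").replace(";", " ").split()
        gene_id = attrs[attrs.index("gene_id") + 1]
        # GTF coordinates are 1-based and inclusive, hence the +1.
        lengths[gene_id] = lengths.get(gene_id, 0) + (end - start + 1)
    return lengths


def fpkm(counts, lengths):
    """counts: {gene_id: [count per sample]} -> {gene_id: [FPKM per sample]}."""
    # Drop htseq-count summary rows such as __no_feature and __ambiguous.
    genes = {g: row for g, row in counts.items() if not g.startswith("__")}
    n_samples = len(next(iter(genes.values())))
    # Assumption: library size = sum of gene-assigned counts in each sample.
    totals = [sum(row[i] for row in genes.values()) for i in range(n_samples)]
    result = {}
    for gene, row in genes.items():
        result[gene] = [
            c * 1e9 / (lengths[gene] * totals[i]) if totals[i] else 0.0
            for i, c in enumerate(row)
        ]
    return result


# Three records from the GTF excerpt; the counts are invented (two samples).
gtf_lines = [
    "argelvor_000000000001 PROKKA CDS 2 304 . + . gene_id 1_1",
    "argelvor_000000000002 PROKKA CDS 1 168 . - . gene_id 2_1",
    "argelvor_000000000009 PROKKA CDS 191 343 . + . gene_id 9_2",
]
counts = {
    "1_1": [303, 0],
    "2_1": [168, 10],
    "9_2": [0, 30],
    "__no_feature": [50, 60],
}

lengths = gene_lengths_from_gtf(gtf_lines)
fpkm_values = fpkm(counts, lengths)
for gene, row in fpkm_values.items():
    print(gene, [round(v, 1) for v in row])
```

The same arithmetic carries over directly to R on a count matrix and a vector of gene lengths. With single-end reads the result is strictly RPKM rather than FPKM, since each read is its own fragment.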
Once I have the FPKM values, I can then use the gene IDs to map them to the dbCAN output!