I have a question related to a fairly common meta-genomic and meta-transcriptomic analysis.
I am working on a project whereby, among other things, we are assessing the expression of antibiotic resistance genes (ARGs) in gut microbiota of a species. This is a fairly common procedure for those working in this field.
The procedure works by assigning mRNAseq reads to their corresponding gene through mapping and counting, pretty much like htseq/deseq2. Counts define expression which translates to antibiotic resistance.
Now, one of the key features of ARGs is that some of them require SNP confirmation, which means that only the presence of certain alleles confers resistance, not just the expression of the gene.
My doubts start from here: If I count the reads that contain the SNPs, this will not be linearly proportional to the number of transcripts that have the resistance. This because my reads (single end, 100 nt) are shorter than the transcripts, so I will only find the proportion of reads that confer resistance at each detected SNP.
My approach so far has been to call SNPs, extract the allele fraction of each "resistance SNP", and then average them across each gene to infer a fraction of reads that confirm the resistance through the SNPs. But this workflow is quite dirty and has a lot of assumptions: how would you proceed? what could you suggest to me, if you did it in the past?
(note to experts: I can't use amr++ because it is made for paired-end reads)