I have RNASeq data coming from a transgenic mouse (where a single gene's coding sequence is replaced by the gene sequence from another organism). I need to quantify the expression of this transgene (get the number of aligned reads). It seems to me that the most comprehensive and accurate approach would be the following:
1) generate a custom reference genome (.fasta file), where the respective sequence is replaced with the new transgenic sequence 2) modify the entries for this gene in the gene annotation (.gtf) 3) use the modified reference genome and gene annotation to do the alignment and gene quantification.
Does that sound like the correct approach, or are there some issues that I don't see?
Also, I have problems with implementing my plan. For #1, I couldn't find a good tool to replace the sequence with another one in the .fasta file - could you recommend something? I am not proficient with python, so would prefer a ready-made tool.
Another concern that I have is that because the original and new sequences in the fasta file have different lengths, the whole annotation (or at least one chromosome) will be misaligned in relation to the modified reference genome. How can this be resolved?
Could anyone suggest other general approaches? Would it be more straightforward to use just a single gene sequence as a reference, and align the whole dataset to it? If yes, then what tool would you recommend?