I couldn't find any information on this online, so I thought it would make a great first post :)
I have performed RNA-Seq (single-end, 150 bp) on a human cell line infected with a GFP reporter virus. I obtained the virus from a collaborator who doesn't know its exact sequence, but I would like to use the reporter virus sequence for downstream analyses. Can I deduce that sequence from the RNA-Seq data I already have, combined with the publicly available sequences of the wild-type (WT) virus strain and eGFP?
I have thought about the following strategies:
- Align using STAR, extract chimeric reads aligning to both the viral "chromosome" (WT strain) and eGFP "chromosome", and perform some visual analysis (IGV? Create contigs?) of the chimeric reads to deduce the breakpoints.
- Use STAR-Fusion to obtain potential breakpoint coordinates of fusion transcripts from the viral and eGFP chromosomes. It seems like this relies on a database of mostly human/mouse genes used to find fusions in cancer, so I'm not sure how adaptable this would be.
- Align reads to the human genome, then perform de novo transcriptome assembly on the reads that don't map. I haven't done this before either, so I'm not sure it will produce a contig covering the full reporter virus.
- Adapt a tool designed for finding viral integration sites in the human genome (e.g. Virus-Clip, VirusFinder) to find the insertion site of GFP in the viral genome.
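To make the first strategy a bit more concrete, this is roughly what I have in mind: a minimal Python sketch that tallies virus/eGFP junctions from STAR's `Chimeric.out.junction` file (which, as far as I understand, is only written when `--chimSegmentMin` is set above 0). It assumes the first six tab-separated columns are donor contig, donor breakpoint, donor strand, acceptor contig, acceptor breakpoint and acceptor strand; that is my reading of the STAR manual and should be checked against the version in use. The contig names `virus_WT` and `eGFP` and the function name `tally_junctions` are placeholders I made up.

```python
from collections import Counter


def tally_junctions(lines, virus="virus_WT", gfp="eGFP"):
    """Count chimeric junctions joining the viral contig to the eGFP contig.

    `lines` is any iterable of text lines in the (assumed) layout of STAR's
    Chimeric.out.junction: donor contig, donor breakpoint, donor strand,
    acceptor contig, acceptor breakpoint, acceptor strand, ...
    `virus` and `gfp` are placeholder contig names; use your FASTA headers.
    """
    counts = Counter()
    for line in lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Skip malformed rows and the optional column-name header row.
        if len(fields) < 6 or fields[0] == "chr_donorA":
            continue
        donor_chr, donor_pos = fields[0], fields[1]
        acc_chr, acc_pos = fields[3], fields[4]
        # Keep only junctions with one side on the virus and one on eGFP.
        if {donor_chr, acc_chr} == {virus, gfp}:
            counts[(donor_chr, int(donor_pos), acc_chr, int(acc_pos))] += 1
    return counts
```

I would then run it as `tally_junctions(open("Chimeric.out.junction"))` and look at `.most_common()`: breakpoints supported by many reads would be my candidate insertion boundaries, which I could then confirm in IGV. If there is extra sequence between the virus and eGFP (a linker or promoter, say), I assume the junctions would not line up this cleanly.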
I don't have experience with any of these strategies, so before I go down a rabbit hole I wanted to know if anybody has any advice on which one would be the best to try first/most likely to work, or has any alternative strategies to suggest.
Thanks a ton!