I need to create a reference sequence index for TopHat. This is straightforward for the human chromosomes alone, but what if I need to index genomes from human cell lines that contain integrated viral sequences, for instance SiHa (containing human papillomavirus) and Akata (containing Epstein-Barr virus)? How can I create a reference for these cells? Is there a GTF file for cell lines?
You can either create a combined index by concatenating the human and viral FASTA files (and likewise the human and viral GTF annotations), or create an index for the viral genomes alone and map to that first. The reads that remain unmapped can then be aligned against the human reference.
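For the first option, a minimal Python sketch of the concatenation step might look like the following. The function name `combine_references` and all file names are hypothetical, and the sketch assumes uncompressed FASTA and tab-separated GTF files. It refuses duplicate sequence names and checks that every GTF record refers to a sequence present in the combined FASTA, since a mismatch between the two (for example `chr1` versus `1`) is a common source of trouble.

```python
def combine_references(fasta_paths, gtf_paths, out_fasta, out_gtf):
    """Concatenate FASTA files and GTF files into one combined reference.

    Returns the sequence names in the order they were written.
    Hypothetical helper: names and behaviour are illustrative assumptions.
    """
    names = []
    seen = set()

    # Stream each FASTA file line by line so large genomes are not held in memory.
    with open(out_fasta, "w") as out:
        for path in fasta_paths:
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        # The sequence name is the first word of the header.
                        fields = line[1:].split()
                        name = fields[0] if fields else ""
                        if name in seen:
                            raise ValueError(f"duplicate sequence name: {name!r}")
                        seen.add(name)
                        names.append(name)
                    # Guard against a missing newline at the end of a file.
                    out.write(line if line.endswith("\n") else line + "\n")

    # Append the annotations, checking the first column against the FASTA names.
    with open(out_gtf, "w") as out:
        for path in gtf_paths:
            with open(path) as handle:
                for line in handle:
                    if not line.strip():
                        continue  # skip blank lines
                    if not line.startswith("#"):
                        seqname = line.split("\t", 1)[0]
                        if seqname not in seen:
                            raise ValueError(
                                f"{path}: sequence {seqname!r} is not in the combined FASTA"
                            )
                    out.write(line if line.endswith("\n") else line + "\n")

    return names
```

Called as, say, `combine_references(["human.fa", "hpv.fa"], ["human.gtf", "hpv.gtf"], "combined.fa", "combined.gtf")`, this should produce a FASTA file that can be indexed with `bowtie2-build` and a GTF file that can be passed to TopHat with `-G`.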