I have fastq files from an RNA sequencing experiment. My samples are human cells infected with an intracellular pathogen, so I would like to align all reads to both genomes (human and pathogen). I am working on Linux and have performed some standard alignments before, using STAR and an Ensembl reference genome.
I read that it is better to perform the alignment in one step rather than in two separate steps. However, I can't figure out how to build the "hybrid" STAR reference genome; ideally, I would like a "hybrid" genome in which the pathogen sequence looks like an additional chromosome at the end of the human genome.
For a standard alignment, I would use STAR with --runMode genomeGenerate to build the reference. I could give STAR a "hybrid" fasta obtained by concatenating the human and pathogen fasta files (simply using cat). Is that okay?
What about the .gtf files? How should I combine them to build the reference (and to count the aligned reads afterwards)?
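To make the question concrete, here is a rough sketch of what I have in mind. Everything specific in it is an assumption on my side: the file names are placeholders, the tiny toy sequences and annotations only stand in for the real NCBI downloads so that the snippet runs on its own, and the --sjdbOverhang and --runThreadN values are arbitrary. The STAR command is only printed, not executed.

```shell
#!/bin/sh
set -eu

# Toy stand-ins for the real downloads (hypothetical names and contents).
workdir=$(mktemp -d)
printf '>chr_human_toy\nACGTACGTAC\n' > "$workdir/human.fa"
printf '>pathogen_toy\nGGCCTTAAGG\n' > "$workdir/pathogen.fa"
printf '#!genome-build toy\nchr_human_toy\ttoy\texon\t1\t10\t.\t+\t.\tgene_id "h1"; transcript_id "h1.t1";\n' > "$workdir/human.gtf"
printf 'pathogen_toy\ttoy\texon\t1\t10\t.\t+\t.\tgene_id "p1"; transcript_id "p1.t1";\n' > "$workdir/pathogen.gtf"

# Human first, pathogen last, so the pathogen ends up as the final "chromosome".
cat "$workdir/human.fa" "$workdir/pathogen.fa" > "$workdir/hybrid.fa"

# Concatenate the annotations too, dropping '#' comment/header lines.
grep -hv '^#' "$workdir/human.gtf" "$workdir/pathogen.gtf" > "$workdir/hybrid.gtf"

# Sanity check: no sequence name may appear in both genomes.
dups=$(grep '^>' "$workdir/hybrid.fa" | cut -d' ' -f1 | sort | uniq -d)
if [ -n "$dups" ]; then
    echo "Duplicate sequence names: $dups" >&2
    exit 1
fi

# The index-building command I would then run on the real files
# (printed only; overhang and thread count are assumed values).
echo "STAR --runMode genomeGenerate --genomeDir $workdir/star_hybrid_index" \
     "--genomeFastaFiles $workdir/hybrid.fa --sjdbGTFfile $workdir/hybrid.gtf" \
     "--sjdbOverhang 100 --runThreadN 4"
```

My understanding is that the sequence names in the first column of the combined gtf have to match the fasta header names exactly, which is part of what I am unsure about.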
Note: I downloaded both fasta files from NCBI (as the pathogen sequence is only available from this resource), and both gtf files as well.
I am completely new to this kind of task and to the command line, so sorry if my question is badly formulated. Thanks to anyone who can help me through this!