Hello, I'm working with metatranscriptomics and I'm trying to remove host reads using STAR. There's a reference genome at chromosome level, but the annotation is on scaffolds. What should I do: use STAR with only the reference genome, or use the scaffolds together with the scaffold-based annotation?
Also, the genome is in NCBI but the annotation is not; it is only in Dryad, and was recently updated.
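For host read removal in metatranscriptomics using STAR, the key is to ensure your reference genome (FASTA) and annotation (GTF/GFF) are compatible. Specifically, the sequence names in column 1 of the annotation must match the FASTA header names exactly, and the coordinates must refer to the same assembly. If the names don't match (e.g., chromosomal names like chr1 vs. scaffold names like scaffold_1), building the index with --sjdbGTFfile will typically either fail with an error (when no annotation records match any sequence) or leave the unmatched records unused. Matching names alone are not sufficient: scaffold coordinates placed on a chromosome-level FASTA would give wrong junction positions.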
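Before building an index, the name check described above can be done in a few lines. The following is a minimal sketch, not a definitive tool: the function names are hypothetical, and it assumes the annotation is a tab-separated GTF/GFF with the sequence name in column 1 and that STAR takes each FASTA sequence name from the header up to the first whitespace.

```python
import gzip


def _open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def fasta_seq_names(fasta_path):
    """Collect sequence names: the first whitespace-delimited token of each '>' header."""
    names = set()
    with _open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def annotation_seq_names(annotation_path):
    """Collect column-1 values of a GTF/GFF file, skipping comments and blank lines."""
    names = set()
    with _open_text(annotation_path) as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # GFF3 files may embed sequences after this directive
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0].strip())
    return names


def compare_names(fasta_path, annotation_path):
    """Report which sequence names are shared and which appear on one side only."""
    fasta = fasta_seq_names(fasta_path)
    annot = annotation_seq_names(annotation_path)
    return {
        "shared": fasta & annot,
        "annotation_only": annot - fasta,
        "fasta_only": fasta - annot,
    }
```

If `annotation_only` is non-empty, those records would have no matching sequence in the index. An empty `annotation_only` only shows that the names agree; it does not prove that the coordinates come from the same assembly version.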
Here are your options, based on what you've described:
Use the chromosomal genome from NCBI without annotation: You can
build the STAR index using just the FASTA (--genomeFastaFiles).
STAR is still splice-aware without a GTF, because it detects
junctions de novo from the reads. What you lose is the database of
annotated junctions, which mainly helps reads that cross a junction
with only a short overhang. To mitigate this, run in two-pass mode
(--twopassMode Basic), which re-aligns the reads using the junctions
discovered in the first pass. This is usually fine for host removal,
where the goal is only to filter reads, but a few spliced host reads
may fail to align and remain in the unmapped fraction, where they
can be mistaken for microbial reads.
Use the scaffold-based assembly and annotation from Dryad: If Dryad
provides a matching FASTA (scaffolds) along with the updated
annotation, this is likely the better route. Build the index with
both (--genomeFastaFiles scaffolds.fa --sjdbGTFfile annotation.gtf).
If the annotation is GFF3 rather than GTF, also set
--sjdbGTFtagExonParentTranscript Parent. This gives STAR the
annotated junctions, which is ideal for accurate host read
depletion. Check the Dryad deposit for the corresponding genome
assembly; if it's not there, you may need to source an older
scaffold version from NCBI or elsewhere that matches the annotation.
If you must use the chromosomal genome: Consider lifting over the
scaffold annotation to the chromosomal assembly using tools like
CrossMap or liftOver (from UCSC), if there's an available chain file
mapping scaffolds to chromosomes. This requires knowing the assembly
versions (e.g., is the chromosomal one a newer release?). Without
that, it's error-prone and not recommended unless you're experienced
with it.
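The first two options differ only in whether an annotation is passed when the index is generated. The sketch below assembles the corresponding `STAR --runMode genomeGenerate` command lines so they can be inspected before running. The helper name, directory names, thread count, and the read length of 150 are illustrative assumptions; `chromosomes.fa` is a placeholder for the NCBI FASTA. The STAR flags themselves are standard ones.

```python
import shlex


def star_index_command(genome_dir, fasta, annotation=None, read_length=None,
                       gff3=False, threads=8):
    """Build a STAR genomeGenerate command as an argument list (hypothetical helper)."""
    cmd = [
        "STAR", "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", str(genome_dir),
        "--genomeFastaFiles", str(fasta),
    ]
    if annotation is not None:
        cmd += ["--sjdbGTFfile", str(annotation)]
        if gff3:
            # GFF3 links exons to transcripts through the Parent attribute
            cmd += ["--sjdbGTFtagExonParentTranscript", "Parent"]
        if read_length is not None:
            # the usual recommendation is read length minus 1
            cmd += ["--sjdbOverhang", str(read_length - 1)]
    return cmd


# Option 1: chromosome-level FASTA only, no annotation
print(shlex.join(star_index_command("star_index_chr", "chromosomes.fa")))

# Option 2: scaffold FASTA plus the matching scaffold annotation
print(shlex.join(star_index_command(
    "star_index_scaffolds", "scaffolds.fa",
    annotation="annotation.gtf", read_length=150)))
```

Returning a list instead of a string keeps the command safe to pass to `subprocess.run(cmd, check=True)` without shell quoting issues, assuming STAR is installed and on the PATH.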
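In general, for metatranscriptomics, I'd prioritize the assembly with the best annotation for host removal, as accurate junction handling helps avoid leaking host reads into your microbial fraction. Then extract the unmapped reads for downstream analysis, either by having STAR write them to separate files (--outReadsUnmapped Fastx) or by keeping them in the BAM (--outSAMunmapped Within) and selecting them with samtools view -f 4 (for paired-end data, -f 12 keeps only pairs in which both mates are unmapped).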
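As a rough illustration of the filtering step, the sketch below builds a STAR alignment command that writes unaligned reads to separate files through `--outReadsUnmapped Fastx`, a STAR option intended for exactly this purpose. The helper names, file names, prefix, and thread count are hypothetical. It assumes gzip-compressed FASTQ input and STAR's usual output naming (`Unmapped.out.mate1` and `Unmapped.out.mate2` after the prefix).

```python
def star_host_filter_command(genome_dir, reads, out_prefix, threads=8,
                             two_pass=False, gzipped=True):
    """Build a STAR alignment command that also writes unmapped reads to files."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", str(genome_dir),
        "--readFilesIn", *[str(r) for r in reads],
        "--outFileNamePrefix", str(out_prefix),
        "--outSAMtype", "BAM", "Unsorted",
        "--outReadsUnmapped", "Fastx",  # unaligned reads go to Unmapped.out.mate*
    ]
    if gzipped:
        cmd += ["--readFilesCommand", "zcat"]
    if two_pass:
        # useful when the index was built without an annotation
        cmd += ["--twopassMode", "Basic"]
    return cmd


def unmapped_read_files(out_prefix, paired=True):
    """Paths where STAR is expected to write the reads that did not align to the host."""
    mates = ["mate1", "mate2"] if paired else ["mate1"]
    return [f"{out_prefix}Unmapped.out.{mate}" for mate in mates]


cmd = star_host_filter_command(
    "star_index_scaffolds",
    ["sample_R1.fastq.gz", "sample_R2.fastq.gz"],
    "sample_host_",
)
print(" ".join(cmd))
print(unmapped_read_files("sample_host_"))
```

The `Unmapped.out.mate*` files are in FASTQ format despite having no extension, so renaming and compressing them before passing them to downstream tools is a common follow-up step.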
Can you provide more details? What's the host organism/species? Links to the exact NCBI genome and Dryad annotation would help confirm compatibility.