Reference genome at chromosome level, but annotation only at scaffold level
1 day ago
murheisa • 0

Hello, I'm working with metatranscriptomics and I'm trying to remove host reads using STAR. There is a reference genome at chromosome level, but the annotation is at scaffold level. What should I do: run STAR with only the chromosome-level reference genome, or use the scaffold-level assembly together with the scaffold-level annotation? Also, the genome is in NCBI but the annotation is not; it is only available in Dryad, where it was recently updated.

reference rnaseq genome scaffolds annotation • 102 views
1 day ago

Hey murheisa,

For host read removal in metatranscriptomics using STAR, the key is to ensure your reference genome (FASTA) and annotation (GTF/GFF) are compatible. Specifically, the sequence names (contig/scaffold/chromosome) in the first column of the annotation must match the FASTA header names exactly. If they don't (e.g., chromosomal names like chr1 vs. scaffold names like scaffold_1), STAR cannot place the annotated exons on the genome: index generation with --sjdbGTFfile typically stops with a fatal error when no exon lines match any sequence in the FASTA, and exons on individual non-matching sequences are not used.
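A quick way to check this before building anything is to compare the two sets of names. This is only a sketch using standard Unix tools; the function names are made up for illustration, and it assumes a plain-text (uncompressed) FASTA and a tab-delimited GTF/GFF.

```shell
# Print the sequence names in a FASTA: first token after '>' on header lines.
fasta_names() {
    awk '/^>/ {sub(/^>/, "", $1); print $1}' "$1" | LC_ALL=C sort -u
}

# Print the sequence names used in a GTF/GFF: column 1 of non-comment lines.
gtf_names() {
    awk -F'\t' '!/^#/ && NF > 1 {print $1}' "$1" | LC_ALL=C sort -u
}

# List names that appear in the annotation but not in the FASTA.
# $1 = genome FASTA, $2 = annotation (GTF/GFF)
missing_from_fasta() {
    fa_tmp=$(mktemp)
    gtf_tmp=$(mktemp)
    fasta_names "$1" > "$fa_tmp"
    gtf_names "$2" > "$gtf_tmp"
    # comm -13 keeps only the lines unique to the second file
    LC_ALL=C comm -13 "$fa_tmp" "$gtf_tmp"
    rm -f "$fa_tmp" "$gtf_tmp"
}
```

For example, missing_from_fasta scaffolds.fa annotation.gtf should print nothing if every annotated sequence exists in the FASTA. If it prints most or all of the annotation's names, the two files come from different assemblies or naming schemes.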

Here are your options, based on what you've described:

  1. Use the chromosomal genome from NCBI without annotation: You can build the STAR index using just the FASTA (--genomeFastaFiles). STAR is still splice-aware without a GTF because it detects junctions de novo, but it loses the database of annotated junctions, which reduces sensitivity for reads that cross a junction with only a short overhang. To mitigate this, run in two-pass mode (--twopassMode Basic), which feeds the junctions discovered in the first pass into the second pass. This is fine for host removal if your goal is just to filter reads, but some spliced host reads may still fail to align and end up in the unmapped (putatively microbial) fraction.
  2. Use the scaffold-based assembly and annotation from Dryad: If Dryad provides a matching FASTA (scaffolds) along with the updated annotation, this is likely the better route. Build the index with both (--genomeFastaFiles scaffolds.fa --sjdbGTFfile annotation.gtf), so that annotated junctions are used during alignment, which is ideal for accurate host read depletion. If the annotation is GFF3 rather than GTF, either convert it to GTF or add --sjdbGTFtagExonParentTranscript Parent. Check the Dryad deposit for the corresponding genome assembly; if it's not there, you may need to source an older scaffold version from NCBI or elsewhere that matches the annotation.
  3. If you must use the chromosomal genome: Consider lifting over the scaffold annotation to the chromosomal assembly using tools like CrossMap or UCSC liftOver, if a chain file mapping scaffolds to chromosomes is available. This requires knowing the assembly versions (e.g., is the chromosomal one a newer release built from the same scaffolds?). Without that, it's error-prone and not recommended unless you're experienced with it.
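As a rough sketch of options 1 and 2, index generation could look like the following. The function names and argument order are just for illustration; the STAR flags are standard genomeGenerate options, but check them against the manual for your STAR version. --sjdbOverhang is normally set to read length minus 1.

```shell
# Option 1: index from the FASTA alone (no annotation).
# $1 = genome FASTA, $2 = output index directory, $3 = threads (default 4)
build_index_no_annotation() {
    mkdir -p "$2"
    STAR --runMode genomeGenerate \
         --genomeDir "$2" \
         --genomeFastaFiles "$1" \
         --runThreadN "${3:-4}"
}

# Option 2: index from a FASTA and a GTF that share sequence names.
# $1 = genome FASTA, $2 = GTF, $3 = output index directory,
# $4 = read length (default 101), $5 = threads (default 4)
build_index_with_annotation() {
    read_len="${4:-101}"
    mkdir -p "$3"
    STAR --runMode genomeGenerate \
         --genomeDir "$3" \
         --genomeFastaFiles "$1" \
         --sjdbGTFfile "$2" \
         --sjdbOverhang "$((read_len - 1))" \
         --runThreadN "${5:-4}"
}
```

For example, build_index_with_annotation scaffolds.fa annotation.gtf star_index_scaffolds 150 8 would build an annotated index for 150 bp reads using 8 threads. For a highly fragmented scaffold assembly, the STAR manual also suggests lowering --genomeChrBinNbits to keep memory use reasonable.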

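In general, for metatranscriptomics, I'd prioritize the assembly with the best annotation for host removal, as accurate junction handling helps avoid leaking host reads into your microbial fraction. Then extract the unmapped reads for downstream analysis, either directly from STAR with --outReadsUnmapped Fastx, or from the BAM with samtools (e.g. samtools view -f 4, or -f 12 for pairs with both mates unmapped). The samtools route requires running STAR with --outSAMunmapped Within so that unmapped reads are written to the BAM at all. Note that --outFilterType BySJout only filters alignments by junction support; it does not output unmapped reads.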
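A minimal sketch of the alignment and extraction step, assuming paired-end gzipped FASTQ input. Function names, file names and defaults here are placeholders, not anything from your data.

```shell
# Align reads to the host index and write non-aligning reads as FASTQ.
# $1 = index directory, $2 = R1 FASTQ (gzipped), $3 = R2 FASTQ (gzipped),
# $4 = output file prefix, $5 = threads (default 4)
align_and_keep_unmapped() {
    STAR --genomeDir "$1" \
         --readFilesIn "$2" "$3" \
         --readFilesCommand zcat \
         --twopassMode Basic \
         --outSAMtype BAM Unsorted \
         --outSAMunmapped Within \
         --outReadsUnmapped Fastx \
         --outFileNamePrefix "$4" \
         --runThreadN "${5:-4}"
}

# Alternative: pull pairs with both mates unmapped from the BAM.
# Flag 12 = 4 (read unmapped) + 8 (mate unmapped).
# $1 = input BAM, $2 = output R1 FASTQ, $3 = output R2 FASTQ
unmapped_pairs_from_bam() {
    samtools fastq -f 12 -1 "$2" -2 "$3" -0 /dev/null -s /dev/null -n "$1"
}
```

With a prefix of sample_, STAR writes the unmapped reads to sample_Unmapped.out.mate1 and sample_Unmapped.out.mate2 (FASTQ content despite the extension) and the alignments to sample_Aligned.out.bam. The STAR output also includes pairs where only one mate aligned, so the samtools -f 12 route is the stricter of the two if you only want pairs with no host alignment at all. The samtools function expects mates to be adjacent in the BAM, which is the case for STAR's unsorted output.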

Can you provide more details? What's the host organism/species? Links to the exact NCBI genome and Dryad annotation would help confirm compatibility.

Kevin
