Question: Filter out Human Contigs from Assembly
gravatar for tucanj
2.8 years ago by
tucanj70 wrote:

I have an RNA-Seq assembly. Human reads were aligned and removed before assembly (with hisat2), however some reads made it to the assembly stage and were assembled into contigs. I am interested in the non-human contigs. What is the best way to filter out human contigs from the assembly in a sensitive and specific way while leaving non-human contigs?

Options I have considered:

  • Blast - unmapped contigs
  • Blat - same using pslReps to filter by coverage
  • Spaln or gmap?

When testing with a small sample of human transcripts (the first 5000 in the Ensembl fasta), I found that both Blast and Blat with minCoverage=90 yielded poor sensitivity (only ~3500 filtered out). Any better ideas?

ADD COMMENTlink written 2.8 years ago by tucanj70

Others might have better recommendations, but the way I would approach this would be to create a k-mer hash of the human genome and then find the number of human k-mers in each assembled contig using your language of choice. You can then filter out the assemblies that have more than a given number of human k-mers present based on hash membership. There is likely (from a theoretical standpoint) a large gap between the # of k-mers in a contig from a "good" assembly and one that has human reads mapped, though you might have to normalize by contig length. That shouldn't take more than a couple of hours even on a single process depending on your data size. A k of 32 should be sufficient; lower the k value if you run out of RAM.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Steven Lakin1.4k

Just by gut feeling, a local aligner like BLAST would also be my first idea.

Probably, you can extend your experiment by aligning human transcripts (the remaining of the Ensemble fasta) and transcripts from some other species, which is related to your organism of interest, to the human genome. Afterwards, you can extract different metrics for each mapping or best-hit per transcript (like coverage, percent-identity, bit-score, ...) and compare these metrics between transcripts from human and from the other species.

If you have some experience in it, also some machine learning might help. Because you know from which organism your transcripts originate from, you might use these technique to automatically find suitable thresholds. Or at least thresholds hat give you the best results, i.e., you might tune it towards higher sensitivity or specificity depending on you needs.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Manuel Landesfeind1.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 752 users visited in the last hour