Question

Filter out Human Contigs from Assembly

2

Entering edit mode

7.9 years ago

tucanj ▴ 100

I have an RNA-Seq assembly. Human reads were aligned and removed before assembly (with hisat2), however some reads made it to the assembly stage and were assembled into contigs. I am interested in the non-human contigs. What is the best way to filter out human contigs from the assembly in a sensitive and specific way while leaving non-human contigs?

Options I have considered:

Blast - unmapped contigs
Blat - same using pslReps to filter by coverage
Spaln or gmap?

When testing with a small sample of human transcripts (the first 5000 in the Ensembl fasta), I found that both Blast and Blat with minCoverage=90 yielded poor sensitivity (only ~3500 filtered out). Any better ideas?

RNA-Seq Assembly sequencing alignment • 1.9k views

ADD COMMENT • link 7.9 years ago by tucanj ▴ 100

0

Entering edit mode

Others might have better recommendations, but the way I would approach this would be to create a k-mer hash of the human genome and then find the number of human k-mers in each assembled contig using your language of choice. You can then filter out the assemblies that have more than a given number of human k-mers present based on hash membership. There is likely (from a theoretical standpoint) a large gap between the # of k-mers in a contig from a "good" assembly and one that has human reads mapped, though you might have to normalize by contig length. That shouldn't take more than a couple of hours even on a single process depending on your data size. A k of 32 should be sufficient; lower the k value if you run out of RAM.

ADD REPLY • link 7.9 years ago by Steven Lakin ★ 1.8k

0

Entering edit mode

Just by gut feeling, a local aligner like BLAST would also be my first idea.

Probably, you can extend your experiment by aligning human transcripts (the remaining of the Ensemble fasta) and transcripts from some other species, which is related to your organism of interest, to the human genome. Afterwards, you can extract different metrics for each mapping or best-hit per transcript (like coverage, percent-identity, bit-score, ...) and compare these metrics between transcripts from human and from the other species.

If you have some experience in it, also some machine learning might help. Because you know from which organism your transcripts originate from, you might use these technique to automatically find suitable thresholds. Or at least thresholds hat give you the best results, i.e., you might tune it towards higher sensitivity or specificity depending on you needs.

ADD REPLY • link 7.9 years ago by Manuel Landesfeind ★ 1.4k