Suitable tool for mapping NGS data from enrichment experiment on combinatorial library
Filip • 20 months ago

Hi everyone,

I'll start with a short description of my experimental data. I have a combinatorial library of synthetic (you could say semi-random) sequences, which I used for some in vivo selection experiments, followed by amplicon sequencing on the Illumina platform. Now I'd like to calculate enrichment of the library members across multiple rounds of selection. I have a non-redundant list of sequences, generated by cd-hit clustering of the naive library from round 0, and I'd like to map the reads from the other rounds onto it to get per-member counts.
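For reference, the clustering step looked roughly like this (a sketch, not the exact command; the 100% identity threshold and file names are illustrative):

    # Cluster the naive round-0 library into a non-redundant reference set;
    # cd-hit-est is the nucleotide-mode program of the cd-hit package.
    # -c 1.0 : identity threshold (here, exact matches only)
    # -n 10  : word size recommended for high identity thresholds
    # -d 0   : keep full sequence names in the .clstr report
    cd-hit-est -i round0.fasta -o round0_nr.fasta -c 1.0 -n 10 -d 0 -M 0 -T 0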

My question is: what would be the alignment/mapping tool of choice for such a dataset? So far I've tried bwa-mem2 (with default settings) and it seems to be working well, maybe even too well. There were a lot of supplementary alignments, which I filtered out with samtools, but it still maps ~99.8% of the reads, which makes me suspicious that some of the mappings may be wrong, since some of the sequences can be quite similar locally. Would you recommend another tool, e.g. Salmon? Should I filter the .sam file from bwa-mem2 further, say by a minimum alignment length? Or something completely different? (To clarify: the reference sequences are all the same length, ~300 bp, and I'm mapping merged paired-end reads in .fastq that are ~30 bp longer, as they still contain flanks with primer sequences etc.)
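For concreteness, the mapping and filtering I've done so far looks roughly like this (file names are placeholders, and reading out per-reference counts via idxstats is just how I'd do it, not a fixed part of my pipeline):

    # Index the non-redundant round-0 reference and map merged reads from a later round.
    bwa-mem2 index round0_nr.fasta
    bwa-mem2 mem round0_nr.fasta round3_merged.fastq > round3.sam

    # Drop secondary (0x100) and supplementary (0x800) alignments in one pass,
    # then sort and index so idxstats can report mapped reads per reference.
    samtools view -b -F 0x900 round3.sam | samtools sort -o round3.filtered.bam
    samtools index round3.filtered.bam
    samtools idxstats round3.filtered.bam > round3_counts.tsv  # ref, length, mapped, unmapped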

Thanks a lot in advance.

salmon NGS bwa combinatorial-library

You should probably use ungapped alignments if you want to get perfect alignments (no secondary/partial alignments, etc.).
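With BBMap, for instance, something along these lines would keep only reads that match a reference exactly (a sketch; the file names are placeholders):

    # Accept only perfect, end-to-end, mismatch-free alignments;
    # ambiguous=toss discards reads that map equally well to several references.
    bbmap.sh ref=round0_nr.fasta in=round3_trimmed.fastq out=round3_perfect.sam \
        perfectmode=t ambiguous=toss

Note that this only works once the primer flanks are trimmed off, since your reads are ~30 bp longer than the references and an untrimmed read can never align perfectly end-to-end.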

As a non-standard solution, I suggest using clumpify.sh from the BBMap suite to create a non-redundant set of sequences from your data, with the counts recorded in the read headers (see: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.). Since you are merging your reads (they must be paired-end), merge them and then trim off the extraneous flanking sequence before running Clumpify, as in the sketch below.
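A minimal sketch of that merge, trim, and dedupe flow, assuming the ~30 bp of flanking sequence is split evenly between the two ends (the 15/15 split and file names are my assumptions):

    # Merge overlapping read pairs into single full-length amplicon reads.
    bbmerge.sh in1=r1.fastq.gz in2=r2.fastq.gz out=merged.fastq

    # Trim the fixed-length primer flanks (ftl trims bases from the left end,
    # ftr2 trims this many bases from the right end).
    bbduk.sh in=merged.fastq out=trimmed.fastq ftl=15 ftr2=15

    # Collapse exact duplicates (subs=0) and write the copy number into each read header.
    clumpify.sh in=trimmed.fastq out=nr.fastq dedupe=t subs=0 addcount=t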


Thanks a lot for the reply. So this would be analogous (though faster?) to clustering all the samples together with cd-hit at 100% identity and using the sizes of the individual clusters as counts?
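For comparison, on the cd-hit side I would get the counts by parsing the .clstr report, something like this rough sketch (every cluster there starts with a ">Cluster N" header followed by one line per member):

    # Print a tab-separated cluster number and member count for each cluster.
    awk '/^>Cluster/ { if (n) print name "\t" n; name = $2; n = 0; next }
         { n++ }
         END { if (n) print name "\t" n }' all_rounds.fasta.clstr > cluster_counts.tsv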

