Suitable tool for mapping NGS data from enrichment experiment on combinatorial library
Filip • 20 months ago

Hi everyone,

I'll start with a short description of my experimental data. I have a combinatorial library of synthetic (you could say semi-random) sequences, which I used for some in vivo selection experiments, followed by amplicon sequencing on the Illumina platform. Now I'd like to calculate enrichment of the library members across multiple rounds of selection. I have a non-redundant list of sequences, generated by cd-hit clustering of the naive library from round 0, and I'd like to map the reads from the other rounds onto it to get per-member counts.
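For reference, the clustering step looked roughly like this (a sketch, not the exact command; the 100% identity threshold and file names are illustrative):

    # Cluster the naive round-0 library into a non-redundant reference set;
    # cd-hit-est is the nucleotide-mode program of the cd-hit package.
    # -c 1.0 : identity threshold (here, exact matches only)
    # -n 10  : word size recommended for high identity thresholds
    # -d 0   : keep full sequence names in the .clstr report
    cd-hit-est -i round0.fasta -o round0_nr.fasta -c 1.0 -n 10 -d 0 -M 0 -T 0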

My question is: what would be the alignment/mapping tool of choice for such a dataset? So far I've tried bwa-mem2 (with default settings) and it seems to be working well, maybe even too well. There were a lot of supplementary alignments, which I filtered out with samtools, but it still maps ~99.8% of the reads, which makes me suspicious that some of the mappings may be wrong, since some of the sequences can be quite similar locally. Would you recommend another tool, e.g. Salmon? Should I filter the .sam file from bwa-mem2 further, say by a minimum alignment length? Or something completely different? (To clarify: the reference sequences are all the same length, ~300 bp, and I'm mapping merged paired-end reads in .fastq that are ~30 bp longer, as they still contain flanks with primer sequences etc.)
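For concreteness, the mapping and filtering I've done so far looks roughly like this (file names are placeholders, and reading out per-reference counts via idxstats is just how I'd do it, not a fixed part of my pipeline):

    # Index the non-redundant round-0 reference and map merged reads from a later round.
    bwa-mem2 index round0_nr.fasta
    bwa-mem2 mem round0_nr.fasta round3_merged.fastq > round3.sam

    # Drop secondary (0x100) and supplementary (0x800) alignments in one pass,
    # then sort and index so idxstats can report mapped reads per reference.
    samtools view -b -F 0x900 round3.sam | samtools sort -o round3.filtered.bam
    samtools index round3.filtered.bam
    samtools idxstats round3.filtered.bam > round3_counts.tsv  # ref, length, mapped, unmapped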

Thanks a lot in advance.

salmon NGS bwa combinatorial-library

You should probably use ungapped alignments if you want to get perfect alignments (no secondary/partial alignments, etc.).
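With BBMap, for instance, something along these lines would keep only reads that match a reference exactly (a sketch; the file names are placeholders):

    # Accept only perfect, end-to-end, mismatch-free alignments;
    # ambiguous=toss discards reads that map equally well to several references.
    bbmap.sh ref=round0_nr.fasta in=round3_trimmed.fastq out=round3_perfect.sam \
        perfectmode=t ambiguous=toss

Note that this only works once the primer flanks are trimmed off, since your reads are ~30 bp longer than the references and an untrimmed read can never align perfectly end-to-end.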

As a non-standard solution, I suggest using clumpify.sh from the BBMap suite to create a non-redundant set of sequences from your data, with the counts recorded in the read headers (see: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.). Since you are merging your reads (they must be paired-end), merge them and then trim off the extraneous flanking sequence before running Clumpify, as in the sketch below.
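A minimal sketch of that merge, trim, and dedupe flow, assuming the ~30 bp of flanking sequence is split evenly between the two ends (the 15/15 split and file names are my assumptions):

    # Merge overlapping read pairs into single full-length amplicon reads.
    bbmerge.sh in1=r1.fastq.gz in2=r2.fastq.gz out=merged.fastq

    # Trim the fixed-length primer flanks (ftl trims bases from the left end,
    # ftr2 trims this many bases from the right end).
    bbduk.sh in=merged.fastq out=trimmed.fastq ftl=15 ftr2=15

    # Collapse exact duplicates (subs=0) and write the copy number into each read header.
    clumpify.sh in=trimmed.fastq out=nr.fastq dedupe=t subs=0 addcount=t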


Thanks a lot for the reply. So this would be analogous (though faster?) to clustering all the samples together with cd-hit at 100% identity and using the sizes of the individual clusters as counts?
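For comparison, on the cd-hit side I would get the counts by parsing the .clstr report, something like this rough sketch (every cluster there starts with a ">Cluster N" header followed by one line per member):

    # Print a tab-separated cluster number and member count for each cluster.
    awk '/^>Cluster/ { if (n) print name "\t" n; name = $2; n = 0; next }
         { n++ }
         END { if (n) print name "\t" n }' all_rounds.fasta.clstr > cluster_counts.tsv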

