Hello everybody,
I am new to bioinformatics so please excuse me if the question is quite stupid.
I am working on assembling a eukaryotic genome. For this I have three MiSeq paired-end libraries with 500, 5000 and 10000 bp insert sizes. I have assembled the genome, but because the organism does not grow alone in culture and is mixed with bacteria, I need to decontaminate the data. I do the decontamination with a script I wrote myself (blastn, followed by protein prediction and blastp), which gives me two files called "clean" and "contaminants".
What I want now is to map the reads to the "contaminants" file and keep only the reads that do not map to it, so that I can reassemble the genome from them. To do this I build a bowtie2 index from the "contaminants" file, map the reads against it, and use --un-conc to collect the unmapped reads. But there is a problem.
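As a rough sketch of the mapping step described above, the two command lines could be assembled like this. All file names and the index prefix are placeholders chosen for illustration, not the real paths, and the helper `build_commands` is hypothetical; only the bowtie2-build and bowtie2 options (-x, -1, -2, --un-conc, -S) are real.

```python
import shlex


def build_commands(contaminants_fa, index_prefix, reads_1, reads_2,
                   unconc_pattern, sam_out):
    """Return the index-building and mapping commands as argument lists.

    Nothing is executed here; the lists can be passed to subprocess.run().
    """
    # Step 1: build a bowtie2 index from the contaminant contigs.
    build = ["bowtie2-build", contaminants_fa, index_prefix]

    # Step 2: map both mates against that index. --un-conc writes the pairs
    # that did NOT align concordantly; a '%' in its argument is replaced by
    # 1 or 2 for the two mate files.
    align = [
        "bowtie2",
        "-x", index_prefix,
        "-1", reads_1,
        "-2", reads_2,
        "--un-conc", unconc_pattern,
        "-S", sam_out,
    ]
    return build, align


# Placeholder file names (assumptions for this example only).
build_cmd, align_cmd = build_commands(
    "contaminants.fa", "contaminants_idx",
    "lib500_R1.fastq", "lib500_R2.fastq",
    "unconc_R%.fastq", "lib500_vs_contaminants.sam",
)

print(shlex.join(build_cmd))
print(shlex.join(align_cmd))
```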
As far as I understand, the --un-conc output contains not only the pairs that did not map at all, but also the pairs that mapped discordantly. Because of this, I still get some bacterial contamination when I reassemble the genome. Is there any way to force bowtie2 to exclude these pairs as well, or is there other software that can do this?
What I want is the paired-end reads that do not map to the index at all (I also want to remove the pairs that map discordantly).
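To make the goal concrete: in SAM output, flag bit 0x4 means "this read is unmapped" and bit 0x8 means "the mate is unmapped", so a pair that does not map at all has both bits set on both records. Below is a minimal sketch of that filter, assuming the alignments were written as a SAM file (for example with -S). The function names and the four example pairs are made up for illustration.

```python
import io

READ_UNMAPPED = 0x4  # SAM flag bit: this read did not align
MATE_UNMAPPED = 0x8  # SAM flag bit: the other mate did not align


def pair_fully_unmapped(flag):
    """True only if neither this read nor its mate aligned."""
    both = READ_UNMAPPED | MATE_UNMAPPED
    return (flag & both) == both


def fully_unmapped_names(sam_lines):
    """Collect names of pairs in which neither mate aligned."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:  # skip blank or malformed lines
            continue
        if pair_fully_unmapped(int(fields[1])):
            names.add(fields[0])
    return names


# Tiny made-up SAM excerpt: one pair per situation.
example_sam = "\n".join([
    "@HD\tVN:1.0\tSO:unsorted",
    # neither mate aligned (flags 77 / 141): keep
    "pair_unmapped\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "pair_unmapped\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    # concordant alignment (flags 99 / 147): drop
    "pair_concordant\t99\tctg1\t10\t42\t4M\t=\t300\t294\tACGT\tIIII",
    "pair_concordant\t147\tctg1\t300\t42\t4M\t=\t10\t-294\tACGT\tIIII",
    # discordant alignment (flags 65 / 129): drop
    "pair_discordant\t65\tctg1\t10\t42\t4M\tctg2\t50\t0\tACGT\tIIII",
    "pair_discordant\t129\tctg2\t50\t42\t4M\tctg1\t10\t0\tACGT\tIIII",
    # only one mate aligned (flags 73 / 133): drop
    "pair_one_mate\t73\tctg1\t10\t42\t4M\t=\t10\t0\tACGT\tIIII",
    "pair_one_mate\t133\tctg1\t10\t0\t*\t=\t10\t0\tACGT\tIIII",
])

kept = fully_unmapped_names(io.StringIO(example_sam))
print(sorted(kept))
```

The same selection should be obtainable with samtools view -f 12, which keeps only records that have both flag bits set.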
I don't want to do it the other way around (mapping the reads to the "clean" file and using --al-conc), because I don't want to lose reads that were not used in the assembly.
Any suggestions are appreciated :) Thank you.