Remove vector sequences from genome database
2.5 years ago
Helena ▴ 10

Hi,

I'm building a database of RefSeq genome sequences from selected bacterial species, which will be used to analyse Nanopore sequencing data from environmental samples.

To reduce the chance of false positives, I used the UniVec database to locate potential contamination and got substantial hits to several vectors. I am pretty new to bioinformatics, so I wanted to ask whether anyone has ideas on how to mask or remove the contamination from the genome sequences.

/Helena

refseq univec vector database contamination • 835 views

Are you only creating a database of main chromosomes from bacterial species? Normally the genomes may also include plasmids.


To start, I'll create a database containing the chromosomes, and afterwards I'll create one for plasmids :) I have already separated the plasmid sequences from the chromosomes.

2.5 years ago

I also work on this, though chiefly for lung metagenomes. We provide reference sequences here (I don't know how useful they are for your purposes, as they contain no plasmids): https://github.com/MHH-RCUG/Wochenende#installation

I wrote a contamination masking tool, linked here, because unmasked contamination otherwise becomes a big source of false positives.

https://github.com/colindaven/blacklister
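
If you would rather see the general idea in code, here is a minimal sketch of masking vector hits with `N`. It is not how blacklister works internally, only an illustration. It assumes you ran BLAST with the genome contigs as the query and UniVec as the subject, and saved the hits in the default tabular format (`-outfmt 6`), where columns 7 and 8 are the 1-based query start and end. The function names are made up for this example.

```python
from collections import defaultdict


def parse_blast_hits(lines):
    """Collect 0-based, half-open hit intervals per query sequence.

    Assumes default BLAST tabular columns (outfmt 6):
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qstart, qend = int(fields[6]), int(fields[7])
        start, end = min(qstart, qend), max(qstart, qend)
        # Convert 1-based inclusive coordinates to 0-based half-open.
        hits[fields[0]].append((start - 1, end))
    return hits


def merge_intervals(intervals):
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mask_sequence(seq, intervals, mask_char="N"):
    """Return seq with every position inside the intervals replaced by mask_char."""
    parts, pos = [], 0
    for start, end in merge_intervals(intervals):
        end = min(end, len(seq))  # ignore coordinates past the sequence end
        if start >= end:
            continue
        parts.append(seq[pos:start])
        parts.append(mask_char * (end - start))
        pos = end
    parts.append(seq[pos:])
    return "".join(parts)
```

Masking rather than deleting keeps the sequence length and coordinates unchanged, so any existing annotation still lines up. You would still need to read and write the FASTA records yourself and decide which hits are strong enough to mask.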
