How To Remove A Few Eukaryotic Sequences From A Bacterial Sequence Dataset?
11.3 years ago
Raghul ▴ 200

Hi, I pulled the bacterial sequences out of a eukaryotic dataset by comparing it against a bacterial database, which gave around 2,000 sequences. However, about 20 eukaryotic sequences remain in the resulting bacterial dataset. How do I remove these eukaryotic sequences? I am thinking of running blastn against the NCBI nt database and checking the hits manually, which should work. Is there a better way to do it? Also, is it possible to get an idea of which families these bacteria belong to from the blastn/blastp results?

Thank you, Raghul

classification blast • 3.3k views
11.3 years ago
Josh Herr 5.8k

I often have this issue with metagenomic data, so I understand your frustration here. Much of this depends on your experimental design, about which you haven't given us much to go on. If you're trying to remove bacterial contamination sequences from a human transcriptome study, that is one thing, but part of the problem with BLAST is that you often cannot be definitively sure whether what you have is bacterial, archaeal, or eukaryotic based solely on BLAST hits/homology. That said, depending on the type of data you have, it should be fairly easy to identify the bacteria to the family level as long as you have coding regions in your sequencing data.

I think using BLAST here is not a bad idea, especially if you only have 20 or so sequences which you could remove by eye after the query. I would use very strict search criteria.
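As a rough sketch of what that could look like once the BLAST run is done, the Python below flags queries whose best strict hit is eukaryotic. It assumes tab-separated output with a column layout I made up for the example (the `blastn` command in the comment is only illustrative, and the `sskingdoms` column needs the NCBI taxonomy database installed alongside the BLAST database). The thresholds and function names are placeholders, not recommended values.

```python
import csv
import io

# Assumed column layout, e.g. from a hypothetical run such as:
#   blastn -query seqs.fasta -db nt -evalue 1e-20 \
#          -outfmt "6 qseqid sseqid pident length evalue bitscore sskingdoms"
COLUMNS = ["qseqid", "sseqid", "pident", "length", "evalue", "bitscore", "sskingdoms"]


def best_hit_kingdoms(handle, min_pident=90.0, max_evalue=1e-20):
    """Map each query to the superkingdom field of its highest-bitscore hit.

    Hits failing the (illustrative) identity or e-value thresholds are ignored.
    """
    best = {}
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        if float(row["pident"]) < min_pident or float(row["evalue"]) > max_evalue:
            continue
        score = float(row["bitscore"])
        query = row["qseqid"]
        if query not in best or score > best[query][0]:
            best[query] = (score, row["sskingdoms"])
    return {query: kingdom for query, (_, kingdom) in best.items()}


def eukaryotic_queries(handle, **thresholds):
    """Return sorted query IDs whose best strict hit includes Eukaryota."""
    kingdoms = best_hit_kingdoms(handle, **thresholds)
    return sorted(q for q, k in kingdoms.items() if "Eukaryota" in k.split(";"))


# Made-up rows, only to show the expected input shape.
example = (
    "seq1\tsubjA\t99.1\t500\t1e-150\t900\tBacteria\n"
    "seq2\tsubjB\t97.5\t450\t1e-120\t800\tEukaryota\n"
    "seq2\tsubjC\t91.0\t450\t1e-80\t500\tBacteria\n"
    "seq3\tsubjD\t75.0\t100\t1e-5\t60\tEukaryota\n"
)
print(eukaryotic_queries(io.StringIO(example)))
```

Queries with no hit passing the thresholds (like `seq3` here) are not flagged at all, so with only 20 or so suspects it is still worth looking at those by eye.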

It may help to use a BLAST parser such as MEGAN for metagenomics, or other programs (possibly GLIMMER, depending on what type of data you have) that use different matching algorithms from BLAST. I would avoid faster, yet less stringent, programs such as BLAT. If you're using marker genes, you can cluster with one of many clustering programs (I like CD-HIT and UCLUST/USEARCH) and pull out the bacterial sequences fairly quickly.
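For the family-level question, the general idea behind parsers like MEGAN is, as far as I understand it, to place each sequence at the lowest common ancestor of its good hits. Here is a minimal sketch of that idea only, not of MEGAN's actual implementation. The lineages are typed in by hand for illustration; in practice they would come from the taxonomy of your BLAST hits.

```python
def lowest_common_ancestor(lineages):
    """Return the shared prefix of several root-to-leaf lineages.

    Each lineage is a list of taxon names ordered from superkingdom downwards.
    """
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # hits disagree from this rank downwards
        shared.append(ranks[0])
    return shared


# Illustrative lineages for the hits of a single query sequence.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Salmonella"],
]
print(lowest_common_ancestor(hits)[-1])
```

When the hits agree down to family but not genus, as in this example, the sequence ends up assigned at the family level. A sequence whose hits disagree at the very first rank gets an empty result, which is a useful flag for the ambiguous cases.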

