Question

how to locate a gene sequence among fastq files containing short reads

7.9 years ago

jerrybug109 ▴ 10

Hello!

I've got a dozen different strains of bacteria whose whole genomes we've sequenced (we have paired-end reads, forward and reverse, for each strain). I want to find and locate a specific housekeeping gene in each strain.

Could I convert the FASTQ files into FASTA files, set up a BLAST database from the short-read FASTA files, and then BLAST the query gene sequence against it? Or would I need to assemble each genome first, make a database out of the assemblies, and then BLAST the query gene sequence against that?

Would appreciate your input, thanks :-)

ncbi blast genome • 3.1k views

updated 12 months ago by Ram 43k • written 7.9 years ago by jerrybug109 ▴ 10

Don't do any of that... yet. Make a small "genome" containing just the gene(s) you need (your known sequence, or examples from related strains) and align the reads against it with BBMap. Depending on how similar the "different" strains in your pool are, there is some risk that reads will multi-map. It sounds like you only want to see whether a specific gene is there, so go ahead and use the option ambig=all with BBMap to let reads map at all possible locations.

You could also try using BBSplit to bin the reads if you have the reference genomes for these strains.
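As a rough sketch of the BBMap suggestion above, the Python helper below only builds the command lines, one per strain, so they can be inspected before anything is run. The file names, the strain-to-reads mapping, and the choice of `outm=` (keep mapped reads only) and `nodisk` (do not write an index to disk) are assumptions for illustration; `ref=`, `in=`, `in2=` and `ambig=all` are the BBMap parameters referred to in the reply.

```python
def bbmap_command(gene_fasta, reads_r1, reads_r2, out_sam):
    """Build a bbmap.sh call mapping one strain's paired reads to the gene."""
    return [
        "bbmap.sh",
        f"ref={gene_fasta}",  # small "genome" holding only the gene(s) of interest
        f"in={reads_r1}",     # forward reads
        f"in2={reads_r2}",    # reverse reads
        f"outm={out_sam}",    # assumption: write only the reads that mapped
        "ambig=all",          # keep multi-mapping reads at every location
        "nodisk",             # assumption: build the index in memory only
    ]


def commands_for_strains(gene_fasta, strains):
    """strains maps a strain name to its (R1, R2) FASTQ paths (hypothetical layout)."""
    return {
        name: bbmap_command(gene_fasta, r1, r2, f"{name}.gene_hits.sam")
        for name, (r1, r2) in strains.items()
    }
```

Each command can then be executed with `subprocess.run(cmd, check=True)`, provided `bbmap.sh` is on the `PATH`. A strain whose output file contains few or no reads is a candidate for lacking the gene, or for carrying a version too divergent to map.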

7.9 years ago by GenoMax 141k

score 1 · Answer 1 · 2016-06-16

Using BLAST that way is very inefficient. If you are really impatient for quick results and you already have a sequence of the housekeeping gene from the same species, then you can take that sequence as the reference and map all your reads to it with 'bwa mem'. If you can afford to wait about 5 minutes longer, you should assemble your reads with SPAdes.
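The two routes in this paragraph might look roughly like the following. This is a minimal sketch that only assembles the command lines; the file names, output directory and thread count are placeholders, and the helper names are hypothetical.

```python
def bwa_index_command(gene_fasta):
    """Index the gene sequence so bwa can use it as a reference."""
    return ["bwa", "index", gene_fasta]


def bwa_mem_reads_command(gene_fasta, reads_r1, reads_r2, threads=4):
    """Quick route: map the paired reads straight to the gene (SAM goes to stdout)."""
    return ["bwa", "mem", "-t", str(threads), gene_fasta, reads_r1, reads_r2]


def spades_command(reads_r1, reads_r2, out_dir, threads=4):
    """Slower route: assemble the paired reads of one strain with SPAdes."""
    return [
        "spades.py",
        "-1", reads_r1,   # forward reads
        "-2", reads_r2,   # reverse reads
        "-o", out_dir,    # SPAdes writes contigs.fasta into this directory
        "-t", str(threads),
    ]
```

Because `bwa mem` prints its alignments to standard output, running it through `subprocess.run(cmd, stdout=open("strain1.sam", "w"), check=True)` or a shell redirect captures the SAM file. The index command has to be run once before any mapping.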

After assembly, there is still no need to BLAST. It is much easier to map the contigs to the sequence of the housekeeping gene with 'bwa mem'. You can even feed the contigs from all of your isolates into 'bwa mem' in a single run, and you will get a nice little BAM file (once the SAM output is converted) showing a multiple sequence alignment of all the isolates across the housekeeping gene. However, if it really is a housekeeping gene, then it will be present in all of the isolates.
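One practical detail when pooling contigs from several isolates: assemblers tend to reuse contig names (for example `NODE_1...`), so the names should be made unique before mapping. The sketch below prefixes each contig header with its isolate name and builds the single `bwa mem` call; the `isolate|contig` naming scheme, the file paths and the function names are assumptions, not part of the original answer.

```python
def pool_contigs(assemblies, pooled_fasta):
    """Concatenate assemblies into one FASTA, prefixing headers with the isolate name.

    assemblies maps an isolate name to the path of its contigs FASTA.
    Returns the number of contigs written.
    """
    n_contigs = 0
    with open(pooled_fasta, "w") as out:
        for isolate, path in assemblies.items():
            with open(path) as fh:
                for line in fh:
                    line = line.rstrip("\r\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        # e.g. ">NODE_1" becomes ">iso1|NODE_1"
                        out.write(f">{isolate}|{line[1:]}\n")
                        n_contigs += 1
                    else:
                        out.write(line + "\n")
    return n_contigs


def bwa_mem_contigs_command(gene_fasta, pooled_fasta):
    """Map the pooled contigs to the (already indexed) gene sequence."""
    return ["bwa", "mem", gene_fasta, pooled_fasta]
```

The SAM output can then be sorted and converted with `samtools sort -o pooled.bam pooled.sam` and indexed with `samtools index pooled.bam` for viewing. The isolate prefix on each aligned contig shows which strain a given alignment came from.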