Question: querying blast database for rna seq reads
0
gravatar for Varun Gupta
2.1 years ago by
Varun Gupta1.1k
United States
Varun Gupta1.1k wrote:

Hi,

I downloaded all the bacterial genomes from ncbi (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt)

I concatenated all the fasta files into one file and created a database using makeblastdb. I have reads in fastq file which I believe has been contaminated with bacterial sequences. Since I have made the database, how can I use blastn to query this database and will this give me read counts for all the reads mapping to a particular specific bacterial species?

Thanks for the help.

Regards Varun

rna-seq blastn blast • 1.1k views
ADD COMMENTlink modified 2.1 years ago by genomax71k • written 2.1 years ago by Varun Gupta1.1k

Are they paired-end fastq? Are you going to do any QC prior to looking for contamination?

You can align your reads to the reference fasta using bwa.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by st.ph.n2.5k

Hi, for that I have index my combined genome which is around 25G and has 12690 contigs. Do you think aligner would be fine indexing it and then mapping it?

ADD REPLYlink written 2.1 years ago by Varun Gupta1.1k
0
gravatar for genomax
2.1 years ago by
genomax71k
United States
genomax71k wrote:

While feasible using Blast for this will be a fairly compute intensive task (and can take a significant amount of time). If you believe that your data is contaminated then I suggest that you use BBSplit from BBMap suite with known genome and collect reads that map to that genome leaving the contamination behind in a different pool.

There are more efficient ways of assigning bacterial species (e.g. kraken) if you want to continue on present path.

ADD COMMENTlink written 2.1 years ago by genomax71k

Do you think indexing and mapping with an aligner would be an option to look into, although I think the genome to be indexed is very large 25G??

ADD REPLYlink written 2.1 years ago by Varun Gupta1.1k

Can you be specific about mapping what to what? BBMap should handle a genome that large AFAIK. You will need to have enough RAM available (for reference human genome generally needs about 30G).

ADD REPLYlink written 2.1 years ago by genomax71k

I created a reference fasta file for all the bacterial species and combined them into one fasta file. Taking this fasta file as genome, I want to map my paired-end RNA-Seq reads to this fasta file to see how many reads are coming from which bacterial species. Since I combined all the bacterial species, total size of the fasta file is 25G which is very big(bigger than human genome)

ADD REPLYlink written 2.1 years ago by Varun Gupta1.1k

See my comment above for feasibility of using that fasta for analysis. Should be possible as long as you have enough RAM available.

Otherwise this is a question of priorities. Are you interested in moving forward with the reads that map to the genome of your interest (which I assume is not of bacterial origin) or is your interest the bacterial reads themselves. There is also the 16S microbial blast index that NCBI makes available. Presumably your data will have 16S reads. Blasting a sample of reads can provide you with some information about the bacterial contamination.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax71k

This microbial blast index might give me some idea quickly. To use this microbial blast index, I should just use blastn with my fastq reads on this index? Thanks genomax for the help

ADD REPLYlink written 2.1 years ago by Varun Gupta1.1k

You can convert your fastq reads to fasta using reformat.sh in=your_reads.fq out=your_reads.fa from BBMap suite and then on to blast. (use in1= in2= out1= out2= for a PE dataset).

ADD REPLYlink written 2.1 years ago by genomax71k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1564 users visited in the last hour