Dear community,
I have a list of approximately 1500 RNA sequences (from Sus scrofa) that I want to BLAST against the human refseq_rna database and since the number is more or less big, I cannot use NCBI's web GUI but I have to do that locally. My goal is to get the best matching human mRNA Accession number for each pig RNA I query. Unfortunately, I struggle with thres questions:
- Which DB from ftp.ncbi.nlm.nih.gov/blast/db should I download?
- How do I restrict my BLAST to human RNA hits only?
- How can I get the accession number of my hit?
In the example below you can see that I managed to run BLAST, but I struggeling with the best DB to download and the best parameters to run BLAST. Ideally, my result would include not only scoring values, but also the accession number of the human RNA that matches my pig RNA query best.
blastn = "/usr/bin/blastn" blast_db = "/mnt/sdb/projects/genomic_data/blast/human/GCF_000001405.38_top_level" input = "data/fasta_for_blast.fas" evalue = 1e-6 format = 6 colnames <- c("qseqid",
"sseqid",
"pident",
"length",
"mismatch",
"gapopen",
"qstart",
"qend",
"sstart",
"send",
"evalue",
"bitscore")
blast_out <- system2(command = "blastn",
args = c("-db", blast_db,
"-query", input,
"-outfmt", format,
"-evalue", evalue,
"-num_threads", 18,
"-ungapped"),
wait = TRUE,
stdout = TRUE) %>% as_tibble() %>% separate(col = value,
into = colnames,
sep = "\t",
convert = TRUE) blast_out
> blast_out
# A tibble: 5 x 12
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
<chr> <chr> <dbl> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl>
1 Test1 NC_000006.12 94.5 183 10 0 762 944 170569597 170569779 9.97e-77 294
2 Test1 NC_000006.12 97 100 3 0 942 1041 170571407 170571506 7.65e-41 175
3 Test1 NC_000006.12 95.6 68 3 0 1482 1549 170572632 170572699 2.54e-22 114
4 Test1 NC_000006.12 100 35 0 0 1758 1792 170572908 170572942 1.98e- 8 68
5 Test2 NC_000006.12 98 100 2 0 1604 1703 169704007 169704106 1.96e-42 181
(adapted from https://palfalvi.org/post/local-blast-from-r/)
Thank you so much for your help, any input is highly appreciated!
Best, Martin
Dear Genomax,
That was not exactly what I thought of, but in fact getting the transcript sequences from GENCODE and creating a local blast db solved the problem and was no problem at all - thank you!
Best, Martin