BLAST arguments to restrict db and get accession IDs
1
0
Entering edit mode
3.5 years ago

Dear community,

I have a list of approximately 1500 RNA sequences (from Sus scrofa) that I want to BLAST against the human refseq_rna database and since the number is more or less big, I cannot use NCBI's web GUI but I have to do that locally. My goal is to get the best matching human mRNA Accession number for each pig RNA I query. Unfortunately, I struggle with thres questions:

  1. Which DB from ftp.ncbi.nlm.nih.gov/blast/db should I download?
  2. How do I restrict my BLAST to human RNA hits only?
  3. How can I get the accession number of my hit?

In the example below you can see that I managed to run BLAST, but I struggeling with the best DB to download and the best parameters to run BLAST. Ideally, my result would include not only scoring values, but also the accession number of the human RNA that matches my pig RNA query best.

blastn = "/usr/bin/blastn" blast_db = "/mnt/sdb/projects/genomic_data/blast/human/GCF_000001405.38_top_level" input = "data/fasta_for_blast.fas" evalue = 1e-6 format = 6 colnames <- c("qseqid",
              "sseqid",
              "pident",
              "length",
              "mismatch",
              "gapopen",
              "qstart",
              "qend",
              "sstart",
              "send",
              "evalue",
              "bitscore")

blast_out <- system2(command = "blastn", 
                     args = c("-db", blast_db, 
                              "-query", input, 
                              "-outfmt", format, 
                              "-evalue", evalue,
                              "-num_threads", 18,
                              "-ungapped"),
                     wait = TRUE,
                     stdout = TRUE) %>%   as_tibble() %>%    separate(col = value, 
           into = colnames,
           sep = "\t",
           convert = TRUE) blast_out

> blast_out
# A tibble: 5 x 12
  qseqid sseqid       pident length mismatch gapopen qstart  qend    sstart      send   evalue bitscore
  <chr>  <chr>         <dbl>  <int>    <int>   <int>  <int> <int>     <int>     <int>    <dbl>    <dbl>
1 Test1  NC_000006.12   94.5    183       10       0    762   944 170569597 170569779 9.97e-77      294
2 Test1  NC_000006.12   97      100        3       0    942  1041 170571407 170571506 7.65e-41      175
3 Test1  NC_000006.12   95.6     68        3       0   1482  1549 170572632 170572699 2.54e-22      114
4 Test1  NC_000006.12  100       35        0       0   1758  1792 170572908 170572942 1.98e- 8       68
5 Test2  NC_000006.12   98      100        2       0   1604  1703 169704007 169704106 1.96e-42      181

(adapted from https://palfalvi.org/post/local-blast-from-r/)

Thank you so much for your help, any input is highly appreciated!

Best, Martin

blast R • 615 views
ADD COMMENT
2
Entering edit mode
3.5 years ago
GenoMax 141k

You can use MANE select data that can be found at the link mentioned on this page. Read about the MANE project while you are there and see if that fits your needs.

You can also consider getting the transcript sequences from GENCODE to create a local blast database.

ADD COMMENT
0
Entering edit mode

Dear Genomax,

That was not exactly what I thought of, but in fact getting the transcript sequences from GENCODE and creating a local blast db solved the problem and was no problem at all - thank you!

Best, Martin

ADD REPLY

Login before adding your answer.

Traffic: 3185 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6