Question: BLAST arguments to restrict db and get accession IDs
0
gravatar for martin.busch
4 months ago by
martin.busch0 wrote:

Dear community,

I have a list of approximately 1500 RNA sequences (from Sus scrofa) that I want to BLAST against the human refseq_rna database and since the number is more or less big, I cannot use NCBI's web GUI but I have to do that locally. My goal is to get the best matching human mRNA Accession number for each pig RNA I query. Unfortunately, I struggle with thres questions:

  1. Which DB from ftp.ncbi.nlm.nih.gov/blast/db should I download?
  2. How do I restrict my BLAST to human RNA hits only?
  3. How can I get the accession number of my hit?

In the example below you can see that I managed to run BLAST, but I struggeling with the best DB to download and the best parameters to run BLAST. Ideally, my result would include not only scoring values, but also the accession number of the human RNA that matches my pig RNA query best.

blastn = "/usr/bin/blastn" blast_db = "/mnt/sdb/projects/genomic_data/blast/human/GCF_000001405.38_top_level" input = "data/fasta_for_blast.fas" evalue = 1e-6 format = 6 colnames <- c("qseqid",
              "sseqid",
              "pident",
              "length",
              "mismatch",
              "gapopen",
              "qstart",
              "qend",
              "sstart",
              "send",
              "evalue",
              "bitscore")

blast_out <- system2(command = "blastn", 
                     args = c("-db", blast_db, 
                              "-query", input, 
                              "-outfmt", format, 
                              "-evalue", evalue,
                              "-num_threads", 18,
                              "-ungapped"),
                     wait = TRUE,
                     stdout = TRUE) %>%   as_tibble() %>%    separate(col = value, 
           into = colnames,
           sep = "\t",
           convert = TRUE) blast_out

> blast_out
# A tibble: 5 x 12
  qseqid sseqid       pident length mismatch gapopen qstart  qend    sstart      send   evalue bitscore
  <chr>  <chr>         <dbl>  <int>    <int>   <int>  <int> <int>     <int>     <int>    <dbl>    <dbl>
1 Test1  NC_000006.12   94.5    183       10       0    762   944 170569597 170569779 9.97e-77      294
2 Test1  NC_000006.12   97      100        3       0    942  1041 170571407 170571506 7.65e-41      175
3 Test1  NC_000006.12   95.6     68        3       0   1482  1549 170572632 170572699 2.54e-22      114
4 Test1  NC_000006.12  100       35        0       0   1758  1792 170572908 170572942 1.98e- 8       68
5 Test2  NC_000006.12   98      100        2       0   1604  1703 169704007 169704106 1.96e-42      181

(adapted from https://palfalvi.org/post/local-blast-from-r/)

Thank you so much for your help, any input is highly appreciated!

Best, Martin

blast R • 166 views
ADD COMMENTlink written 4 months ago by martin.busch0
2
gravatar for GenoMax
4 months ago by
GenoMax96k
United States
GenoMax96k wrote:

You can use MANE select data that can be found at the link mentioned on this page. Read about the MANE project while you are there and see if that fits your needs.

You can also consider getting the transcript sequences from GENCODE to create a local blast database.

ADD COMMENTlink written 4 months ago by GenoMax96k

Dear Genomax,

That was not exactly what I thought of, but in fact getting the transcript sequences from GENCODE and creating a local blast db solved the problem and was no problem at all - thank you!

Best, Martin

ADD REPLYlink modified 3 months ago • written 3 months ago by martin.busch0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2048 users visited in the last hour
_