Question

Parsing blastn (or blastx) results as transcript > hit name

2

Entering edit mode

7.8 years ago

Farbod ★ 3.4k

Hi, I have created a fasta file of 100 of my interest genes and I have used my transcriptome assembly fasta file as blastable database and have run blastn to check that how many of them (my transcripts) have similarity (with my collected genes) and which of them to which transcriptome.

some of my genes of interest have different homologous in my fasta file (e.g sox9 of human, sox9 of zebrafish, sox9 of turtle), so it is possible that one transcipt show hits with several genes.

is there any way to just get the blast results as following :

transctipt(s) name -------------------------> hit(s) name

e.g:

1- Trinity_DNXXXX1 --------------------------- Danio rerio sox9a), and Homo sapience SOX3

2- Trinity_DNXXXX2 --------------------------- Homo sapience SRY (sex determining region Y)-box 9a (sox9a), mRNA

3- Trinity_DNXXXX2 and Trinity_DNXXXX5 ----------------------------- Danio rerio SRY (sex determining region Y)-box 9a (sox9a), mRNA

blast • 1.7k views

ADD COMMENT • link 7.8 years ago by Farbod ★ 3.4k

score 3 · Answer 1 · 2016-07-14

3

Entering edit mode

7.8 years ago

dschika ▴ 320

Have you tried tabular output format? See -outfmt in: blastn -help

From the tabular output it's not too far to the output format you want to have.

ADD COMMENT • link 7.8 years ago by dschika ▴ 320

0

Entering edit mode

Hi dear Dschika, yes I have tried it but the results is as below :

gi|190338053|gb|BC162638.1| ----- TR170077|c1_g1_i6 79.63 535 97 6 748 1279 2599 3124 3e-102 374 gi|115313651|gb|BC124087.1| ------ TR179410|c3_g1_i1 100.00 30 0 0 1713 1742 6 35 2e-06 56.5 gi|666440464|gb|KF601249.1|------- TR123032|c5_g1_i5 96.15 78 3 0 8 85 991 1068 4e-28 128 gi|115432022|ref|NM_001039634.2| ------ TR179410|c3_g1_i1 100.00 30 0 0 1713 1742 6 35 2e-06

which you can see that the name of the genes (e.g SOX9 - the bold part) is not shown and each time I must search for the ID of genes in my Genes.fasta file one by one ! (the TR** is my transcript ID)

ADD REPLY • link 7.8 years ago by Farbod ★ 3.4k

1

Entering edit mode

Have you tried something like:

-outmt '6 stitle qseqid'

ADD REPLY • link 7.8 years ago by dschika ▴ 320

1

Entering edit mode

Dear Dschika, Thanks for your help,

I tried it but the results is again as :

TRINITY_DN158974_c0_g1_i1 ----------- gi|283137712|gb|GQ924783.1|

and the gene name of the "gi|283137712|gb|GQ924783.1|" id is as below:

gi|283137712|gb|GQ924783.1| Dicentrarchus labrax insulin growth factor 1 gene, complete cds

which "Dicentrarchus labrax insulin growth factor 1" is important for me.

so , I want an output like this:

TRINITY_DN158974_c0_g1_i1 --------------------- Dicentrarchus labrax insulin growth factor 1

or even:

TRINITY_DN158974_c0_g1_i1 --------------------- gi|283137712|gb|GQ924783.1| Dicentrarchus labrax insulin growth factor 1 gene, complete cds

And this is my script :

blastn -query '/home/All--nuclotides-ncbi-original-full.txt' -db Trinity.fasta -out blastn-all-test-qseqid.outfmt6 -evalue 1e-6 -num_threads 24 -max_target_seqs 1 -outfmt '6 stitle qseqid'

long to hear from you!

ADD REPLY • link 7.8 years ago by Farbod ★ 3.4k

1

Entering edit mode

ah, I mistook what is the subject and what is the query. As you wrote in your first post, you made a blastdb with the transcripts (the TRINITY sequences) and you blast your genes against that db. Try it the other way round: make a blastdb with your genes of interest and blast your transcripts against that db. Then outfmt '6 stitle qseqid' should show you e.g., "Dicentrarchus labrax insulin growth factor 1 TRINITY_DN158974_c0_g1_i1".

ADD REPLY • link 7.8 years ago by dschika ▴ 320

1

Entering edit mode

Hi, I really appreciate your help, Is there any way to just do it vice versa? Because my Trinity assembly file is a fixed file (so I can make it a database) but my genes of interest collection may change several time every day (and each time I must run another round of "make blast database" procedure) !

ADD REPLY • link 7.8 years ago by Farbod ★ 3.4k

1

Entering edit mode

Hm...It's just a guess, but: You could replace spaces with '_' in your genes_of_interest file (using e.g., sed). If you then use -outfmt '6 sseqid qseqid' your output should look like:

TRINITY_DN158974_c0_g1_i1 --------------------- gi|283137712|gb|GQ924783.1|_Dicentrarchus_labrax_insulin_growth_factor_1_gene,_complete_cds

ADD REPLY • link 7.8 years ago by dschika ▴ 320

1

Entering edit mode

Hi dear, it works for me, thank you!

By the way is there any way (some manual or topic link) to learn the options similar to "6 stitle qseqid" about ncbi blast ?

ADD REPLY • link 7.8 years ago by Farbod ★ 3.4k

1

Entering edit mode

Great!

At NCBI there is something. I guess there is also a section somewhere describing the outfmt option.

ADD REPLY • link 7.8 years ago by dschika ▴ 320