Hello everyone,
I have a question about BLAST, which I admit I do not fully understand.
I have been asked to run blastx with an .fsa file of Arabidopsis thaliana sequences against an oak gene model, in order to see whether there are any matching sequences between the two species.
My data is formatted like this:
Reference.fsa
>Qrob_P00010.2 69
ATGTCTGGCCCTGAAAA........
Arabidopsis FASTA file:
>AT3G25210.1 | Symbols: | Tetratricopeptide repeat (TPR)-like superfamily protein | chr3:9180348-9181487 FORWARD LENGTH=1140
ATGTCGGCGACACTCCGACGCCTCATTCTTCTCACC..............
To run blastx, I first built a protein database from my reference and then ran the search, with these commands:
"makeblastdb -in ref.fsa -dbtype prot -blastdb_version 5 -parse_seqids"
"blastx -query fasta_arabido.fsa -db ref.fsa -out ara.txt "
However, when I run blastx, I get no hits:
Query= AT3G25210.1 | Symbols: | Tetratricopeptide repeat (TPR)-like
superfamily protein | chr3:9180348-9181487 FORWARD LENGTH=1140
Length=1140
***** No hits found *****
Lambda K H a alpha
0.318 0.134 0.401 0.792 4.96
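Since blastx searches a protein database, one thing worth checking at this point is which alphabet ref.fsa actually contains. Below is a minimal Python sketch of such a check. The function name `guess_alphabet` and the 0.9 threshold are my own assumptions for illustration and are not part of BLAST.

```python
def guess_alphabet(fasta_text, threshold=0.9):
    """Guess whether FASTA records look like nucleotide or protein sequences.

    Heuristic only: measures the fraction of sequence letters that are
    A, C, G, T, U or N. The 0.9 threshold is an arbitrary assumption.
    """
    nucleotide_letters = set("ACGTUN")
    total = 0
    nucleotide_count = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        for ch in line.upper():
            # Ignore dots, gaps, digits and other non-letter characters.
            if ch.isalpha():
                total += 1
                if ch in nucleotide_letters:
                    nucleotide_count += 1
    if total == 0:
        return "unknown"
    return "nucl" if nucleotide_count / total >= threshold else "prot"


# First record of my reference, truncated exactly as shown above.
reference = ">Qrob_P00010.2 69\nATGTCTGGCCCTGAAAA........\n"
print(guess_alphabet(reference))  # prints "nucl"
```

If the check reports "nucl" for the reference, the file contents do not match `-dbtype prot`, which might be related to the empty blastx result, although I am not certain that this is the cause.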
I saw that I could try tblastx instead, this time building my reference as a nucleotide database, and I get the results below, which seem correct:
"makeblastdb -in ref.fsa -dbtype nucl -blastdb_version 5 -parse_seqids"
"tblastx -query fasta_arabido.fsa -db ref.fsa -out ara.txt "
ara.txt
Query= AT3G25210.1 | Symbols: | Tetratricopeptide repeat (TPR)-like
superfamily protein | chr3:9180348-9181487 FORWARD LENGTH=1140
Length=1140
Score E
Sequences producing significant alignments: (Bits) Value N
Qrob_P0702440.2 1323 514 9e-146 1
>Qrob_P0702440.2 1323
Length=1323
Score = 514 bits (1116), Expect = 9e-146
Identities = 204/312 (65%), Positives = 259/312 (83%), Gaps = 0/312 (0%)
Frame = +1/+1
Query 157 RTRTPLETQFETWIQNLKPGFTNSDVVIALRAQSDPDLALDIFRWTAQQRGYKHNHEAYH 336
R++T LETQFETW+QNLKPGFT SDV L +QSDPDLALD+FRWT QRGY H H Y
Sbjct 190 RSKTQLETQFETWVQNLKPGFTPSDVEHTLWSQSDPDLALDLFRWTTLQRGYTHTHATYF 369
Query 337 TMIKQAITGKRNNFVETLIEEVIAGACEMSVPLYNCIIRFCCGRKFLFNRAFDVYNKMLR 516
T+IK ++ KR ETLIEEV++GAC++++PLYN II+FCC ++ LFNRAFDVY KM
Sbjct 370 TIIKILVSNKRYGLAETLIEEVLSGACDINIPLYNYIIKFCCDKRSLFNRAFDVYKKMYN 549
Query 517 SDDSKPDLETYTlllssllKRFNKLNVCYVYLHAVRSLTKQMKSNGVIPDTFVLNMIIKA 696
S++ KP+L+TY++L + LL+RFNKLNVCY+YL + +SL+KQMK+ GVIPDT+VLNMIIKA
Sbjct 550 SENCKPNLQTYSMLFNLLLRRFNKLNVCYMYLQSAKSLSKQMKAAGVIPDTYVLNMIIKA 729
Query 697 YAKCLEVDEAIRVFKEMALYGSEPNAYTYSYLVKGVCEKGRVGQGLGFYKEMQVKGMVPN 876
Y+KCLEVDEAIRVF+EM LYG EPNAYTY Y+VKG+CEKGRVGQG GFY+EM+ KG+VP+
Sbjct 730 YSKCLEVDEAIRVFREMGLYGCEPNAYTYGYMVKGLCEKGRVGQGFGFYEEMKGKGLVPS 909
Query 877 GSCYMVLICSLSMERRLDEAVEVVYDMLANSLSPDMLTYNTVLTELCRGGRGSEALEMVE 1056
S YM+LICSL++ERR ++A+ VV+DML N + PD+LTY T+L LCR GRG+EA E+++
Sbjct 910 SSSYMILICSLALERRFEDAIGVVFDMLGNFMGPDLLTYKTLLEGLCREGRGNEAFELLD 1089
Query 1057 EWKKRDPVMGER 1092
E +KRD M E+
Sbjct 1090 ELRKRDRSMSEK 1125
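The line "Frame = +1/+1" in this output indicates that both the query and the subject were translated in reading frame +1 before being compared as amino acids. Below is a minimal Python sketch of that kind of translation, assuming the standard genetic code. `translate` is a hypothetical helper written for illustration only and is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=1):
    """Translate a DNA string in forward reading frame 1, 2 or 3.

    Codons containing anything other than A, C, G, T become "X".
    A trailing partial codon is ignored. Stop codons are shown as "*".
    """
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    seq = dna.upper()[frame - 1:]
    protein = []
    for i in range(0, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


# Visible start of the first reference record (the rest is truncated above).
print(translate("ATGTCTGGCCCTGAAAA"))           # prints "MSGPE"
print(translate("ATGTCTGGCCCTGAAAA", frame=2))  # prints "CLALK"
```

This sketch only covers the three forward frames. BLAST also considers the three frames on the reverse strand, which are left out here to keep the example short.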
I don't understand the real difference between blastx and tblastx. One searches a protein database and the other a nucleotide database, but is it because my reference file contains nucleotide sequences that I could not make a protein database from it?
In your opinion, did I do the right thing?
Thank you in advance for your answer.
Have a nice day,
Aka
Thank you very much, it's really clear now!