Question

Aligning multiple nucleotide sequences to multiple protein sequences

9.4 years ago

Kurban ▴ 230

I have two FASTA files: one contains nucleotide sequences (more than 100,000) and the other contains polypeptide sequences (more than 1,000). I want to find the nucleotide sequences that can be aligned to the protein sequences in the protein file.
I am new to this, so any suggestion with a little detail would be appreciated.

alignment • 2.9k views

updated 2.1 years ago by Ram 43k • written 9.4 years ago by Kurban ▴ 230

score 0 · Answer 1 · 2014-12-14

9.4 years ago

Ram 43k

You might just want to create a database with makeblastdb out of one file and BLAST the other against it, with blastx or tblastn depending on which file is the db and which is the query.

ADD COMMENT • link 2.1 years ago by Ram 43k

Ram · Answer 2 · 2014-12-14

9.4 years ago

Pierre Lindenbaum 161k

Use blast. http://www.ncbi.nlm.nih.gov/books/NBK1763/

Build a BLAST database from your protein sequence file with makeblastdb.

Search the new database with blastx ("The 'blastx' application translates a nucleotide query in six frames and searches it against a protein database.")
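
The two steps above could look roughly like the sketch below. The file names gene.fa and TFs.fasta are the ones used in the command posted later in this thread; the database name TFs_db and the variable and function names are only placeholders chosen for illustration. Nothing is run unless BLAST+ and both input files are present.

```shell
# File names taken from the thread; DB_NAME is a made-up placeholder.
PROTEINS="TFs.fasta"   # protein sequences (become the database)
QUERIES="gene.fa"      # nucleotide sequences (the query)
DB_NAME="TFs_db"       # hypothetical database name
RESULTS="tf.blastx"    # output file

# Step 1: build a protein BLAST database from the protein FASTA file.
build_db() {
  makeblastdb -in "$PROTEINS" -dbtype prot -out "$DB_NAME"
}

# Step 2: translate each nucleotide query in six frames and search the database.
run_search() {
  blastx -query "$QUERIES" -db "$DB_NAME" -out "$RESULTS"
}

# Only run when BLAST+ is installed and both input files exist.
if command -v makeblastdb >/dev/null 2>&1 && [ -f "$PROTEINS" ] && [ -f "$QUERIES" ]; then
  build_db && run_search
else
  echo "BLAST+ or the input files were not found; nothing was run." >&2
fi
```

If the roles were swapped (nucleotide file as the database, proteins as the query), the database would be built with -dbtype nucl and searched with tblastn instead.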

updated 2.1 years ago by Ram 43k • written 9.4 years ago by Pierre Lindenbaum 161k

Ram · Answer 3 · 2014-12-15

First of all, I want to thank @RamRS and @Pierre Lindenbaum.

You have given me very useful suggestions several times here, thank you.

I ran blastx with my query file (more than 140,000 nucleotide sequences) against the db file (more than 1,400 polypeptide sequences) and got my result, but the generated file is around 1.4 GB. I checked the BLAST result, and it shows that most of the query sequences aligned to at least one of the db sequences (but not all of them with a good E-value and score):

Query= comp1896_c0_seq1 len=2039 path=[0:0-259 2272:260-284 285:285-2038]

Length=2039
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

  FBpp0083843 FBgn0028684 symbol:Tbp-1 family:Transcription Cofac...   152    5e-41
  FBpp0081704 FBgn0040078 symbol:pont family:Chromatin Remodeling...  44.3    1e-05
  FBpp0074756 FBgn0040075 symbol:rept family:Chromatin Remodeling...  33.5    0.037
  FBpp0099511 FBgn0004913 symbol:Gnf1 family:Transcription Cofact...  31.2    0.22
  Tribolium_TF472                                                     26.9    2.9
  Tribolium_TF80                                                      26.9    3.7

I think the first two hits might be the results I want, right? If so, how can I separate those results from the poor ones?
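
One possible way to screen hits after the search, sketched under some assumptions: the search is re-run with tabular output (-outfmt 6), whose default columns put the E-value in column 11 and the bit score in column 12. The thresholds (E-value at most 1e-5, bit score at least 50), the function name filter_hits and the file names tf.blastx.tsv and tf.filtered.tsv are illustrative choices, not recommendations from this thread.

```shell
# Keep only tabular BLAST hits (-outfmt 6) that pass both thresholds.
# $1 = maximum E-value, $2 = minimum bit score; reads stdin, writes stdout.
filter_hits() {
  awk -F '\t' -v max_e="$1" -v min_bits="$2" \
    '($11 + 0) <= (max_e + 0) && ($12 + 0) >= (min_bits + 0)'
}

# Hypothetical file names; the filter runs only if the tabular file exists.
if [ -f tf.blastx.tsv ]; then
  filter_hits 1e-5 50 < tf.blastx.tsv > tf.filtered.tsv
fi
```

With cutoffs like these, a hit such as the Tbp-1 one above (score 152, E-value 5e-41) would be kept, while the Tribolium hits with E-values around 3 would be dropped. Where exactly to draw the line depends on the data.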

kurban@kurban-X550VC:~/Desktop/tf$ blastx -help
USAGE
  blastx [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-max_hsps_per_subject int_value] [-max_intron_length length]
    [-seg SEG_options] [-soft_masking soft_masking] [-matrix matrix_name]
    [-threshold float_value] [-culling_limit int_value]
    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
    [-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]
    [-strand strand] [-parse_deflines] [-query_gencode int_value]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-html] [-max_target_seqs num_sequences]
    [-num_threads int_value] [-remote] [-comp_based_stats compo]
    [-use_sw_tback] [-version]

If it can be done by changing this command:

blastx -query gene.fa -out tf.blastx -db TFs.fasta

how should I change it?
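
A possible variant of that command, as a sketch only: the options -evalue, -outfmt and -num_threads all appear in the usage text above, but the values (1e-5, format 6, 4 threads) and the output name tf.blastx.tsv are assumptions picked for illustration. A stricter -evalue drops weak hits at search time, and tabular output is usually much smaller than the default pairwise report.

```shell
# Same query and db as the original command; the extra options limit the output.
# The values are illustrative assumptions, not settings confirmed in this thread.
BLAST_ARGS="-query gene.fa -db TFs.fasta -out tf.blastx.tsv -evalue 1e-5 -outfmt 6 -num_threads 4"

# Run only when blastx is installed and the query file exists.
if command -v blastx >/dev/null 2>&1 && [ -f gene.fa ]; then
  # Intentionally unquoted so the string splits into separate arguments.
  blastx $BLAST_ARGS
else
  echo "Would run: blastx $BLAST_ARGS"
fi
```

Each line of the resulting file is one hit, so it can be sorted or filtered further with standard text tools.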