Question: Subject Sequence Id from Blast results
0
gravatar for zgayk
3.5 years ago by
zgayk90
United States
zgayk90 wrote:

 

Hello,

 

I'm just learning how to use blast from the command line. I used blastn to search contigs in our genome assembly against a local database of the chicken genome. I ran two separate runs one with -outfmt 0 (pairwise) and -outfmt 7 (tabular with comment lines).

What I would like to do is extract the dna alignments of each query and subject sequence and put these in a file with query and subject sequence ids. Becuase I am very new to blast from the command line I have two questions:

(1) Instead of extracting the pairwise alignments from the outfmt 0 and  sequence ids from outfmt 7, would it be easier to make some custom blast command that returns both pieces of information in one database? If so any suggestions on how to do this?

(2) Results from both blast runs only show numbers (e.g. 20) for the subject sequence ids matching to our assembly in the chicken genome. The fasta formatted chicken assembly I am using does have a separate .gff file that does have subject sequence ids, but I can't figure out how to add the .gff file to the chicken blast database I made. I don't think the makeblastdb command supports .gff files so I don't know how to extract the subject ids from this. Any ideas you have are greatly appreciated. 

Thanks,

Zach

 

 

 

blast • 2.1k views
ADD COMMENTlink written 3.5 years ago by zgayk90

Sorry zgayk, but I am not sure of want are you trying to do. 

You want to recover information which is in the outfmt 0 OR outfmt 7 but not both? What is the purpose of your data extraction? (to understand better) :)

ADD REPLYlink written 3.5 years ago by glihm590

Hello,

On thinking about it, I am only concerned with the second question in my original post. I have blast results in the outmt 0 that I want to extract all of the aligned query and subject sequences from. I also need the sequence IDS so that I know what each query and subject sequences refers to. Ideally, I would like to get the gene name for the subject sequences becuase the main point of this blast run is to retrieve orthologous gene sequences matching between our genome assembly of a different bird species and the chicken genome (subject match). The main problem has been the way the subject matches in the chicken genome assembly are labelled. In the outfmt 0 results, subject matches are simply labelled with a number (chromosome), making them not useful at all to see which chicken gene my scaffolds are matching to (in this case the subject match is 26):

Query= scaffold_0

 

Length=1506

                                                                      Score     E

Sequences producing significant alignments:                          (Bits)  Value

 

26                                                                    300     5e-79

 

 

>26

Length=5102438

 

 Score = 300 bits (332),  Expect = 5e-79

 Identities = 447/634 (71%), Gaps = 65/634 (10%)

 Strand=Plus/Minus

 

Query  304     TGTCTTAAATAGGCACTTGGTATGGAAGAAACTTGTGCATACACCCTAGTGTGGCAGGGG  363

               ||||  |||| |||| | || |||||||| ||||||||||||| |||||    |||||||

Sbjct  256300  TGTCCCAAATGGGCAATGGGCATGGAAGACACTTGTGCATACAGCCTAGCA--GCAGGGG  256243

 

Query  364     AAGAGCTGGAAACAGGACTTACGTCTTCAGACATGCTCAAACACATTCCTCCTCCTGGGC  423

               ||||  |||||| ||| | |||||||||   |||||||||||||||||||| |||||||

Sbjct  256242  AAGAATTGGAAATAGGCCCTACGTCTTCCAGCATGCTCAAACACATTCCTCATCCTGGGT  256183

 

Query  424     CAACGCATGCCAGAGAGAGTGCATAGCACAGTGGAGGTCTTTCCCACATTGGTGCCCTGG  483

               || ||||||||||           ||||||  |   | || ||||||||  ||||    |

Sbjct  256182  CAGCGCATGCCAG-----------AGCACACAGACAGCCTCTCCCACATCAGTGCATCAG  256134

 

Query  484     AAGGGGAGTGTGGAGGGGAGAGGTGGGTGTCTCACCAGGAAGGTTGCAGGTGCAGGTGGT  543

                || |||  |||| |||||||||| |||||| |||||||              ||||| |

Sbjct  256133  GAGAGGAACGTGG-GGGGAGAGGT-GGTGTCCCACCAGG--------------AGGTGCT  256090

 

Query  544     TCTGCGCTGTGCAGGACAAATGATCCAGGGCATGTTGTTTCTCCAGCGCCTCTCTTGTGC  603

               ||||  ||| | ||||||||  |||| |||||     ||||  ||||  | |||||||||

        A more useful sequence id for the chicken genome seems to be in the associated .gff file, where for example the numer identifier on the left also has a sequence identifier such as GALT00000025000 which I can actually search for. My main question is how to parse the blast output to get meaningful sequence identifiers if blast can't search a .gff file. Is there an easier way to get the chicken subject sequence ids, or will I have to extract the information from the .gff file and then match it to the number sequence identifiers in the chicken blast output?                                                                        

2    GeneWise        mRNA    118032794    118035699    98.24   +    .    ID=GALT00000025000;

2    GeneWise        CDS     118032794    118032874    .    +    0    Parent=GALT00000025000;

2    GeneWise        CDS     118034101    118035699    .    +    0    Parent=GALT00000025000;

ADD REPLYlink written 3.5 years ago by zgayk90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1149 users visited in the last hour