Assemble Reads According To Blast Result
1
2
Entering edit mode
9.2 years ago
biolab ★ 1.4k

Dear all, I performed blast using ~200 gene sequences from reference organism against a draft genome dataset from related species ( containing 12million short reads(sr), each of which is 88bp). The following examplifies the blast result.

----------length, strand + /-, e-value ommited, some numbers are wrong---------------

sr-ID    GeneID    Iden.    Start(sr)    End(sr)    Start(gene)    End(gene)
Sxxx001    Gene1    100    1    36    2302    2334
Sxxx002    Gene1    98    1    75    313    348
Sxxx004    Gene1    100    3    43    481    519
Sxxx001    Gene2    100    8    78    2140    2172
Sxxx006    Gene2    97    2    88    280    312
Sxxx007    Gene3    100    1    56    862    897
Sxxx008    Gene3    100    6    78    2020    2055
Sxxx009    Gene3    100    5    77    3934    3972


I can get each short read sequence ranging from start(sr) to end (sr) . However, next step is troublsome; I need to assemble short reads to longer sequence according to the corresponding gene hit. e.g., need to assemble Sxxx001, Sxxx002, Sxxx004 according to Gene1; and assemble Sxxx001 and Sxxx006 according to Gene2, etc.

This analysis aims to identify the homologous genes from the draft genome data. Could anyone help to describe ways to assembling sequences according to blast result?

software assembly blast • 2.5k views
0
Entering edit mode

If I get you right, your hits do not overlap: I.e. for Gene1

Sxxx001    Gene1    100    1    36    2302    2334
Sxxx002    Gene1    98    1    75    313    348
Sxxx004    Gene1    100    3    43    481    519


this means it looks like:

pos   : 313-------------348-------------------481--------------519-----------/-/-------------- 2302-----------2334



so there is no overlap between your hits, am I wrong?

0
Entering edit mode

Hi, Phil, Sorry for my unclear description. It's not my real result, just an example. You need not to think of the start/end positions. These short reads can overlap. What I need to do is to assemble these reads according to the corresponding genes. The following is my revised table (x indicates number)

sr-ID    GeneID    Iden.    Start(sr)    End(sr)    Start(gene)    End(gene)
Sxxx001    Gene1    100    y    yy    yyy    yyyy
Sxxx002    Gene1    98    y    yy    yyy   yyy
Sxxx004    Gene1    100    y   yy  yyy    yyy
Sxxx001    Gene2    100    y    yy   yyyy   yyyy
Sxxx006    Gene2    97    y    yy    yyy    yyy
Sxxx007    Gene3    100    yy    yy   yyy  yyy
Sxxx008    Gene3    100    y    yy    yyy    yyy
Sxxx009    Gene3    100    y    yy    yyy    yyyyy


In my original post, I am wrong saying to extract reads ranging from start to end. I just need to assemble these reads according to the gene hits. Maybe use de novo approach? I will much appreciate your suggestions. THANKS.

1
Entering edit mode
9.2 years ago
Ying W ★ 4.2k

I am guessing you want to figure out if the reads from related organism align to known genes from reference organism. I don't think BLAST is the right tool at this step for you to use. You may consider doing something like this instead:

1. Assemble short reads using a dedicated assemblers such as cap3 (if your dataset is small, this paper might be useful as reference) or velvet (if your data is genome DNA, comparison paper might have more up to date methods)
2. Then BLAST the 200 genes of interest from reference organism against the assembled reads (contigs)

This way you can look at differences in genes in your related organism versus the reference organism and have less reads that overlap genes multiple times.

0
Entering edit mode

THANKS a lot for your advices. However, i still think the blast method is feasible. First, the draft genome data contains 12 million reads, all of which are 88 bp. This length is suitable for Blast. Second, I only focus on ~200 genes. Extracting reads homologous to these genes is easy. Then I can perform de novo assembling using these reads. I am not arguing, just for sienctific discussion. What's your ideas? THANKS!

0
Entering edit mode

The reason why I would not start w/BLAST is that the downstream stuff gets a bit complicated. After BLAST, you now have many reads that overlap your genes of interest but you might want to know a) where in the gene does the reads overlap b) are there regions that the reads have that the genes don't c) are there regions that the genes have that the reads don't. You can try to connect the BLAST results to give you this information but I think it could be more complicated than just running a denovo assembler. Another thing you can try is run a sequence mapper (like bwa or bowtie) and use your 200 genes as reference sequence (200 chromosomes) and map/overlap your reads with that. It would make for easier visualization too.

0
Entering edit mode

THANKS! It's really helpful. One more point I need to clarify. You said b) are there regions that the reads have that the genes don't c) are there regions that the genes have that the reads don't. Can this be done by mapping the reads, which have been obtained after BLAST, to the 200 genes? For your towtie mapping method, I just worry some ture reads homologous to the 200 genes may be omitted (because these reads are 88 bp, 8 mismatch may still be homologous, can bowtie tolerate it? BLAST can). In contrast BLAST can't omit any homologous reads. Is that true? What's ideas? Really THANKS!

0
Entering edit mode

from bowtie2 manual

-N <int>

Sets the number of mismatches to allowed in a seed alignment during multiseed alignment. Can be set to 0 or 1. Setting this higher makes alignment slower (often much slower) but increases sensitivity. Default: 0.


from bowtie manual:

The -n alignment mode

When the -n option is specified (which is the default), bowtie determines which alignments are valid according to the following policy, which is similar to Maq's default policy.

Alignments may have no more than N mismatches (where N is a number 0-3, set with -n) in the first L bases (where L is a number 5 or greater, set with -l) on the high-quality (left) end of the read. The first L bases are called the "seed".


8 Mismatches isn't doable with Botwie. I'm not too sure but I think all, at least the vast majority of short read mappers do not allow 8 mismatches!

Do you need a bash/shell script or something in a particular language?

0
Entering edit mode

My current plan is to perform blast firstly (this can identify homologous reads), and then carry out de novo assembly by cap3. It's not referece species, so de novo appraoch is suitable i think. If anyone has any comments and suggestions, pls feel free to present. THANKS.

0
Entering edit mode

I am not sure how bowtie will perform if you are looking for so many mismatches. How bowtie works is that it will take 20bp out of your 88bp read (seed) and try and match that somewhere. The 20bp chosen must match w/very little differences (that's what the -n command is for). You could end up w/many differences in the whole read but the seed cannot have too many. bowtie will choose seeds multiple times for each read. There are many parameters here that could be adjusted. The b) can be inferred from blast but is messy because you need to keep track of what part of each read never matches BLAST (on that note, say one end of your 88bp reads matches but the other end doesn't, unless you assemble you don't know how long the region that doesn't match is). for c) it is doable w/BLAST but you would need to go through every single result and keep track of regions that are or aren't overlapped. one thing that you might be interested in is looking for mutations between the novel and reference organism and BLAST will show this info in the alignment, but parsing through all that is tricky.

0
Entering edit mode

Hi Ying W, THANKS a lot for your suggestions and comments, which are very helpful. I will try bowtie for reference based assembling to see how things work.