How to align paired-end reads to a DNA sequence?
2
0
7.6 years ago
midox ▴ 290

How do I align short paired-end reads to a sequence when the reads come in two files, i.e. two sequences that together represent one read pair?

Example:

Sequence:   AGTGGTAGCTGCCTG
               |||||||
               CCATCGA


In computer science terms, how is this alignment done?

It is one string (the sequence) aligned with two strings of characters (the short reads).

Thank you

sequence-analysis alignment Assembly • 3.3k views
0

I want to write a program that takes a paired-end short read and a DNA sequence as input and aligns them. I know how to align two sequences, but here I have three: the two reads of the pair and the DNA sequence. So I do not know how to set up the alignment.

Do you have any idea?

Thank you

0

As tsr640 said, the example reads you gave are reverse complements of each other, not paired-end reads. If that's correct, you could just align them to the reference using something like the Smith-Waterman or Needleman-Wunsch algorithms. Both of those wiki pages give good examples of how sequence alignments work computationally, and then you could just try it out yourself, either coding it in your favourite language or perhaps using one of the implementations from the EBI.
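To make the Smith-Waterman suggestion concrete, here is a minimal Python sketch of local alignment with a linear gap penalty. The function name and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are illustrative assumptions, not taken from this thread or from any particular tool.

```python
def smith_waterman(ref, read, match=2, mismatch=-1, gap=-2):
    """Best local alignment of read against ref.

    Returns (score, aligned_ref, aligned_read, ref_start), where ref_start
    is the 0-based position in ref at which the aligned region begins.
    """
    rows, cols = len(read) + 1, len(ref) + 1
    # score[i][j] = best score of a local alignment ending at read[i-1], ref[j-1]
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # read base aligned to a gap in ref
            left = score[i][j - 1] + gap    # ref base aligned to a gap in read
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    aligned_ref, aligned_read = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
        if score[i][j] == diag:
            aligned_ref.append(ref[j - 1])
            aligned_read.append(read[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append("-")
            aligned_read.append(read[i - 1])
            i -= 1
        else:
            aligned_ref.append(ref[j - 1])
            aligned_read.append("-")
            j -= 1
    return best, "".join(reversed(aligned_ref)), "".join(reversed(aligned_read)), j


print(smith_waterman("AGTGGTAGCTGCCTG", "GGTAGCT"))
```

With the sequence from the question, the read `GGTAGCT` aligns without gaps starting at 0-based position 3. Real aligners avoid filling the full matrix for every read (they use an index to find candidate regions first), so treat this only as a way to see the recurrence at work.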

For paired-end reads, the actual alignment will probably be similar, but there may be checks for distance between pairs, orientation of reads, that sort of thing, because you are expecting the pairs to be coming from the same short fragment of the sample DNA.

0

Thank you for your explanation, but in reality they are paired-end reads.

So there are pairs of reads, and my data comes in pairs. Do you have an idea for aligning paired-end reads?

Thank you

1

Here's a naive approach that may help you understand what others have already mentioned. I think there are more issues that you need to be aware of when looking at this problem. Suppose you have one pair of paired-end reads (two different sequences). You can align each sequence separately to your reference DNA sequence. After both have been aligned, check whether the distance between the two alignments is below some threshold that you specify (the expected distance follows from the insert size of your library). For example, if the two aligned reads are fewer than 100 bases apart, count that as a good alignment. There are smarter ways to do this (as explained by tsr640), but my approach is probably the most basic if you have to implement something for a class. Otherwise, look around the internet and read about various read aligners (Bowtie, BWA, Tophat, STAR, to name a few). Pick your favorite and learn how to use it well.
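The naive approach above could be sketched in Python roughly as follows. Exact string matching (`str.find`) stands in for a real alignment step, both reads are assumed to be given in the orientation of the reference, and the names `find_all`, `naive_pair_alignment` and `max_distance` are made up for this illustration.

```python
def find_all(ref, read):
    """Return every 0-based start position where read occurs exactly in ref."""
    hits = []
    start = ref.find(read)
    while start != -1:
        hits.append(start)
        start = ref.find(read, start + 1)
    return hits


def naive_pair_alignment(ref, read1, read2, max_distance=100):
    """Place each read independently, then keep placements that lie close together.

    Returns a list of (pos1, pos2, distance), where distance is the number of
    reference bases between the two reads (negative if they overlap).
    """
    pairs = []
    for pos1 in find_all(ref, read1):
        for pos2 in find_all(ref, read2):
            # Order the two placements by position, whichever read comes first.
            left, right = sorted([(pos1, len(read1)), (pos2, len(read2))])
            distance = right[0] - (left[0] + left[1])
            if distance < max_distance:
                pairs.append((pos1, pos2, distance))
    return pairs


reference = "AGTGGTAGCTGCCTGCCATCGA"
print(naive_pair_alignment(reference, "AGTGGTA", "CCATCGA"))
```

Swapping `find_all` for a scoring aligner (and allowing mismatches) would be the natural next step; the pairing logic stays the same.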

0

Aligning two sequences is simple; I can do that.

But my problem is how to align a paired-end read to a sequence, for example:

Sequence:   AGTGGTAGCTGCCTG
               |||||||
               CCATCGA


Here I have a sequence and one paired-end read.

I do not understand how to align paired reads to a sequence.

Thank you

0

What do you want to know specifically? What type of algorithm is used for alignment? How you write a tool using that algorithm? What tools can do it? What the difference is between aligning paired end vs. single end? How that translates to alignment?

I can maybe explain a little better, or add some sort of figure. No one will give you a custom-made script or tool, and asking the same question again doesn't really help us understand what you want to know. Please explain some more what your hurdles are, what you really want to accomplish, and why you do not want to use existing tools.

0

I think you have not understood my question.

I do not need a script, and I know there are alignment tools. What I am looking for is how to do a simple alignment of two sequences when one of them is a paired-end read.

Do I take the first strand of the paired-end read, or something else, to make the alignment?

Because an alignment is defined between two sequences.

Thanks

0

If you want to know about algorithms used for alignment, read the wiki links I posted previously and Google "sequence alignment tutorial" or something.

You might create less confusion over the question if you presented your reads differently. This:

GGTAGCT
|||||||
CCATCGA


looks like you're representing an alignment or complementary base pairing (particularly since, you know, they are reverse complements...). Perhaps something more like

read 1: GGTAGCT
read 2: CCATCGA


would have made it more obvious?

2
7.6 years ago
Lesley Sitter ▴ 580

What do you want to do with it? Is it an assignment for school? Or do you want to create a SAM file? In that case Bowtie2 can handle paired-end reads just fine:

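# Build the Bowtie2 index from your reference sequence (FASTA)
bowtie2-build <YOUR_SEQUENCE> <any_name_will_do_database>
# Align the read pairs, convert to BAM and sort (sort syntax for samtools 1.3 or newer)
bowtie2 -p 10 -x <any_name_will_do_database> -1 <Short-Read1> -2 <Short-Read2> | samtools view -b - | samtools sort -o <sorted_bam_output_name>.bam -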


What you can also do is merge the reads (if possible) using PEAR (Paired-End reAd mergeR), which merges the pairs into contigs if they overlap, and then just align those to your sequence. It all depends on what you want to achieve.
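To show the idea behind merging an overlapping pair, here is a small Python sketch. It is not PEAR's algorithm (the real tool is considerably more sophisticated); it only looks for the longest exact suffix/prefix overlap. It assumes read 2 is given as the reverse complement of the fragment's end, and `merge_pair` and `min_overlap` are hypothetical names.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def merge_pair(read1, read2, min_overlap=3):
    """Merge a read pair into one contig if the two reads overlap exactly.

    read2 is flipped back onto the strand of read1, then the longest suffix of
    read1 that equals a prefix of the flipped mate is used as the overlap.
    Returns the merged sequence, or None if no overlap of at least
    min_overlap bases is found.
    """
    mate = reverse_complement(read2)
    for size in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-size:] == mate[:size]:
            return read1 + mate[size:]
    return None


fragment = "AGTGGTAGCTGCCTG"
read1 = fragment[:10]                       # first 10 bases of the fragment
read2 = reverse_complement(fragment[5:])    # last 10 bases, read from the other strand
print(merge_pair(read1, read2))
```

The merged contig can then be aligned to the sequence like any single read, which is why merging sidesteps the pairing question when the reads overlap.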

Also, it seems your picture is wrong: the two reads of a pair are not reverse complements of each other. They are read from the beginning and the end of the same molecule.

So your example would rather look like this:

sequence:                     AGTGGTAGCTGCCTGCCATCGA
                              |||||||        |||||||
short forward read:           AGTGGTA
short reverse read:                          CCATCGA
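One detail worth checking in code is read orientation. In the figure above both reads are written in the direction of the sequence, but with typical paired-end data the second read is reported from the opposite strand, so it only matches after reverse-complementing. The Python sketch below locates both reads of the example under that assumption; `locate` is a hypothetical helper and exact matching stands in for real alignment.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def locate(ref, read):
    """Return (position, strand) of the first exact hit on either strand, or None."""
    pos = ref.find(read)
    if pos != -1:
        return pos, "+"
    pos = ref.find(reverse_complement(read))
    if pos != -1:
        return pos, "-"
    return None


sequence = "AGTGGTAGCTGCCTGCCATCGA"
read1 = "AGTGGTA"
# Assumption: the mate as it would typically appear in the second reads file.
read2 = reverse_complement("CCATCGA")

hit1 = locate(sequence, read1)
hit2 = locate(sequence, read2)
# Length of the fragment spanned by the pair, outer end to outer end.
fragment_length = hit2[0] + len(read2) - hit1[0]
print(hit1, hit2, fragment_length)
```

Here the forward read lands at position 0 on the plus strand, the mate at position 15 on the minus strand, and the pair spans all 22 bases of the example sequence.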

1
7.6 years ago
Lesley Sitter ▴ 580

Well, as far as I know, some aligners first align the forward read using regular alignment algorithms like the ones @13en mentioned. They then look within a range (your estimated insert size plus some extra, just to be sure) and try to align the reverse read in that region.
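A rough Python sketch of that "anchor first, then search nearby" strategy is shown below. It is a simplification under stated assumptions: mismatches are counted with Hamming distance instead of a full gapped alignment, the second read is assumed to be already in reference orientation, and `best_hit`, `align_pair_anchored`, `max_mismatches` and `slack` are invented names.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(ref, read, start=0, end=None, max_mismatches=1):
    """Scan ref[start:end] for the placement of read with the fewest mismatches.

    Returns (position, mismatches), or None if every placement exceeds
    max_mismatches.
    """
    end = len(ref) if end is None else min(end, len(ref))
    best = None
    for pos in range(max(start, 0), end - len(read) + 1):
        mismatches = hamming(ref[pos:pos + len(read)], read)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


def align_pair_anchored(ref, read1, read2, insert_size, slack=5):
    """Place read1 anywhere on ref, then search for read2 only near it."""
    anchor = best_hit(ref, read1)
    if anchor is None:
        return None
    # The mate must lie within insert_size (+ slack) of the anchor's start.
    window_end = anchor[0] + insert_size + slack
    mate = best_hit(ref, read2, start=anchor[0], end=window_end)
    return None if mate is None else (anchor, mate)


reference = "AGTGGTAGCTGCCTGCCATCGA"
print(align_pair_anchored(reference, "AGTGGTA", "CCATCGA", insert_size=22))
```

Restricting the second search to a window is what makes this cheaper than two independent genome-wide searches, and it also resolves reads that would otherwise match in several places.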

What I have also seen in some tools is that they just align all forward and all reverse reads to your sequence separately, then match up the two reads of each pair and compare their coordinates. Only pairs whose coordinates satisfy your arguments are kept as proper pairs (for example, an insert size of 200 should eliminate all pairs whose reads are more than 200 - (read lengths of both reads) apart from each other), and those are then output in SAM format or something similar.
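The coordinate filter described above can be sketched in a few lines of Python. The input layout (dictionaries mapping a pair name to a 0-based start position), the function name and the example coordinates are all assumptions made for illustration; all reads are assumed to have the same length.

```python
def filter_proper_pairs(forward_hits, reverse_hits, read_length, insert_size=200):
    """Keep pairs whose two reads are no further apart than the insert size allows.

    forward_hits / reverse_hits map a pair name to the 0-based start of that
    read on the reference. Returns {name: (start1, start2, gap)} for proper pairs.
    """
    # Largest allowed gap between the reads: insert size minus both read lengths.
    max_gap = insert_size - 2 * read_length
    proper = {}
    # Only names present in both sets have both reads aligned.
    for name in forward_hits.keys() & reverse_hits.keys():
        start1, start2 = forward_hits[name], reverse_hits[name]
        gap = start2 - (start1 + read_length)
        if start2 >= start1 and gap <= max_gap:
            proper[name] = (start1, start2, gap)
    return proper


forward = {"pair1": 1000, "pair2": 5000, "pair3": 300}
reverse = {"pair1": 1120, "pair2": 5400, "pair4": 10}
print(filter_proper_pairs(forward, reverse, read_length=50))
```

With 50-base reads and an insert size of 200 the allowed gap is 100 bases, so only `pair1` (gap 70) survives; `pair2` is too far apart and `pair3`/`pair4` each lack an aligned mate.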

If you take the PEAR approach: take a read pair, look for overlap, output the merged PEAR contigs, then align those contigs to your sequence.

But it all still depends on what you actually want to accomplish with your software. I would suggest reading up on what other paired-end mappers/aligners do and then deciding what you want to improve. Or, if you have a specific research question, first find out what output you need to answer it, then work out the best way to get it.

Best advice I can give is, don't try to reinvent the wheel. Some of these tools have years of research put into them so it won't be easy to just write one on the go unless you really want to put some time into it.