Question: Paired-end Illumina reads
3.6 years ago by
midox220
Tunisia
midox220 wrote:

Hello,
With paired-end sequencing we get two files, R1.fq and R2.fq.
Is R1.fq the first strand of the DNA and R2.fq the second strand?
And how do the reads overlap? Must we merge the reads of R1.fq and R2.fq into single reads?


Thank you

paired end assembly • 4.7k views
ADD COMMENTlink modified 3.3 years ago by jomo018480 • written 3.6 years ago by midox220
3.6 years ago by
andynkili10
andynkili10 wrote:

Please specify what kind of paired reads you have (SOLiD, Illumina, ...).

About merging: it depends on what you are going to do with the reads (what kind of analysis). In my view, read mapping and assembly are more accurate with paired-end reads than with merged reads, but again it depends on what you want to do.

ADD COMMENTlink written 3.6 years ago by andynkili10

I want to do an assembly, but first I need to understand the reads in the two files.

For example, I have two files, R1.fq and R2.fq:

r1.1   ---->........................<----  r1.2                                                          
                           r2.1    ---->........................<----   r2.2                                    
  r3.1  ---->........................<----   r3.2                                                               
                                                            rX.1      ---->........................<---- rX.2   

How do I assemble in this case?

Thank you

ADD REPLYlink written 3.6 years ago by midox220

All your rX.1 reads are the forward reads, stored in one file, and the rX.2 reads are the reverse ones. You can use an assembler (SPAdes, Ray, ...) and specify on the command line which file contains the forward reads and which the reverse reads. The choice of assembler depends on the type of organism you are studying: some assembly tools perform better on viruses, others on bacteria. Before going into any assembly step, did you preprocess your reads? Any filtering? Any trimming?
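To make the two-file layout concrete, here is a minimal Python sketch that walks two FASTQ files in lockstep, so that record i of the forward file is handed out together with record i of the reverse file. The function names `read_fastq` and `read_pairs` and the tiny in-memory reads are made up for illustration, and the sketch assumes plain four-line FASTQ records in the same order in both files.

```python
import io
from itertools import islice

def read_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ stream with 4 lines per record."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return  # end of file (or a truncated last record)
        header, seq, _plus, qual = record
        yield header[1:], seq, qual  # drop the leading "@"

def read_pairs(r1_handle, r2_handle):
    """Yield (forward, reverse) records: record i of R1 is the mate of record i of R2."""
    yield from zip(read_fastq(r1_handle), read_fastq(r2_handle))

# Tiny in-memory stand-ins for R1.fq and R2.fq (made-up reads, not real data).
r1 = io.StringIO("@frag1/1\nACGTAC\n+\nIIIIII\n@frag2/1\nTTGACC\n+\nIIIIII\n")
r2 = io.StringIO("@frag1/2\nGGATCC\n+\nIIIIII\n@frag2/2\nCATGCA\n+\nIIIIII\n")

pairs = list(read_pairs(r1, r2))
for fwd, rev in pairs:
    print(fwd[0], fwd[1], "<->", rev[0], rev[1])
```

With real files you would pass two open file handles instead of the `io.StringIO` objects.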

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by andynkili10

Thanks for your help.
Yes, I know I can pass them to an assembler.
But I want to understand how assembly works with paired-end reads (two files).
For example, if I want to find overlapping reads, do I take the forward and the reverse reads and look for overlaps between them?
Or do I take the reverse complement of R2.fq first?
Thanks

ADD REPLYlink written 3.6 years ago by midox220

Just give your reads to an assembler. Don't apply your own preprocessing unless the assembler requires it. If you want to understand how paired-end assembly works, read the paper describing the assembler you use. Different assemblers work in different ways. Choose your assembler first, and then ask questions.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by lh331k

Thanks for your help.
I'm not talking about using assemblers.
I have used assemblers, but if I want to write one myself I need to understand the basics, especially the types of paired-end read files,
because I am confused about paired-end reads.
In the case of a single file, I can look for overlaps between the reads of that file.
But in the case of paired-end reads (2 files), do I have to find the overlaps between the reads of the two files?
Thank you

ADD REPLYlink written 3.6 years ago by midox220
3.6 years ago by
Kamil1.9k
Boston
Kamil1.9k wrote:

You might consider learning more about RNA-seq before you begin any analysis.

Try starting with RNA-seqlopedia, a comprehensive reference that covers every step of RNA-seq:

1. Experimental Design
2. RNA Preparation
3. Library Preparation
4. Sequencing
5. Analysis
6. References

To answer your particular question about paired reads, try the section on paired-end sequencing.

ADD COMMENTlink written 3.6 years ago by Kamil1.9k

Thank you for your help.
I'm not talking about RNA. I have DNA data and I want to do an assembly, but first I need to understand the R1.fq and R2.fq files and how to find the overlaps.

thanks

ADD REPLYlink written 3.6 years ago by midox220
3.6 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

With paired-end sequencing, you sequence both ends of the same shotgun fragment, just as you have drawn. The middle part, represented in your diagram as dots, remains unknown.

The principles of assembly do not differ much from those for single-end sequencing (where only one end of the shotgun fragment is sequenced), but the information from both ends is taken into account at the same time.

This is highly advantageous. With two sequenced ends per shotgun fragment, you have doubled the amount of information available for the assembly. The fact that both reads come from the same fragment and are separated by an approximately known distance imposes constraints on the assembler and allows a better placement of the sequences in the final draft of the genome.

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by Antonio R. Franco4.0k

Thank you for your explanations.
So I have both the R1.fq and R2.fq files. Do I just find the overlaps between the reads in the two files without taking the reverse complement of R2.fq? Do I use both files as they are?

Thank you

ADD REPLYlink written 3.6 years ago by midox220

Exactly. The assemblers will take care of it. But you need to tell the program that this is paired data. Most of the time you do this simply by writing one file followed by the other on the command line.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Antonio R. Franco4.0k

Thanks for your help. I'm not talking about using assemblers.
I know that I can tell the assembler what type of files I have.
But suppose, for example, that I want to write an assembler.
I want to understand paired-end reads and how they work.
Tell me if I'm wrong: to assemble paired reads, I must take both the R1.fq and R2.fq files and find the overlaps between the reads of the two files?
I'm no expert, but I want to understand the basics. I have read many articles, but none of them addresses such questions.

ADD REPLYlink written 3.6 years ago by midox220

When you sequence with NGS (Next Generation Sequencing) systems, you break the DNA or the cDNA into pieces (the shotgun fragments) and run a step in which you select for a given size (let's say 600 bases, plus or minus about 50 bases).

Then, by sequencing 100 bases from each end, you know that there is a central part approximately 400 +/- 50 bases long whose sequence remains unknown. This is not a big deal, because that region will be covered by reads from other shotgun fragments, as long as the DNA was broken at random. Simply by chance, other sequenced ends will cover that space.
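The arithmetic behind those numbers can be written out as a small Python sketch. The helper name `unsequenced_gap` is made up for illustration, and the fragment sizes are just the example values from this post.

```python
def unsequenced_gap(fragment_len, read_len):
    """Bases in the middle of a fragment that neither read covers.

    Returns 0 when the two reads meet or overlap.
    """
    return max(fragment_len - 2 * read_len, 0)

read_len = 100
for fragment_len in (550, 600, 650):  # 600 bases, plus or minus 50
    print(fragment_len, unsequenced_gap(fragment_len, read_len))
```

When the fragment is shorter than twice the read length, the gap is zero and the two reads overlap each other; that is the situation in which merging the two reads of a pair becomes possible.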

You only have to change your way of thinking a little. The assembler will try to fit everything together by looking for overlaps among these pieces. If only one end is sequenced, I know this is easy for you to understand.

Now you need to consider that you actually have a pair of sequences from the same DNA fragment, separated by a certain distance, that has to play the same game.

Each end can be saved into a different file. The internal name of each read allows the assembler to recognize its corresponding mate in the other file, if present. That is why you can have two separate files.
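As a rough illustration of how mates can be recognized by name, the Python sketch below strips the part of the read name that differs between mates and then matches the two files on what remains. The names `mate_key` and `pair_by_name` and the example read names are hypothetical; the sketch assumes the mates either carry a trailing /1 and /2 or share the first whitespace-separated token of the header.

```python
def mate_key(name):
    """Reduce a read name to the part shared by both mates."""
    key = name.split()[0]           # drop any comment after the first space
    if key.endswith(("/1", "/2")):  # older Illumina-style suffix
        key = key[:-2]
    return key

def pair_by_name(r1_names, r2_names):
    """Return (pairs, orphans) by matching read names between two files."""
    r2_by_key = {mate_key(n): n for n in r2_names}
    pairs, orphans = [], []
    for name in r1_names:
        mate = r2_by_key.pop(mate_key(name), None)
        if mate is None:
            orphans.append(name)    # forward read whose mate is missing
        else:
            pairs.append((name, mate))
    orphans.extend(r2_by_key.values())  # reverse reads whose mate is missing
    return pairs, orphans

r1_names = ["frag1/1", "frag2/1", "frag3/1"]
r2_names = ["frag3/2", "frag1/2", "frag4/2"]
pairs, orphans = pair_by_name(r1_names, r2_names)
print(pairs)
print(orphans)
```

Matching by name works even when the two files are not in the same order, at the cost of holding one file's names in memory.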

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Antonio R. Franco4.0k

thank you for your explanations.
so if we have:
r1.1 ----> ........................ <---- R1.2
                               r2.1 ----> ........................ <---- R2.2
   r3.1 ----> ........................ <---- R3.2
                                                             rX.1 ----> ........................ <---- rX.2

Here there is an overlap between r1.1 and r3.1, so we can assemble them.
There is also an overlap between r1.2 and r2.1.
In this case the overlap is between a read from R1.fq and a read from R2.fq. Do we assemble them in the same way as r1.1 and r3.1 (without taking the reverse complement of r1.2, using the sequence as it is in the file)?
Thank you

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by midox220

I don't see in your example that r1.2 overlaps with r2.1.

I see that r2.1 overlaps with the reverse complement of r3.2, and the latter with r1.2. So a contig can be formed from r2.1, the reverse complement of r3.2 and the reverse complement of r1.2, in that order.
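A minimal Python sketch of this point: a forward read and a reverse read that cover the same region only show a suffix-prefix overlap once the reverse read has been reverse-complemented. The function names and the short sequences are made up for illustration, and the overlap test is exact matching only, which a real assembler would not rely on.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

forward_read = "ATGCGTACC"
# A reverse read as it appears in the second file: it was read from the
# opposite strand, so it is the reverse complement of the region it covers.
reverse_read = reverse_complement("GTACCTGAAG")

print(overlap(forward_read, reverse_read))                      # as stored in the file
print(overlap(forward_read, reverse_complement(reverse_read)))  # after reverse-complementing
```

The first call finds no overlap; the second finds the shared GTACC.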

 

ADD REPLYlink written 3.6 years ago by Antonio R. Franco4.0k

Okay.
So to make an assembly from paired-end data, we must take the reverse complement of the rX.2 reads (the second file) in order to find the overlaps between all reads of the two files?

Is it necessary to have the reads in the same direction?

r1.1   ---->........................---->  r1.2                                                          
                           r2.1    ---->........................---->   r2.2                                    
  r3.1  ---->........................---->   r3.2                                                               
                                                            rX.1      ---->........................----> rX.2   
am I right?
thanks

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by midox220

any response for this please?
thanks

 

ADD REPLYlink written 3.6 years ago by midox220

Any response, please?

ADD REPLYlink written 3.6 years ago by midox220
3.6 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

One more thing. Some assemblers require that you keep your paired reads in a single (interleaved) FASTQ file. Most often, they are in separate files.

If you look at this Wiki page on FASTQ, you will see that the header of every read contains the spatial location of the sequence in the sequencer (the read location), which makes it possible to recognize which ends are paired, as well as the barcode, if any, and whether the sequence is the first read or its mate, marked with /1 and /2 respectively.
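As an illustration, the Python sketch below pulls those fields out of an older-style Illumina header of the form instrument:lane:tile:x:y#index/member. The regular expression and the field names are my own choice for this sketch, and newer Illumina headers use a different layout (more colon-separated fields, with the pair member given after a space), so treat this as an assumption about one format only.

```python
import re

# Older Illumina-style header, e.g. @HWUSI-EAS100R:6:73:941:1973#0/1
HEADER = re.compile(
    r"^@?(?P<instrument>[^:\s]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<index>[^/\s]+))?(?:/(?P<member>[12]))?$"
)

def parse_header(line):
    """Return the header fields as a dict of strings, or None if the line does not match."""
    match = HEADER.match(line.strip())
    return match.groupdict() if match else None

fields = parse_header("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(fields)
```

Two reads are mates when everything except the final member field is identical.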

ADD COMMENTlink written 3.6 years ago by Antonio R. Franco4.0k
3.3 years ago by
jomo018480
jomo018480 wrote:
As for your original questions, from my own experience:
1) R1 holds reads from both strands, and so does R2. The only thing you know is that read x in R1 corresponds to read x in R2, each on a different strand.
2) You can align each read against a reference genome. In that case, you need to reverse-complement half the data. You have no a priori information about which read to reverse, so you need to align both ways.
3) You can merge first and then align. You would still need to check both alignment directions.
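A toy Python sketch of "align both ways": each read is looked up in the reference as stored, and if that fails, as its reverse complement. Exact substring search stands in for a real aligner here, and the reference string and the helper name `locate` are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def locate(read, reference):
    """Return (strand, position) of an exact match, trying both orientations, or None."""
    pos = reference.find(read)
    if pos != -1:
        return "+", pos
    pos = reference.find(reverse_complement(read))
    if pos != -1:
        return "-", pos
    return None

reference = "ATGCGTACCTGAAGTCCATGGCTAAGCTTGA"   # made-up reference
read_a = reference[2:10]                         # taken from the forward strand
read_b = reverse_complement(reference[18:26])    # taken from the reverse strand

print(locate(read_a, reference))
print(locate(read_b, reference))
```

In a proper pair, one mate is expected to come back as "+" and the other as "-".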
ADD COMMENTlink written 3.3 years ago by jomo018480

In this case, how can we use the relationship between the paired-end reads to relate the contigs after assembly and build scaffolds?
 

ADD REPLYlink written 3.2 years ago by midox220

Assemblers use the paired-end information to form contigs.

Gaps and repeated sequences are responsible for the appearance of many contigs.

But assemblers cannot join or organize contigs into scaffolds once the assembly is done. They simply cannot do it, and this is why you end up with many different contigs.

To organize contigs into scaffolds you need a different strategy, such as mate-pair reads, long sequencing reads (PacBio, Illumina long reads, Nanopore) or a comparison with a trusted genome.

ADD REPLYlink written 3.2 years ago by Antonio R. Franco4.0k

How can I get mate-pair reads?
Can't we do the scaffolding with only the paired-end reads?
Thanks

ADD REPLYlink written 3.2 years ago by midox220

Look for information in this forum and on the Illumina web pages about mate pairs, a different kind of paired sequences in which the two reads from the same fragment retain long-distance genomic information (usually several kb).

Assemblers cannot go beyond forming contigs because of the limitations of short shotgun sequences and the presence of repeated sequences (hard to manage) and gaps. If an assembler using paired-end data gives you several contigs, it is because it cannot go beyond that. Assembly from second-generation sequencers that use short shotgun reads is far from perfect and very limited. An assembler trying to assemble a simple 4.5 Mb E. coli genome with 100X coverage of Illumina reads can give you between 150 and 250 different contigs.

This is why you need to overcome these limitations with mate pairs, long reads or, if possible, by comparing and ordering the contigs against a trusted reference genome, if one is available.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Antonio R. Franco4.0k

thank you for your reply.
So with paired-end reads I have to download the mate pairs?
 

ADD REPLYlink written 3.2 years ago by midox220

If you download paired-end reads, you are downloading both mates from a single fragment, that is, the sequences from both ends of the same fragment. You can also be given single-end data, in which only one of the two ends is sequenced.

But don't get confused. The two reads of a paired-end fragment are called mates because they are related to each other: they are the two ends of the same fragment, usually separated by less than 300 to 500 bp.

A mate pair is different, however. It involves a completely different library protocol, and the two reads come from a single fragment that is several kilobases long. So the mates in a mate pair are separated by several kb.

Check the Illumina pages for the protocols used to obtain paired-end and mate-pair sequences.

If you find these answers useful, acknowledge them by voting. If you consider the subject closed, do the same, so people are alerted that this thread contains useful information. No votes, no interest in reading this.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Antonio R. Franco4.0k

I downloaded paired-end reads of length 300 bp, but I have no mate pairs.

What do I do in this case?

ADD REPLYlink written 3.2 years ago by midox220
You don't have many choices. You can only assemble with these paired-end data. Use different assemblers, such as MIRA (OLC method) and Velvet or SPAdes (de Bruijn graph), and compare them by measuring assembly metrics such as N50, etc.
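For reference, N50 is the contig length such that contigs of that length or longer contain at least half of all assembled bases. A minimal Python sketch follows; the function name `n50` and the contig lengths are made-up example values.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of the total bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly

lengths = [80, 70, 50, 40, 30, 20]  # total 290, so half is 145
print(n50(lengths))
```

Here the two longest contigs (80 + 70 = 150) already pass the halfway mark, so the N50 is 70. A higher N50 suggests a more contiguous assembly, although it says nothing about whether the contigs are correct.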
ADD REPLYlink written 3.2 years ago by Antonio R. Franco4.0k

But what if I want to do scaffolding?
What do I do in this case? Do I have to download a mate-pair file?
Can I get a link to download mate pairs for E. coli?
Thanks

ADD REPLYlink written 3.2 years ago by midox220

You need to prepare your mate pairs from the same genome you are sequencing as paired-end.

If you want to improve the assembly, you have a hard task. The only thing I believe you can do is use different assemblers in the hope that one does better than the others. Use MIRA, Velvet, SPAdes, etc.

ADD REPLYlink written 3.2 years ago by Antonio R. Franco4.0k

No, I want to do scaffolding. I built contigs with an assembler, but I still need to do the scaffolding and I do not want to use a scaffolding tool. I want to do it myself, but I do not know how!

You advised me to use mate pairs, but I do not know how to get them. This is my problem.

"You need to prepare your mate-pairs from the same genome you are sequencing as paired-end"

I did not prepare the paired-end reads myself; I downloaded the files.
What do I do about mate pairs?
Thanks

ADD REPLYlink written 3.2 years ago by midox220

You need to prepare your mate-pair library at the time you are preparing your paired-end fragments. If you check the Illumina information, you will notice that the two follow different protocols.

If you don't have mate-pair sequences, you can't do anything. You need to prepare the mate pairs at the same time from the same genome, and sequence everything together.

You don't mention what your genome is. Maybe there is a trusted reference genome you can compare against with programs like Mauve, which let you organize your contigs into scaffolds.

ADD REPLYlink written 3.2 years ago by Antonio R. Franco4.0k

My genomes are E. coli and S. cerevisiae W303.

ADD REPLYlink written 3.2 years ago by midox220

These two organisms have nice and trusted genomes you can use as reference.

Download and install Mauve, read its instructions, and use its tool for ordering the contigs from your assembly by comparison with these reference genomes.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Antonio R. Franco4.0k

ok thank you

ADD REPLYlink written 3.2 years ago by midox220
3.3 years ago by
midox220
Tunisia
midox220 wrote:

Hello,
I have some questions, please.
How can we find out the "known distance" between paired-end reads?
And if we have two files, forward and reverse, is the reverse file always the reverse-complement strand?
Can we never find reverse-complement reads in the forward file?
Thanks

ADD COMMENTlink written 3.3 years ago by midox220

Known distance is usually a best guess based on how long the target sequences are supposed to be: library insert size or length from timed PCR.

Yes, reverse is from the reverse complement. Here's a video that describes Illumina paired end sequencing:

 

ADD REPLYlink written 3.3 years ago by anp375160

How do I find the distance? I only have the paired-end files.
 

ADD REPLYlink written 3.2 years ago by midox220

If you search this forum, you can find the answer.

To find the distance, you need to map the reads to a reference and get the SAM/BAM mapping file.

Some tools can then estimate the distance between read mates from that file.
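As a rough sketch of what such a tool does, the Python below reads SAM text, keeps the properly paired records (FLAG bit 0x2) and collects the observed template length from the TLEN column (column 9), counting each fragment once by keeping only positive values. The function name `insert_sizes` and the SAM records are made up for illustration; a BAM file would first have to be converted to SAM text or read with a dedicated library.

```python
from statistics import median

def insert_sizes(sam_lines):
    """Collect TLEN (column 9) once per properly paired fragment."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        properly_paired = flag & 0x2
        if properly_paired and tlen > 0:  # positive TLEN marks the leftmost mate
            sizes.append(tlen)
    return sizes

# Made-up SAM records; SEQ and QUAL are given as "*" to keep the lines short.
sam_text = "\n".join([
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t99\tchr1\t100\t60\t100M\t=\t400\t400\t*\t*",
    "r1\t147\tchr1\t400\t60\t100M\t=\t100\t-400\t*\t*",
    "r2\t99\tchr1\t150\t60\t100M\t=\t500\t450\t*\t*",
    "r2\t147\tchr1\t500\t60\t100M\t=\t150\t-450\t*\t*",
    "r3\t99\tchr1\t200\t60\t100M\t=\t450\t350\t*\t*",
    "r3\t147\tchr1\t450\t60\t100M\t=\t200\t-350\t*\t*",
    "r4\t65\tchr1\t300\t60\t100M\t=\t5200\t5000\t*\t*",  # paired but not properly paired
])

sizes = insert_sizes(sam_text.splitlines())
print(sizes, median(sizes))
```

The median is less sensitive than the mean to the occasional wildly wrong pair, which is why it is used here.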

ADD REPLYlink written 3.2 years ago by Antonio R. Franco4.0k

have you any references please?

ADD REPLYlink written 3.2 years ago by midox220