Question: NGS analysis: how to handle paired-end reads
m93 wrote, 11 months ago:

I am learning how to analyse NGS data. I have data for 192 samples. These were obtained through a targeted sequencing library prep.

I have 192 samples, but technically I have received 2 files for each sample. For example:

  • sample1_TTGCCTT_L008_R1_001.fastq.gz
  • sample1_TTGCCTT_L008_R2_001.fastq.gz

Presumably the reason there are 2 files is because paired-end sequencing was performed. I've been reading around on the various steps of NGS analysis but I can't seem to find an answer to the following question:

How should paired-end reads be handled? Do you have to merge them and, if so, when? Before alignment, presumably? Also, do I have to uncompress the fastq.gz files before I do anything? I am very new to NGS, so apologies if these are really basic questions. Thanks.

genomax (United States) wrote, 11 months ago:

You don't need to merge the R1/R2 reads. You don't say what kind of data this is but generally if you are aligning to a reference then you would use the two files together with an NGS aligner. Since the files contain reads from the same fragment their alignment to a reference provides spatial information.

All extant NGS tools should understand gzipped files. You should not need to decompress them during analysis (note: there may be some exceptions depending on very specific programs).
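
As a quick illustration of working with the compressed files directly, here is a minimal sketch that counts the reads in a gzipped FASTQ by streaming it through `gzip -cd`. The function name `count_reads` is made up for this example, and it assumes standard 4-line FASTQ records (no wrapped sequence lines).

```shell
# Count reads in a gzipped FASTQ without writing a decompressed copy to disk.
# Assumption: every record is exactly 4 lines (header, sequence, '+', qualities).
count_reads() {
    gzip -cd "$1" | awk 'END { printf "%d\n", NR / 4 }'
}
```

Running it on both files of a pair (for example `count_reads sample1_TTGCCTT_L008_R1_001.fastq.gz` and the matching R2 file) should give the same number; if it does not, the pair is probably out of sync or truncated.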

(An aside: if the reads are longer than half the insert size, then they can overlap in the middle.)

Reads will overlap in this case

|------------------------------>100 bp|    R1 - 150 bp
|-------------------------------------|    Fragment 250 bp
|100 bp<------------------------------|    R2 - 150 bp

and will not here

|-------->                            |    R1 - 100 bp
|-------------------------------------|    Fragment 350 bp
|                           <---------|    R2 - 100 bp
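
The arithmetic behind the two diagrams can be sketched as below. This is only a back-of-the-envelope helper (the name `overlap_bp` is made up here), and it assumes you know or can estimate the fragment length; in real data the fragment size varies from read pair to read pair.

```shell
# Smaller of two integers.
min() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Expected overlap (bp) between R1 and R2 for a given fragment length.
# Usage: overlap_bp R1_LEN R2_LEN FRAGMENT_LEN
overlap_bp() {
    # A read cannot cover more of the fragment than the fragment itself.
    e1=$(min "$1" "$3")
    e2=$(min "$2" "$3")
    ov=$(( e1 + e2 - $3 ))
    # A negative value means there is a gap between the mates, i.e. no overlap.
    if [ "$ov" -lt 0 ]; then ov=0; fi
    echo "$ov"
}
```

With the numbers from the diagrams, `overlap_bp 150 150 250` prints 50 (the reads share the middle 50 bp) and `overlap_bp 100 100 350` prints 0.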

And if the reads are longer than the fragment, then you'll sequence through the fragment and into the adapter. This is why many pipelines include an adapter-trimming step.
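
To put a number on that, a rough sketch (the helper name `readthrough_bp` is invented for this example, and the fragment length is assumed to be known):

```shell
# Number of bases at the 3' end of a read that come from the adapter
# rather than from the fragment.
# Usage: readthrough_bp READ_LEN FRAGMENT_LEN
readthrough_bp() {
    rt=$(( $1 - $2 ))
    # Reads shorter than the fragment never reach the adapter.
    if [ "$rt" -lt 0 ]; then rt=0; fi
    echo "$rt"
}
```

For example, `readthrough_bp 150 100` prints 50: a 150 bp read on a 100 bp fragment ends with 50 bp of adapter, which is what a trimming tool would be expected to remove.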

(comment by d-cameron, written 11 months ago)
manuel.belmadani (Canada) wrote, 11 months ago:

Keep them separate; as Nicolas said, most modern NGS software handles paired-end reads. Once they're aligned, you should have a single SAM/BAM file containing reads from both ends.

Which tools you need depends on your purpose. For variant calling using whole-genome sequencing data, I used bwa mem for aligning the reads. A good resource would be the Broad's Best Practices Guideline, which covers the alignment step (note which version of the Guideline you're using; I've used the one for GATK 3.0, and they recently updated to 4.0, so I can't comment on the latest one).

For RNA-Seq, if you have Illumina short reads, you probably want a splice-aware aligner in order to handle cases like a read spanning an exon and part of a retained intron. I like STAR personally; HISAT2 is also popular and a bit more recent.

Gzipped files are often supported; with STAR you simply specify --readFilesCommand zcat.

My RNA-Seq pipeline uses STAR + RSEM for quantification of genes/transcripts.

Nicolas Rosewick (Belgium, Brussels) wrote, 11 months ago:

Most of the modern tools for NGS (e.g. aligners) handle paired-end fastq.gz files. Just give them as input.

For example, with bwa mem (where reference is the prefix of an index built with bwa index):

bwa mem reference sample1_TTGCCTT_L008_R1_001.fastq.gz sample1_TTGCCTT_L008_R2_001.fastq.gz > alignment.sam
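
With 192 samples you would normally loop over the files rather than type each command. Below is a minimal sketch that pairs each R1 file with its R2 mate and prints one bwa mem command per sample without running it. The function names (`mate_of`, `sample_of`, `print_bwa_commands`) are invented for this example, and it assumes every file follows the `_R1_`/`_R2_` naming shown in the question.

```shell
# R2 filename for a given R1 filename (swaps the last "_R1_" for "_R2_").
mate_of() {
    printf '%s\n' "$1" | sed 's/\(.*\)_R1_/\1_R2_/'
}

# Sample label: the basename with everything from "_R1_" onwards removed.
sample_of() {
    base="${1##*/}"
    printf '%s\n' "${base%%_R1_*}"
}

# Print (do not run) one bwa mem command per R1/R2 pair found in a directory.
# Usage: print_bwa_commands FASTQ_DIR REFERENCE_INDEX_PREFIX
print_bwa_commands() {
    for r1 in "$1"/*_R1_*.fastq.gz; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        r2=$(mate_of "$r1")
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2
            continue
        fi
        printf 'bwa mem %s %s %s > %s.sam\n' "$2" "$r1" "$r2" "$(sample_of "$r1")"
    done
}
```

Review the printed commands first (for example `print_bwa_commands fastq_dir reference`); once they look right, the output can be piped to `sh` to run them.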