NGS analysis: how to handle paired-end reads
6.0 years ago
m98 ▴ 420

I am learning how to analyse NGS data. I have data for 192 samples, obtained through a targeted sequencing library prep.

I have 192 samples, but technically I have received 2 files for each sample. For example:

  • sample1_TTGCCTT_L008_R1_001.fastq.gz
  • sample1_TTGCCTT_L008_R2_001.fastq.gz

Presumably there are 2 files because paired-end sequencing was performed. I've been reading about the various steps of NGS analysis, but I can't find an answer to the following question:

How should paired-end reads be handled? Do you have to merge them, and if so, when? Before alignment, presumably? Also, do I have to uncompress the fastq.gz files before I do anything? I am very new to NGS, so apologies if these are really basic questions. Thanks.

ngs paired-end reads analysis pipeline • 8.9k views
6.0 years ago
GenoMax 141k

You don't need to merge the R1/R2 reads. You don't say what kind of data this is, but generally, if you are aligning to a reference, you would give the two files together to an NGS aligner. Since the two files contain reads from opposite ends of the same fragment, aligning them together provides spatial information.
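With 192 samples it helps to confirm that every R1 file has its R2 mate before aligning. The following is a minimal sketch that assumes every file follows the naming pattern shown in the question (`..._R1_001.fastq.gz` / `..._R2_001.fastq.gz`); the function names `mate_of` and `list_pairs` are made up for illustration.

```shell
# Derive the R2 filename from an R1 filename by swapping the last "_R1_" for "_R2_".
# Assumption: names look like sample1_TTGCCTT_L008_R1_001.fastq.gz
mate_of() {
    local r1="$1"
    printf '%s\n' "${r1%_R1_*}_R2_${r1##*_R1_}"
}

# Print "R1<TAB>R2" for each complete pair in a directory; warn on stderr if a mate is missing.
list_pairs() {
    local dir="${1:-.}" r1 r2
    for r1 in "$dir"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue      # the glob matched nothing
        r2="$(mate_of "$r1")"
        if [ -e "$r2" ]; then
            printf '%s\t%s\n' "$r1" "$r2"
        else
            printf 'missing mate for %s\n' "$r1" >&2
        fi
    done
}
```

Running `list_pairs /path/to/fastq_dir` should then print one line per sample, and anything on stderr points at an incomplete pair.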

Nearly all current NGS tools understand gzipped files, so you should not need to decompress them for analysis (note: a few very specific programs may be exceptions).
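The same streaming idea works for quick sanity checks of your own. As a rough sketch, the two helpers below (names are hypothetical) count reads in a gzipped FASTQ without writing a decompressed copy and confirm that both mates hold the same number of reads. It assumes standard 4-line FASTQ records with no wrapped sequence lines.

```shell
# Count reads in a gzipped FASTQ by streaming it through gzip -dc.
# Assumption: every record is exactly 4 lines.
count_reads() {
    local lines
    lines="$(gzip -dc "$1" | wc -l)"
    echo $(( lines / 4 ))
}

# Compare read counts of the two mate files; returns non-zero on a mismatch.
check_pair() {
    local n1 n2
    n1="$(count_reads "$1")"
    n2="$(count_reads "$2")"
    if [ "$n1" -eq "$n2" ]; then
        echo "ok: $n1 read pairs"
    else
        echo "mismatch: R1=$n1 R2=$n2" >&2
        return 1
    fi
}
```

For example, `check_pair sample1_TTGCCTT_L008_R1_001.fastq.gz sample1_TTGCCTT_L008_R2_001.fastq.gz` reads both files directly from their compressed form.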

(An aside: if the reads are longer than half the insert size, they will overlap in the middle.)

Reads will overlap in this case

|------------------------------>100 bp|    R1 - 150 bp
|-------------------------------------|    Fragment 250 bp
|100 bp<------------------------------|    R2 - 150 bp

and will not here

|-------->                            |    R1 - 100 bp
|-------------------------------------|    Fragment 350 bp
|                           <---------|    R2 - 100 bp
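The arithmetic behind the two diagrams can be sketched as follows: the expected overlap is twice the read length minus the fragment length, floored at zero. The function name is hypothetical, and the sketch assumes both mates have the same length and that a read is not longer than the fragment.

```shell
# Expected overlap (bp) between the two mates of a pair:
#   overlap = 2 * read_length - fragment_length, or 0 if that is negative.
# Assumption: both reads have the same length and do not exceed the fragment.
expected_overlap() {
    local read_len="$1" fragment_len="$2" overlap
    overlap=$(( 2 * read_len - fragment_len ))
    [ "$overlap" -gt 0 ] || overlap=0
    echo "$overlap"
}

expected_overlap 150 250   # first diagram: prints 50
expected_overlap 100 350   # second diagram: prints 0
```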

And if the reads are longer than the fragment, you'll sequence through the fragment into the adapter. This is why many pipelines include an adapter-trimming step.

6.0 years ago

Keep them separate. As Nicolas said, most modern NGS software handles paired-end reads. Once they're aligned, you should have a single SAM/BAM file containing reads from both ends.

Depending on your purpose, you may need to choose different tools. For variant calling using whole-genome sequencing data, I used bwa mem to align the reads. A good resource would be the Broad's Best Practices Guideline, which covers the alignment step (note which version of the Guideline you're using; I've used the one for GATK 3.0, and they recently updated to 4.0, so I can't comment on the latest one).

For RNA-Seq, if you have Illumina short reads, you probably want a splice-aware aligner in order to detect cases like a read spanning an exon and part of a retained intron. I like STAR personally; HISAT2 is also popular and a bit more recent.

And gzipped files are often supported; with STAR you simply specify --readFilesCommand zcat.
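To show where that option sits alongside the two mate files, here is a small sketch that only prints a STAR command line for one paired-end sample rather than running it. The helper name `star_cmd`, the thread count, the genome index directory and the output prefix are all placeholders, not values from this thread.

```shell
# Print (do not run) a STAR command for one paired-end sample.
# Assumptions: genome_dir is an existing STAR index; 4 threads and sorted BAM output are arbitrary choices.
star_cmd() {
    local genome_dir="$1" r1="$2" r2="$3" prefix="$4"
    printf 'STAR --runThreadN 4 --genomeDir %s --readFilesIn %s %s --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outFileNamePrefix %s\n' \
        "$genome_dir" "$r1" "$r2" "$prefix"
}

star_cmd /path/to/star_index sample_R1.fastq.gz sample_R2.fastq.gz out/sample_
```

Both mates go after --readFilesIn, R1 first and R2 second. On systems where zcat does not handle .gz files, `gunzip -c` is the usual substitute.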

My RNA-Seq pipeline uses STAR + RSEM for quantification of genes/transcripts.

6.0 years ago

Most of the modern tools for NGS (e.g. aligners) handle paired-end fastq.gz files. Just give them as input.

For example, with bwa mem:

bwa mem reference sample1_TTGCCTT_L008_R1_001.fastq.gz sample1_TTGCCTT_L008_R2_001.fastq.gz > alignment.sam
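For all 192 samples the same command can be generated in a loop. The sketch below is a dry run: it prints one command per sample so you can inspect them before executing anything. It assumes the file naming from the question, that `reference` is a bwa-indexed reference as in the command above, and that you want sorted BAM output via samtools; the function name `bwa_cmds` is hypothetical.

```shell
# Print one "bwa mem | samtools sort" command per R1/R2 pair in a directory (dry run).
# Assumptions: files end in _R1_001.fastq.gz / _R2_001.fastq.gz; "reference" is a bwa index prefix.
bwa_cmds() {
    local dir="${1:-.}" ref="${2:-reference}" r1 r2 sample
    for r1 in "$dir"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue      # the glob matched nothing
        r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
        sample="$(basename "$r1" _R1_001.fastq.gz)"
        printf 'bwa mem %s %s %s | samtools sort -o %s.bam -\n' "$ref" "$r1" "$r2" "$sample"
    done
}
```

Once the printed commands look right, the output of `bwa_cmds fastq_dir reference` can be saved to a script and run, or piped to `sh`.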