Question

Illumina Paired-End Reads File Format

2

13.4 years ago

Travis ★ 2.8k

Hi all,

I should be receiving several million PE reads from multiple samples/lanes soon and I am wondering what format the files take.

I know they will be FASTQ, but do they generally come as one sample per file, one lane per file, or something else? Also, do the paired ends come in the same file or in different files?

I plan to align with BWA and it looks like it expects separate files for the paired ends. Is this correct? If samples/lanes/ends need to be separated into individual files, is there a standard way of doing this?

Thanks in advance.

next-gen sequencing paired • 9.3k views

ADD COMMENT • link updated 13.3 years ago by Sean Davis 27k • written 13.4 years ago by Travis ★ 2.8k

score 4 · Answer 1 · 2011-05-25

4

13.4 years ago

Pierre Lindenbaum 164k

For Illumina, you should receive two FASTQ files (_1.fastq and _2.fastq) with the same number of reads in each file. The two reads of each pair have the same index in their respective files.
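A quick sanity check of the two properties above (same number of reads, mates at the same index) could look like the sketch below. It is not from the original answer: the function names are made up for illustration, and it assumes unwrapped 4-line FASTQ records and read names that match once an optional trailing /1 or /2 is removed.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def read_ids(handle):
    """Yield the read name of each 4-line record, without a /1 or /2 suffix."""
    for i, line in enumerate(handle):
        if i % 4 != 0:          # only the first line of each record is a header
            continue
        fields = line[1:].split()   # drop the leading '@' and any comment text
        if not fields:              # tolerate a blank line at the end of the file
            continue
        name = fields[0]
        if name.endswith("/1") or name.endswith("/2"):
            name = name[:-2]
        yield name


def check_pairs(path1, path2):
    """Return the number of pairs; raise ValueError if the two files disagree."""
    count = 0
    with open_fastq(path1) as f1, open_fastq(path2) as f2:
        for id1, id2 in zip_longest(read_ids(f1), read_ids(f2)):
            if id1 is None or id2 is None:
                raise ValueError(f"files differ in length after {count} pairs")
            if id1 != id2:
                raise ValueError(f"pair {count + 1}: {id1!r} != {id2!r}")
            count += 1
    return count
```

Calling check_pairs("sample_1.fastq", "sample_2.fastq") (hypothetical file names) before alignment is a cheap way to catch a truncated or re-sorted file.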

ADD COMMENT • link 13.4 years ago by Pierre Lindenbaum 164k

2

Travis: yes, if you do paired-end sequencing, you get two files. The naming depends on the technology or company, e.g. we get files named 123456_s_N_[12]_lib.txt, where the first number is a serial number, N is the lane (I believe), [12] is 1 or 2, i.e. which end, and lib is a designation for the library used. But the contents are as Pierre says.
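If your files follow a pattern like the one above, matching the two ends up can be scripted. The sketch below is only an illustration: the regular expression encodes my reading of that one site-specific naming scheme (including the guess that N is the lane), and pair_files is a made-up helper name, so adjust both to whatever your sequencing center actually delivers.

```python
import re
from collections import defaultdict

# Assumed layout: <serial>_s_<lane>_<end>_<library>.txt, with end being 1 or 2.
NAME_RE = re.compile(
    r"^(?P<serial>\d+)_s_(?P<lane>\d+)_(?P<end>[12])_(?P<library>.+)\.txt$"
)


def pair_files(filenames):
    """Group file names by (serial, lane, library) and return complete pairs."""
    groups = defaultdict(dict)
    for name in filenames:
        m = NAME_RE.match(name)
        if m is None:
            continue  # not a read file under the assumed naming scheme
        key = (m["serial"], int(m["lane"]), m["library"])
        groups[key][int(m["end"])] = name
    # Keep only the groups for which both ends are present.
    return {k: (v[1], v[2]) for k, v in groups.items() if 1 in v and 2 in v}
```

Files that do not match the pattern, or that are missing their mate, are silently left out, so compare the result against the list of files you expected.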

ADD REPLY • link 13.4 years ago by Ketil 4.1k

0

Thanks! Are there two files per sample?

ADD REPLY • link 13.4 years ago by Travis ★ 2.8k

score 3 · Answer 2 · 2011-05-25

3

13.4 years ago

Sean Davis 27k

Hi, Travis.

You'll want to be in touch with the sequencing center providing the sequencing service. They are unlikely to combine lanes of data into samples if samples are run in multiple lanes; if they do, you should ask them to split them up again or do so yourself. If there are multiple samples per lane (multiplexed), you or they will need to split based on the index barcode. In general, you will want to learn about the SAM/BAM format and Read Groups so that you can keep track of the various units of data, such as library, sample, and lane, as you move through downstream analyses. How important this is will depend on the scientific application (pretty important for variant calling, but perhaps less so for gene expression).

Sean
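To make the Read Group idea concrete, here is a minimal sketch that builds a SAM @RG header line from the sample, library, and lane. The tags (ID, SM, LB, PL, PU) come from the SAM specification; the way the ID and PU values are composed, the flowcell argument, and the function name are assumptions chosen for illustration, not a required convention.

```python
def read_group_line(sample, library, lane, flowcell, platform="ILLUMINA"):
    """Build a tab-separated @RG header line for one sample in one lane."""
    fields = [
        ("ID", f"{sample}.{flowcell}.{lane}"),  # assumed scheme: unique per sample and lane
        ("SM", sample),                         # sample name
        ("LB", library),                        # library name
        ("PL", platform),                       # sequencing platform
        ("PU", f"{flowcell}.{lane}"),           # platform unit, here flowcell plus lane
    ]
    for tag, value in fields:
        if "\t" in str(value) or "\n" in str(value):
            raise ValueError(f"{tag} value must not contain tabs or newlines")
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


if __name__ == "__main__":
    # Hypothetical values, shown with repr so the tab separators are visible.
    print(repr(read_group_line("S1", "libA", 3, "FC1")))
```

One such line per sample/lane combination, attached when the reads are aligned or added to the BAM afterwards, is what lets downstream tools tell lanes and libraries apart after files are merged. Check your aligner's help text for how it takes a read group string.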

ADD COMMENT • link 13.4 years ago by Sean Davis 27k

0

A great help. Any good references to learn about SAM/BAM and tracing the data units through sample/lane/etc.?

ADD REPLY • link 13.4 years ago by Travis ★ 2.8k

0

The samtools site (http://samtools.sourceforge.net) is a good place to look for SAM-specific information. In particular, the SAM format is described here: http://samtools.sourceforge.net/SAM1.pdf. The GATK website is a good place to learn about DNA sequence data analysis, though those tools might not always be the best ones for the job.

ADD REPLY • link 13.4 years ago by Sean Davis 27k

0

I have already downloaded samtools and GATK and done some background reading, but I keep getting hung up on small details :) I guess I really just need to generate some dummy reads and run through a couple of workflows with BWA/samtools and GATK.

ADD REPLY • link 13.4 years ago by Travis ★ 2.8k

0

Sean, why do you need to keep the data from separate lanes in separate files? Isn't it OK to combine the lane-based files into file.1.fastq and file.2.fastq (forward and reverse) and perform the alignment on those?

ADD REPLY • link 12.8 years ago by Biomed 5.0k