Question: Illumina Paired End Reads File Format
2
8.0 years ago by
Travis2.8k
USA
Travis2.8k wrote:

Hi all,

I should be receiving several million PE reads from multiple samples/lanes soon and I am wondering what format the files take.

I know they will be FASTQ but I am wondering do they generally come as one sample per file, one lane per file or something else? Also do the paired ends come in the same or different files?

I plan to align with BWA and it looks like it expects separate files for the paired ends. Is this correct? If samples/lanes/ends need to be separated into individual files, is there a standard way of doing this?

Thanks in advance.

paired next-gen sequencing • 5.8k views
modified 7.8 years ago by Sean Davis25k • written 8.0 years ago by Travis2.8k
4
8.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

For Illumina, you should receive two FASTQ files (_1.fastq and _2.fastq) with the same number of reads in each file. The two reads of a pair have the same index, i.e. they sit at the same position in their respective files.
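To make the "same number of reads, same index" point concrete, here is a minimal Python sketch that walks two FASTQ files in parallel and checks that the read names agree at every position. It assumes plain 4-line FASTQ records (no wrapped sequences) and optionally an old-style /1 or /2 mate suffix; the function names `read_names` and `check_pairs` are made up for this illustration.

```python
import gzip
from itertools import zip_longest


def read_names(path):
    """Yield the read name from each 4-line FASTQ record in `path`."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 != 0:  # only the first line of each record is a header
                continue
            # Drop the leading '@' and any comment after whitespace.
            fields = line[1:].split()
            if not fields:  # tolerate a blank trailing line
                continue
            name = fields[0]
            # Strip an old-style '/1' or '/2' mate suffix if present.
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def check_pairs(path_1, path_2):
    """Return the number of pairs, or raise ValueError if the files disagree."""
    count = 0
    for name_1, name_2 in zip_longest(read_names(path_1), read_names(path_2)):
        if name_1 is None or name_2 is None:
            raise ValueError(f"files differ in length after {count} pairs")
        if name_1 != name_2:
            raise ValueError(f"pair {count + 1}: {name_1!r} != {name_2!r}")
        count += 1
    return count
```

Calling `check_pairs("sample_1.fastq", "sample_2.fastq")` (hypothetical file names) before alignment is a cheap way to catch a truncated or reordered file.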

written 8.0 years ago by Pierre Lindenbaum120k
2

Travis: yes, if you do paired-end sequencing, you get two files. The naming depends on the technology or company; e.g. we get files named 123456_s_N_[12]_lib.txt, where the first number is a serial number, N is the lane (I believe), [12] is 1 or 2 (i.e. which end), and lib designates the library used. But the contents are as Pierre says.
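As a rough illustration of that naming scheme, the Python sketch below groups file names of the form 123456_s_N_[12]_lib.txt into (end 1, end 2) pairs. The regular expression is only a guess based on the pattern described here, and the names `NAME_RE` and `pair_files` are hypothetical; other centers will use other conventions.

```python
import re
from collections import defaultdict

# Assumed pattern: <serial>_s_<lane>_<end>_<library>.txt
NAME_RE = re.compile(
    r"^(?P<serial>\d+)_s_(?P<lane>\d+)_(?P<end>[12])_(?P<library>.+)\.txt$"
)


def pair_files(names):
    """Map (serial, lane, library) to (end-1 file, end-2 file)."""
    found = defaultdict(dict)
    for name in names:
        match = NAME_RE.match(name)
        if match is None:
            continue  # not a read file under this naming scheme
        key = (match["serial"], match["lane"], match["library"])
        found[key][int(match["end"])] = name
    # Keep only keys for which both ends are present.
    return {
        key: (ends[1], ends[2])
        for key, ends in found.items()
        if 1 in ends and 2 in ends
    }
```

Files whose mate is missing are dropped silently in this sketch; in practice you would probably want to report them.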

written 8.0 years ago by Ketil3.9k

Thanks! Are there two files per sample?

written 8.0 years ago by Travis2.8k
3
8.0 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

Hi, Travis.

You'll want to be in touch with the sequencing center providing the sequencing service. They are unlikely to combine lanes of data into samples if samples are run in multiple lanes; if they do, you should ask them to split them up again or do so yourself. If there are multiple samples per lane (multiplexed), you or they will need to split based on the index barcode. In general, you will want to learn about the SAM/BAM format and Read Groups so that you can keep track of various units of data such as library, sample, and lane as you move through downstream analyses. How important this is will depend on the scientific application (pretty important for variant calling, but perhaps less so for gene expression).
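To give a feel for how Read Groups carry sample, library, and lane through an analysis, here is a small Python sketch that builds a SAM @RG header line and a matching alignment command. The tags (ID, SM, LB, PU, PL) come from the SAM specification; the ID layout, the placeholder flowcell name, and the helper names are assumptions for illustration. The command uses `bwa mem` with its `-R` option, which is one way to attach a read group at alignment time, not necessarily the workflow intended here.

```python
def read_group(sample, library, lane, flowcell="FLOWCELL", platform="ILLUMINA"):
    """Build an @RG line with literal backslash-t separators.

    bwa mem's -R option converts the two-character sequence '\\t' into a
    real TAB when it writes the SAM header.
    """
    fields = [
        ("ID", f"{sample}.{library}.L{lane}"),  # assumed ID convention
        ("SM", sample),                         # sample name
        ("LB", library),                        # library name
        ("PU", f"{flowcell}.{lane}"),           # platform unit (flowcell + lane)
        ("PL", platform),                       # sequencing platform
    ]
    return "@RG" + "".join(f"\\t{tag}:{value}" for tag, value in fields)


def bwa_mem_command(reference, fastq_1, fastq_2, rg_line):
    """Return an argument list for one lane of one sample (not executed here)."""
    return ["bwa", "mem", "-R", rg_line, reference, fastq_1, fastq_2]
```

Aligning each lane separately with its own read group, then merging the BAM files, keeps the lane information available to downstream tools; the argument list can be passed to `subprocess.run` once the file paths are real.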

Sean

written 8.0 years ago by Sean Davis25k

A great help. Are there any good references for learning about SAM/BAM and tracking the data units (sample, lane, etc.)?

written 8.0 years ago by Travis2.8k

The samtools site (http://samtools.sourceforge.net) is a good place to look for SAM-specific information. In particular, the SAM format is described here: http://samtools.sourceforge.net/SAM1.pdf. The GATK website is a good place to learn about DNA sequence data analysis, though those tools might not always be the best ones for the job.

written 8.0 years ago by Sean Davis25k

I have already downloaded samtools and GATK and done some background reading but I keep getting myself hung up on small details :) I guess I really just need to generate some dummy reads and run through a couple of workflows with BWA/Samtools and GATK.

written 8.0 years ago by Travis2.8k

Sean, why do you need to keep the data from separate lanes in separate files? Isn't it OK to combine the per-lane files into file.1.fastq and file.2.fastq (forward and reverse) and perform the alignment on those?

written 7.3 years ago by Biomed4.5k