Question

Split paired-end fastq files by lane

23 months ago

zeina-younes • 0

I have FASTQ files for different samples generated on a NovaSeq. For each sample there are two FASTQ files, R1 and R2. I want to split those FASTQ files into 8 files (L1_R1, L2_R1, L3_R1, L4_R1, and the same for R2). Any idea how to do it? The name structure is as follows: SampleIdentifier_sampleID_laneID_pairID.fastq.gz

Fastq split novaseq • 1.7k views

ADD COMMENT • link updated 23 months ago by seidel 11k • written 23 months ago by zeina-younes • 0

This is a bit of an odd request and some clarification is needed.

Do you want to split the sample files (which are already demultiplexed?) into lane specific data files (L*) or do you have a pair of files containing more than one sample that you want to split into sample + lane specific files?

If your files are already named like SampleIdentifier_sampleID_laneID_pairID.fastq.gz then they are probably already in sample + lane format.
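One way to check whether a file already holds a single lane is to tally the lane field in the read headers. The sketch below assumes gzipped files with Illumina-style headers in which the lane is the fourth colon-separated field, and exactly 4 lines per record; the function name `count_reads_per_lane` is made up for illustration.

```shell
# Print "lane read_count" for every lane found in a gzipped FASTQ file.
# Assumption: Illumina-style headers, lane = 4th colon-separated field.
count_reads_per_lane() {
    gzip -cd "$1" |
        awk -F: 'NR % 4 == 1 { n[$4]++ }            # header line of each record
                 END { for (lane in n) print lane, n[lane] }' |
        sort -n
}

# Usage (file name taken from the thread):
#   count_reads_per_lane SampleIdentifier_S1_R1_001.fastq.gz
```

If only one lane is listed, the file is already lane specific; several lanes mean the reads from those lanes were merged into one file.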

ADD REPLY • link 23 months ago by GenoMax 141k

I am sorry, I am not a bioinformatician and I am not very familiar with the technical terms, which is why my question was not clear. The sample files are demultiplexed. My sample files are named:

SampleIdentifier_SampleID_pairID_001.fastq.gz (not sure what does the 001 at the end stand for, since all the sample names end with 001: SampleIdentifier_S1_R1_001.fastq.gz and SampleIdentifier_S1_R2_001.fastq.gz), and I want to split them into sample- and lane-specific files --> SampleIdentifier_SampleID_LaneID_pairID_001.fastq.gz (SampleIdentifier_S1_L1_R1_001.fastq.gz, SampleIdentifier_S1_L2_R1_001.fastq.gz, SampleIdentifier_S1_L3_R1_001.fastq.gz, SampleIdentifier_S1_L4_R1_001.fastq.gz and SampleIdentifier_S1_L1_R2_001.fastq.gz, SampleIdentifier_S1_L2_R2_001.fastq.gz, SampleIdentifier_S1_L3_R2_001.fastq.gz, SampleIdentifier_S1_L4_R2_001.fastq.gz).

I hope my question is clearer now.

ADD REPLY • link updated 23 months ago by cpad0112 21k • written 23 months ago by zeina-younes • 0

not sure what does the 001 at the end stand for

That is a standard part of all Illumina file names. Ages ago data used to be put in 2M read chunks per file, and that number was incremented for each file.

If you are able to ask your sequencing provider to provide lane-specific files, that will be the easiest thing for you. Otherwise you are going to need to come up with a script (perhaps a clever awk may work) to split the data into lane-specific files.

@EAS139:136:FC706VJ:**2**:2104:15343:197393 1:Y:18:ATCACG

The number I highlighted (with **) in the fastq header above, the fourth colon-separated field, tells you which lane the read is from.
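Because the lane is recorded in every read header, the awk idea could be sketched roughly as below. This is a minimal sketch, not a tested pipeline: it assumes gzipped input, exactly 4 lines per record, and the lane in the fourth colon-separated header field. The function name `split_by_lane` and its three arguments are hypothetical, and the output names follow the L1, L2, ... pattern asked for in the question.

```shell
# Split one gzipped FASTQ file into one gzipped file per lane.
# Assumptions: 4 lines per record; lane = 4th colon-separated header field.
split_by_lane() {
    lane_infile="$1"   # e.g. SampleIdentifier_S1_R1_001.fastq.gz
    lane_prefix="$2"   # e.g. SampleIdentifier_S1
    lane_suffix="$3"   # e.g. R1_001
    gzip -cd "$lane_infile" |
        awk -v prefix="$lane_prefix" -v suffix="$lane_suffix" '
            NR % 4 == 1 {                  # header line: pick the output file
                split($0, h, ":")
                out = prefix "_L" h[4] "_" suffix ".fastq"
            }
            { print > out }                # all 4 lines follow their header
        '
    # Compress the per-lane files; the glob assumes no unrelated files match.
    gzip -f "${lane_prefix}"_L*_"${lane_suffix}".fastq
}

# Usage, once per read file of a pair:
#   split_by_lane SampleIdentifier_S1_R1_001.fastq.gz SampleIdentifier_S1 R1_001
#   split_by_lane SampleIdentifier_S1_R2_001.fastq.gz SampleIdentifier_S1 R2_001
```

Since both mates of a pair come from the same lane and the read order is preserved, running this on R1 and R2 separately should keep the per-lane R1 and R2 files in sync, provided the input files were in sync to begin with.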

I am not sure why you want to do this though since lane specific files have no advantage. There is little lane bias in sequencing.

ADD REPLY • link 23 months ago by GenoMax 141k

Post is confusing. Please post example input and expected output.

ADD REPLY • link 23 months ago by cpad0112 21k

Your fastq files should each have the same number of lines. Each record is 4 lines, so you should be able to split them into smaller chunks using the Linux split command. Assuming uncompressed files, count the lines and choose a number of lines per chunk that is evenly divisible by 4.

split -l NumberofLines SampleIdentifier_SampleID_LaneID_pairID_001.fastq SampleIdentifier_SampleID_LaneID_pairID_001_part_

This will split your file into separate files based on the number you specify. You can choose a numeric suffix using -d. Make sure the number you choose is evenly divisible by 4, so that no record is cut across two output files. Repeat for the second read file.
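To avoid picking a line count by hand, the number of lines can be derived from a number of reads. The sketch below assumes an uncompressed file ending in `.fastq` and GNU split (for `-d`); the function name `split_fastq_reads` and the `_part_` output prefix are illustrative only.

```shell
# Split an uncompressed FASTQ file into chunks holding a fixed number of reads.
# Assumption: GNU split is available (numeric suffixes via -d).
split_fastq_reads() {
    chunk_fq="$1"          # uncompressed FASTQ file, e.g. sample_R1_001.fastq
    chunk_reads="$2"       # reads (records) per output file
    # 4 lines per record, so this line count is always divisible by 4.
    chunk_lines=$((chunk_reads * 4))
    split -l "$chunk_lines" -d "$chunk_fq" "${chunk_fq%.fastq}_part_"
}

# Usage: same read count for both files of a pair, so chunks stay matched.
#   split_fastq_reads SampleIdentifier_S1_R1_001.fastq 1000000
#   split_fastq_reads SampleIdentifier_S1_R2_001.fastq 1000000
```

Note that this cuts the file by position only; reads from different lanes can still end up in the same chunk.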

ADD REPLY • link 23 months ago by seidel 11k

@seidel I am moving your answer to a comment since OP wants to split the files in a lane-specific manner, which your suggestion will not do.