23 months ago
zeina-younes
I have FASTQ files for different samples generated on a NovaSeq. For each sample there are two FASTQ files, R1 and R2. I want to split those FASTQ files into 8 files (L1_R1, L2_R1, L3_R1, L4_R1 and the same for R2). Any idea how to do it? The name structure is as follows: SampleIdentifier_sampleID_laneID_pairID.fastq.gz
This is a bit of an odd request and some clarification is needed.
Do you want to split the sample files (which are already demultiplexed?) into lane-specific data files (L*), or do you have a pair of files containing more than one sample that you want to split into sample + lane specific files? If your files are already named like SampleIdentifier_sampleID_laneID_pairID.fastq.gz then they are probably already in sample + lane format.
I am sorry, I am not a bioinformatician and I am not very familiar with the technical terms, which is why my question is not clear. The sample files are demultiplexed. My sample files are named:
SampleIdentifier_SampleID_pairID_001.fastq.gz (I am not sure what the 001 at the end stands for, since all the sample names end with 001: SampleIdentifier_S1_R1_001.fastq.gz and SampleIdentifier_S1_R2_001.fastq.gz), and I want to split them into sample and lane specific files --> SampleIdentifier_SampleID_LaneID_pairID_001.fastq.gz (SampleIdentifier_S1_L1_R1_001.fastq.gz, SampleIdentifier_S1_L2_R1_001.fastq.gz, SampleIdentifier_S1_L3_R1_001.fastq.gz, SampleIdentifier_S1_L4_R1_001.fastq.gz and SampleIdentifier_S1_L1_R2_001.fastq.gz, SampleIdentifier_S1_L2_R2_001.fastq.gz, SampleIdentifier_S1_L3_R2_001.fastq.gz, SampleIdentifier_S1_L4_R2_001.fastq.gz).
I hope my question is clearer now.
The 001 is a standard part of all Illumina file names. Ages ago data used to be put in 2M read chunks per file and then that number incremented for each file.
If you are able to ask your sequencing provider to provide lane-specific files, that will be the easiest thing for you. Otherwise you are going to need to come up with a script (perhaps a clever awk may work) to split the data into lane-specific files.
The number I highlighted (with **) in the fastq header above tells which lane the read is from.
I am not sure why you want to do this though, since lane-specific files have no advantage. There is little lane bias in sequencing.
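For anyone who does need lane-specific files, here is a minimal Python sketch of the script idea. It assumes the reads carry the usual Casava 1.8+ style header (@instrument:run:flowcell:lane:tile:x:y followed by an optional comment), so the lane is the fourth colon-separated field; check a header from your own files first. The function names `lane_of` and `split_by_lane` and the L<lane> output naming are made up for this illustration and simply follow the names asked for in the question.

```python
import gzip
from pathlib import Path


def lane_of(header: str) -> str:
    """Return the lane field of a Casava 1.8+ style FASTQ header line."""
    # Assumed layout: @instrument:run:flowcell:lane:tile:x:y [comment]
    fields = header.split(" ", 1)[0].split(":")
    if not header.startswith("@") or len(fields) < 7:
        raise ValueError(f"unexpected FASTQ header: {header!r}")
    return fields[3]


def split_by_lane(in_path, sample, pair, out_dir="."):
    """Write one gzipped FASTQ per lane; return {lane: number of reads}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles, counts = {}, {}
    try:
        with gzip.open(in_path, "rt") as fin:
            while True:
                # One FASTQ record is exactly four lines.
                record = [fin.readline() for _ in range(4)]
                if not record[0]:
                    break  # end of file
                if not record[3]:
                    raise ValueError("truncated FASTQ record at end of file")
                lane = lane_of(record[0].rstrip("\n"))
                if lane not in handles:
                    # Hypothetical naming, e.g. SampleIdentifier_S1_L1_R1_001.fastq.gz
                    name = f"{sample}_L{lane}_{pair}_001.fastq.gz"
                    handles[lane] = gzip.open(out_dir / name, "wt")
                    counts[lane] = 0
                handles[lane].writelines(record)
                counts[lane] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

Something like `split_by_lane("SampleIdentifier_S1_R1_001.fastq.gz", "SampleIdentifier_S1", "R1")` followed by the same call for the R2 file should keep the pairs in sync, because both mates of a read share the same lane and the read order within each output file is preserved.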
The post is confusing. Please post example input and expected output.
Your R1 and R2 fastq files should each have the same number of lines. Each record is 4 lines, so you should be able to split them into smaller chunks using the linux split command. Assuming uncompressed files, count the lines and choose a number of lines per chunk that is evenly divisible by 4.
The split command will then split your file into separate files based on the number you specify. You can choose a numeric suffix using -d. Make sure the number you choose results in each output file having a line count evenly divisible by 4. Repeat for the second read file.
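The same chunking idea can be sketched in Python, which also reads gzipped input directly. This is only an illustration of splitting into equal-sized pieces: it cuts by position in the file, not by lane. The function name `split_into_chunks` and the numbered output names are assumptions made for this example.

```python
import gzip
from itertools import islice


def split_into_chunks(in_path, out_prefix, reads_per_chunk):
    """Write consecutive chunks of whole FASTQ records; return the file names."""
    # Four lines per record, so every chunk is a multiple of 4 lines.
    lines_per_chunk = 4 * reads_per_chunk
    opener = gzip.open if str(in_path).endswith(".gz") else open
    names = []
    with opener(in_path, "rt") as fin:
        while True:
            chunk = list(islice(fin, lines_per_chunk))
            if not chunk:
                break  # input exhausted
            # Numeric suffix: prefix00.fastq, prefix01.fastq, ...
            name = f"{out_prefix}{len(names):02d}.fastq"
            with open(name, "w") as fout:
                fout.writelines(chunk)
            names.append(name)
    return names
```

Using the same `reads_per_chunk` for the R1 and R2 files keeps the mates of each pair in chunks with the same number.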
@seidel I am moving your answer to a comment since OP wants to split the files in a lane specific manner, which your suggestion will not do.
No problem. I edited the question to be more specific in that regard :)