Combine the fastq files
3.0 years ago

Hello,

I have four whole-genome sequencing data files. I believe two of them are forward reads and two are reverse reads, right? (See the attached image.)

I want to map these files to the reference genome using bwa. Should I combine the two reverse files together and the two forward files together? If so, which files should I combine (according to the picture), and what command line should I use to combine them?

Or is there no need to combine them? In that case, how do I map them to the reference genome?

SNP • 1.8k views

I am not sure why someone chose to change the R1/R2 nomenclature for read files to L1/L2, but you can combine the R1 files together and the R2 files together. You can't do that by clicking on the file names in the GUI view you have there; you will have to do it on the command line with cat R1_file1.fq.gz R1_file2.fq.gz > R1_one.fq.gz.
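
A minimal sketch of that concatenation, using two toy one-record files created on the spot so that it can be run anywhere (the file names are the placeholders from the command above, and the reads are made up). gzip streams can be joined with cat without decompressing them first. For paired-end data, the R2 files have to be concatenated in the same lane order as the R1 files so that mates stay in matching positions.

```shell
set -eu

# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Two toy single-record FASTQ files standing in for the R1 files of two lanes.
printf '@read1/1\nACGT\n+\nIIII\n' | gzip > R1_file1.fq.gz
printf '@read2/1\nTTGA\n+\nIIII\n' | gzip > R1_file2.fq.gz

# Concatenated gzip members form a valid gzip stream.
cat R1_file1.fq.gz R1_file2.fq.gz > R1_one.fq.gz

# Sanity check: a FASTQ record is four lines, so records = lines / 4.
lines=$(gzip -dc R1_one.fq.gz | wc -l)
merged_records=$(( lines / 4 ))
echo "records in merged file: $merged_records"
```

The record count of the merged file should equal the sum of the counts of the inputs, which is a cheap check to run on the real R1 and R2 files as well.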

Aligning them in pieces and then combining the resulting BAM files is a valid option too.
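
A sketch of that second route, written as a dry run that only prints the commands because running them needs bwa and samtools installed. The names ref.fa, lane1 and lane2 are hypothetical placeholders, and read-group options are left out to keep it short.

```shell
set -eu

# Hypothetical names: ref.fa is a bwa-indexed reference,
# lane1/lane2 are prefixes of the per-lane R1/R2 files.
ref=ref.fa
bams=""
plan=""

for lane in lane1 lane2; do
    # Align each lane as its own R1/R2 pair and coordinate-sort the result.
    plan="${plan}bwa mem -t 4 $ref ${lane}_R1.fq.gz ${lane}_R2.fq.gz | samtools sort -o ${lane}.sorted.bam -
"
    bams="$bams ${lane}.sorted.bam"
done

# Merge the per-lane BAMs into one file, then index it.
plan="${plan}samtools merge merged.bam$bams
samtools index merged.bam"

# Dry run: print the commands. Pipe the output to sh to execute them.
printf '%s\n' "$plan"
```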

I expect L1/L2 are lanes (renamed from L001 and L002, though I haven't a clue why they did that).

So for my samples, I should combine L1_1/L1_2 together and L2_1/L2_2 together, right?

If what @Devon says is right, i.e. L1_1 is actually L001_R1 and L2_1 is L002_R1, then you would combine L1_1 with L2_1, and then L1_2 with L2_2.

Since these are not fasta files based on their names, I edited the title of the post to accurately reflect file type.

Also see: mapping form different lane

If you want to do variant calling with GATK (maybe other tools?), it is important to map them separately and add relevant read group information to each file.

Is there a lane component to the read group?

Yes and no (bear in mind that I have barely done any variant calling; my knowledge is second-hand). Lane is a possible source of technical variation, so if the information is present, it is used for base recalibration. You can record the lane in the PU tag of the read group. Even if you don't add PU, you should make the read group ID unique for the same sample on different lanes. Here is a snippet from the GATK forum:

PU = Platform Unit

The PU holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. Although the PU is not required by GATK but takes precedence over ID for base recalibration if it is present. In the example shown earlier, two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.
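
As an illustration, here is a rough sketch that assembles such a read-group string from one of the file names in this thread. It rests on assumptions about the file name: that H353KDMXX is the flowcell barcode, that D17073416 is a library identifier (used here in place of the sample barcode), and that the digit after L is the lane. None of this is confirmed by the sequencing provider.

```shell
set -eu

# File name taken from the thread; the meaning of its fields is an assumption.
fq=BBKHU01_F_D17073416_H353KDMXX_L1_1.clean.fq.gz

# Peel the fields off with POSIX parameter expansion.
base=${fq%.clean.fq.gz}      # BBKHU01_F_D17073416_H353KDMXX_L1_1
lane_read=${base##*_L}       # 1_1
lane=${lane_read%_*}         # 1
stem=${base%_L*}             # BBKHU01_F_D17073416_H353KDMXX
flowcell=${stem##*_}         # H353KDMXX (assumed flowcell barcode)
stem=${stem%_*}              # BBKHU01_F_D17073416
library=${stem##*_}          # D17073416 (assumed library identifier)
sample=${stem%_*}            # BBKHU01_F

# bwa mem -R expects a literal backslash followed by t between tags.
rg=$(printf '@RG\\tID:%s.%s\\tPU:%s.%s.%s\\tLB:%s\\tSM:%s\\tPL:ILLUMINA' \
    "$flowcell" "$lane" "$flowcell" "$lane" "$library" "$library" "$sample")

printf '%s\n' "$rg"
```

The resulting string can be passed to bwa mem with -R. Repeating this for the lane 2 file gives a different ID and PU but the same SM, which is what keeps the lanes distinguishable while still belonging to one sample.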

Thanks. I wonder how relevant this is these days. Do people even do base recalibration anymore? I too haven't done much in the way of variant calling lately, in case you can't tell :)

Yes, I want to use GATK for variant calling. So should I now map each file separately to the reference genome? I do not understand exactly what you mean; in other words, I do not know which files I should map first.

Is my command line correct?

For L1_1/L2_1:

bwa mem -t 12 -M -R "@RG\tID: BBKHU01_F\tLB: BBKHU01_ F\tPL:ILLUMINA\tSM: BBKHU01_F" /home/m.rafiepour222/GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.fa /home/m.rafiepour222/BBKHU01_F/BBKHU01_F_D17073416_H353KDMXX_L1_1.clean.fq.gz /home/m.rafiepour222/BBKHU01_F/BBKHU01_F_D17073416_H353KDMXX_L2_1.clean.fq.gz > BBKHU01_F.sam_1


And for L1_2/L2_2:

bwa mem -t 12 -M -R "@RG\tID: BBKHU01_F\tLB: BBKHU01_ F\tPL:ILLUMINA\tSM: BBKHU01_F" /home/m.rafiepour222/GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.fa /home/m.rafiepour222/BBKHU01_F/BBKHU01_F_D17073416_H353KDMXX_L1_2.clean.fq.gz /home/m.rafiepour222/BBKHU01_F/BBKHU01_F_D17073416_H353KDMXX_L2_2.clean.fq.gz > BBKHU01_F.sam_1


If it is not correct, please tell me which command line I should use.

OK, so I map L1_1/L2_1 to the reference genome once and then L1_2/L2_2 once, just like the command line you edited for me, right?

Then I convert the SAM output of each step to BAM and combine the two BAM files? Am I correct?
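
For comparison, here is a hedged sketch of per-lane mapping. It assumes, as suggested earlier in the thread, that L1/L2 are lanes and that _1/_2 are the forward and reverse reads. Under that assumption, the two files given to one bwa mem call are the _1 and _2 files of the same lane, not the same read from two lanes. Each lane gets its own read group ID and its own output file, and the tags contain no spaces after the colon. The paths are the ones from the question. The ID layout and the output file names are illustrative choices. The commands are only printed, since running them needs bwa and samtools.

```shell
set -eu

# Paths and sample name are taken from the question in this thread.
ref=/home/m.rafiepour222/GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.fa
dir=/home/m.rafiepour222/BBKHU01_F
prefix=BBKHU01_F_D17073416_H353KDMXX
sample=BBKHU01_F

plan=""
for lane in 1 2; do
    # One read group per lane: unique ID, shared SM (illustrative ID layout).
    rg="@RG\\tID:${sample}.L${lane}\\tLB:${sample}\\tPL:ILLUMINA\\tSM:${sample}"
    # Mates of the same lane go into the same bwa mem call.
    r1="${dir}/${prefix}_L${lane}_1.clean.fq.gz"
    r2="${dir}/${prefix}_L${lane}_2.clean.fq.gz"
    out="${sample}.L${lane}.sorted.bam"
    plan="${plan}bwa mem -t 12 -M -R '${rg}' ${ref} ${r1} ${r2} | samtools sort -o ${out} -
"
done

# Dry run: print the two commands. Pipe the output to sh to execute them.
printf '%s' "$plan"
```

Piping into samtools sort writes sorted BAM directly, so no intermediate SAM file is needed. The two per-lane BAM files can then be combined with samtools merge, which keeps both read groups in the header.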