Question: Combine the fastq files
3 months ago, mostafarafiepour (0) wrote:

Hello,

I have four whole-genome sequencing data files, of which two are reverse reads and two are forward reads, right? (See the attached image.)

I want to map these files to the reference genome using bwa. Should I combine the two reverse files together, and also combine the two forward files together? If so, which files should I combine (according to the picture), and what command line should I use to combine them?

Or is there no need to combine them? How should I map them to the reference genome?


I am not sure why someone chose to change the R1/R2 nomenclature for read files to L1/L2, but you can combine the R1 files together and the R2 files together. You can't do that by clicking on the file names in the GUI view you have. You will have to do it on the command line, using cat R1_file1.fq.gz R1_file2.fq.gz > R1_one.fq.gz.

Aligning them in pieces and then combining the resulting BAM files is a valid option too.

written 3 months ago by genomax 54k
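For anyone unsure whether gzip-compressed files can really be joined with plain cat: they can, because a concatenation of gzip streams is itself a valid gzip file. Here is a minimal sketch on toy data. The temporary directory and the tiny one-read files are made up for illustration; only the file names follow the cat command suggested above.

```shell
#!/bin/sh
# Toy demonstration; the reads and the temporary directory are invented for illustration.
set -eu
workdir=$(mktemp -d)

# Two tiny gzip-compressed FASTQ files standing in for R1 from two lanes.
printf '@read1/1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/R1_file1.fq.gz"
printf '@read2/1\nTTGA\n+\nIIII\n' | gzip -c > "$workdir/R1_file2.fq.gz"

# Concatenate the compressed files directly; no decompression is needed.
cat "$workdir/R1_file1.fq.gz" "$workdir/R1_file2.fq.gz" > "$workdir/R1_one.fq.gz"

# The result decompresses as one continuous FASTQ stream.
merged=$(gzip -dc "$workdir/R1_one.fq.gz")
printf '%s\n' "$merged"
```

The same pattern applies to the R2 files; just keep the input order identical for both.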

I expect L1/L2 are lanes (renamed from L001 and L002, though I haven't a clue why they did that).

written 3 months ago by Devon Ryan 82k

Many thanks for your reply.

For my samples, I should combine L1_1 with L1_2 and L2_1 with L2_2, right?

written 3 months ago by mostafarafiepour 0

If what @Devon says is right, i.e. L1_1 is actually L001_R1 and L2_1 is L002_R1, then you would combine L1_1 with L2_1, and L1_2 with L2_2.

written 3 months ago by genomax 54k
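A rough sketch of that pairing, assuming L1/L2 really are lanes and _1/_2 are the forward and reverse mates. The sample_L*_*.fq.gz names and the toy reads are hypothetical. The point is that the lanes are concatenated in the same order for both mates, so read N in the merged _1 file still pairs with read N in the merged _2 file.

```shell
#!/bin/sh
# Hypothetical file names modeled on the L1_1 / L1_2 / L2_1 / L2_2 pattern.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Toy inputs: one read pair per lane.
for lane in 1 2; do
  for mate in 1 2; do
    printf '@lane%s_read/%s\nACGT\n+\nIIII\n' "$lane" "$mate" \
      | gzip -c > "sample_L${lane}_${mate}.fq.gz"
  done
done

# Merge lanes within each mate, always in the same lane order (L1 then L2).
for mate in 1 2; do
  cat "sample_L1_${mate}.fq.gz" "sample_L2_${mate}.fq.gz" > "sample_merged_${mate}.fq.gz"
done

# Sanity check: both merged files should hold the same number of reads.
count_reads() { echo $(( $(gzip -dc "$1" | wc -l) / 4 )); }
n1=$(count_reads sample_merged_1.fq.gz)
n2=$(count_reads sample_merged_2.fq.gz)
echo "mate 1: $n1 reads, mate 2: $n2 reads"
```

If the two counts differ, the merged files are out of sync and should not be passed to a paired-end aligner.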

Since these are not fasta files based on their names, I edited the title of the post to accurately reflect file type.

Also see: mapping form different lane

modified 3 months ago • written 3 months ago by genomax 54k

If you want to do variant calling with GATK (maybe other tools?), it is important to map them separately and add relevant read group information to each file.

written 3 months ago by h.mon 18k

Is there a lane component to the read group?

written 3 months ago by Devon Ryan 82k

Yes and no (and take into account that I have barely done any variant calling; my knowledge is second-hand). Lane is a possible source of technical variation, so if the information is present, it is used for base recalibration. One can add the lane using the PU field. Even if you don't add PU, you should make the read group ID unique for the same sample on different lanes. Here is a snippet from the GATK forum:

PU = Platform Unit

The PU holds three types of information: {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. Although the PU is not required by GATK, it takes precedence over ID for base recalibration if it is present. In the example shown earlier, two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.

written 3 months ago by h.mon 18k
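To make the per-lane read group idea concrete, here is a small sketch of a helper that builds a read group string with a lane-specific ID and a PU in the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE} layout quoted above. The make_rg function is hypothetical, and the flow cell and library values are only guessed from the file names in this thread; they may not be what those fields actually mean. The separators are literal "\t" sequences, which bwa mem -R converts to tabs.

```shell
#!/bin/sh
# make_rg is a hypothetical helper; argument order and ID layout are assumptions.
# It prints a read group line with literal "\t" separators for use with bwa mem -R.
make_rg() {
  sample=$1; flowcell=$2; lane=$3; library=$4
  printf '@RG\\tID:%s.%s\\tPU:%s.%s.%s\\tSM:%s\\tLB:%s\\tPL:ILLUMINA\n' \
    "$flowcell" "$lane" "$flowcell" "$lane" "$library" "$sample" "$library"
}

# Values guessed from the file name BBKHU01_F_D17073416_H353KDMXX_L1_1 (assumption).
rg_lane1=$(make_rg BBKHU01_F H353KDMXX 1 D17073416)
rg_lane2=$(make_rg BBKHU01_F H353KDMXX 2 D17073416)
printf '%s\n' "$rg_lane1" "$rg_lane2"
```

Both lanes share SM and LB, so downstream tools treat them as one sample and one library, while ID and PU keep the lanes distinguishable.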

Thanks. I wonder how relevant this is these days. Do people even do base recalibration anymore? I too haven't done much in the way of variant calling lately, in case you can't tell :)

written 3 months ago by Devon Ryan 82k

Yes, I want to use GATK for variant calling. So should I now map each file separately to the reference genome? I do not understand exactly what you mean. In other words, I do not know which files I should map first.

Is my command line correct?

For L1_1/L2_1:

bwa mem -t 12 -M -R "@RG\tID: BBKHU01_F\tLB: BBKHU01_ F\tPL:ILLUMINA\tSM: BBKHU01_F" /home/m.rafiepour222/GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.fa /home/m.rafiepour222/BBKHU01_F/BBKHU01_F_D17073416_H353KDMXX_L1_1.clean.fq.gz /home/m.rafiepour222/BBKHU01_F/BBKHU01_F_D17073416_H353KDMXX_L2_1.clean.fq.gz > BBKHU01_F.sam_1

And for L1_2/L2_2:

bwa mem -t 12 -M -R "@RG\tID: BBKHU01_F\tLB: BBKHU01_ F\tPL:ILLUMINA\tSM: BBKHU01_F" /home/m.rafiepour222/GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.fa /home/m.rafiepour222/BBKHU01_F/BBKHU01_F_D17073416_H353KDMXX_L1_2.clean.fq.gz /home/m.rafiepour222/BBKHU01_F/BBKHU01_F_D17073416_H353KDMXX_L2_2.clean.fq.gz > BBKHU01_F.sam_1

If it is not correct, please tell me which command line I should use.

modified 3 months ago by genomax 54k • written 3 months ago by mostafarafiepour 0

OK, so I map L1_1/L2_1 to the reference genome once and then L1_2/L2_2 once, just like the command line you edited for me, right?

Then I convert the SAM output of each step to BAM and combine the two BAM files? Am I correct?

modified 3 months ago • written 3 months ago by mostafarafiepour 0
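For the map-separately route, one point worth spelling out: when bwa mem is given two FASTQ files, it treats them as the two mates of a paired-end run. So, assuming L1/L2 are lanes and _1/_2 are forward and reverse reads, each lane's _1 and _2 files would go into the same bwa mem call, with one call per lane. The sketch below is a dry run that only prints the commands, so it works without bwa or samtools installed. The reference and sample file names are hypothetical, and the read group is kept minimal.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
# reference.fa and the sampleA_* file names are hypothetical placeholders.
ref=reference.fa
sample=sampleA

lane_commands() {
  lane=$1
  # Read group ID is made unique per lane; "\t" stays literal for bwa mem -R.
  rg="@RG\\tID:${sample}.L${lane}\\tSM:${sample}\\tLB:${sample}\\tPL:ILLUMINA"
  # Forward (_1) and reverse (_2) reads of the SAME lane are aligned together.
  printf '%s\n' "bwa mem -t 12 -M -R '${rg}' ${ref} ${sample}_L${lane}_1.fq.gz ${sample}_L${lane}_2.fq.gz | samtools sort -o ${sample}_L${lane}.bam -"
}

plan=$(
  lane_commands 1
  lane_commands 2
  # Combine the per-lane BAM files once both lanes are aligned and sorted.
  printf '%s\n' "samtools merge ${sample}.bam ${sample}_L1.bam ${sample}_L2.bam"
)
printf '%s\n' "$plan"
```

Piping straight into samtools sort skips the intermediate SAM file; writing SAM first and converting afterwards gives the same result with more disk use.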