6.0 years ago
mostafarafiepour
Hello,
I have four whole-genome sequencing data files, two of which are reverse and two forward, right? (Attached in the image.)
I want to map these files to the reference genome using bwa. Should I combine the two reverse files together and the two forward files together? If so, which files should I combine (according to the picture), and what command line should I use to combine them?
Or is there no need to combine them? How do I map them to the reference genome?
I am not sure why someone chose to change the R1/R2 nomenclature for read files to L1/L2, but you can combine the R1 files together and the R2 files together. You can't do that by clicking on the names in the GUI view you have. You will have to do it on the command line using:
cat R1_file1.fq.gz R1_file2.fq.gz > R1_one.fq.gz
Aligning them in pieces and then combining the resulting BAM files is a valid option too.
I expect L1/L2 are lanes (renamed from L001 and L002, though I haven't a clue why they did that).
Many thanks for your reply.
For my samples, I should combine L1_1/L1_2 together and L2_1/L2_2 together, right?
If what @Devon says is right, i.e. L1_1 is actually L001_R1 and L2_1 is L002_R1, then you would combine L1_1 with L2_1, then L1_2 with L2_2.
Since these are not fasta files based on their names, I edited the title of the post to accurately reflect the file type.
Also see: mapping form different lane
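To make the pairing concrete, here is a small sketch on toy data. It assumes, as discussed above, that L1/L2 are lanes and _1/_2 are forward/reverse reads; the `.fq.gz` extension, the `fastq_demo` directory, the `sample_R1`/`sample_R2` output names and the `count_reads` helper are made up for illustration. The one thing that matters is that both directions are concatenated in the same lane order, so that read pairs stay in sync.

```shell
set -eu
mkdir -p fastq_demo

# Toy single-read files standing in for the four real files (contents are made up).
printf '@r1/1\nACGT\n+\nIIII\n' | gzip > fastq_demo/L1_1.fq.gz
printf '@r1/2\nTGCA\n+\nIIII\n' | gzip > fastq_demo/L1_2.fq.gz
printf '@r2/1\nGGCC\n+\nIIII\n' | gzip > fastq_demo/L2_1.fq.gz
printf '@r2/2\nCCGG\n+\nIIII\n' | gzip > fastq_demo/L2_2.fq.gz

# Forward reads: lane 1 then lane 2. Reverse reads: same lane order.
cat fastq_demo/L1_1.fq.gz fastq_demo/L2_1.fq.gz > fastq_demo/sample_R1.fq.gz
cat fastq_demo/L1_2.fq.gz fastq_demo/L2_2.fq.gz > fastq_demo/sample_R2.fq.gz

# Both merged files should hold the same number of reads (4 lines per read).
count_reads() { echo $(( $(gzip -dc "$1" | wc -l) / 4 )); }
echo "R1 reads: $(count_reads fastq_demo/sample_R1.fq.gz)"
echo "R2 reads: $(count_reads fastq_demo/sample_R2.fq.gz)"
```

Concatenated gzip files are themselves a valid gzip stream, which is why plain `cat` works here without decompressing first.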
If you want to do variant calling with GATK (maybe other tools?), it is important to map them separately and add relevant read group information to each file.
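A rough sketch of what mapping separately with read groups could look like, assuming L1/L2 are lanes and _1/_2 are forward/reverse reads. The sample name `sampleA`, the library label, the `FLOWCELL` placeholder, `ref.fa`, the `.fq.gz` file names and the `make_rg` helper are all hypothetical; only `bwa mem -R` and `samtools sort -o` are real options. The mapping step is skipped if the tools or files are not found, so the script is safe to try as-is.

```shell
# Build a read-group string with a lane-specific ID and PU.
# Arguments: sample name, lane label, flowcell ID (placeholder by default).
# The \t sequences stay literal here; bwa mem -R converts them to tabs.
make_rg() {
  printf '@RG\\tID:%s.%s\\tSM:%s\\tLB:%s_lib1\\tPL:ILLUMINA\\tPU:%s.%s' \
    "$1" "$2" "$1" "$1" "${3:-FLOWCELL}" "$2"
}

ref=ref.fa   # hypothetical reference, assumed already indexed with `bwa index`

for lane in L1 L2; do
  rg=$(make_rg sampleA "$lane")
  printf 'lane %s read group: %s\n' "$lane" "$rg"
  # Only map when the tools, the bwa index and the lane's FASTQ pair are present.
  if command -v bwa >/dev/null 2>&1 && command -v samtools >/dev/null 2>&1 \
     && [ -f "$ref.bwt" ] && [ -f "${lane}_1.fq.gz" ] && [ -f "${lane}_2.fq.gz" ]; then
    bwa mem -R "$rg" "$ref" "${lane}_1.fq.gz" "${lane}_2.fq.gz" \
      | samtools sort -o "sampleA.${lane}.bam" -
  else
    echo "skipping bwa for $lane (tools or input files not found)"
  fi
done
```

Each lane gets its own ID while SM stays the same, so downstream tools can treat the lanes as one sample but still tell them apart.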
Is there a lane component to the read group?
Yes and no (and take into account that I have barely done any variant calling; my knowledge is second-hand). Lane is a possible source of technical variation, so if the information is present, it is used for base recalibration. One can add the lane using the PU tag. Even if you don't add the PU tag, one should make the RG ID unique for the same sample on different lanes. Here is a snippet from the GATK forum:
Thanks. I wonder how relevant this is these days. Do people even do base recalibration anymore? I too haven't done much in the way of variant calling lately, in case you can't tell :)
Yes, I want to use GATK for variant calling. So, should I now map each file separately to the reference genome? I do not understand exactly what you mean. In other words, I do not know which files I should map first.
Is my command line correct?
For L1_1/L2_1:
And for L1_2/L2_2:
If it is not correct, please tell me which command line I should use.
OK, so I map L1_1/L2_1 to the reference genome once, and then L1_2/L2_2, just like the command line you edited for me, right?
Then I convert the SAM output of each step to a BAM file and combine the two BAM files? Am I correct?
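For the convert-then-combine part, a minimal sketch with samtools might look like the following. The file names (`map1.sam`, `map2.sam`, `sampleA.merged.bam`) and the `run` and `sam_to_merged_bam` helpers are hypothetical, and the two SAM files are assumed to be the outputs of the two mapping steps. By default the script only prints the commands (dry run); set `DRY_RUN=0` to execute them once samtools is installed and the SAM files exist.

```shell
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: sam_to_merged_bam OUT.bam IN1.sam IN2.sam ...
# File names must not contain spaces in this simple sketch.
sam_to_merged_bam() {
  out=$1; shift
  bams=""
  for sam in "$@"; do
    bam="${sam%.sam}.sorted.bam"
    run samtools sort -o "$bam" "$sam"   # SAM -> coordinate-sorted BAM
    bams="$bams $bam"
  done
  run samtools merge -f "$out" $bams     # combine the sorted BAMs (-f overwrites)
  run samtools index "$out"              # index the merged BAM
}

sam_to_merged_bam sampleA.merged.bam map1.sam map2.sam
```

`samtools sort` reads SAM directly and writes BAM when the output name ends in `.bam`, so a separate conversion step is not strictly needed.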