Hi all,
I am trying to replicate GATK WDL commands on bash for Novaseq whole genome sequencing data of humans. I ran scatter-gather python script provided by GATK. I was able to produce 18 different files as GATK team suggested. But when I am running the BaseRecalibrator step, with genomic intervals option (files generated from scatter-gather), BaseRecalibrator is not accepting these files and giving error: Invalid file and not in proper format. So, I went the other way around and used the WGS.intervals_list provided by GATK reference bundle GRCh38. I am able to run BaseRecalibrator with the WGS.intervals_list and it generated a single recalibration report. So, my confusion is:
Will it effect the overall output of my GATK pipeline if I am not using scatter-gather step and instead directly using WGS.intervals_list provided by GATK reference bundle GRCh38 for BaseRecalibrator? Other than speed (parallelism).
Why I am not able to use the scatter-gather output files for the BaseRecalibrator step?
What if I do not provide genomic interval list in -L option in BaseRecalibrator step? Will the recal_table be different from the one generated when provided with WGS.intervals_list provided by GATK reference bundle GRCh38?
For your reference, I am attaching here the scatter-gather script. Pardon me if my questions are too naive. I just want to understand the basics. I have searched everywhere for this.
> import sys
with open("/Users/AARTI/Desktop/hg38.dict", "r") as ref_dict_file:
sequence_tuple_list = []
longest_sequence = 0
for line in ref_dict_file:
if line.startswith("@SQ"):
line_split = line.split("\t")
# (Sequence_Name, Sequence_Length)
sequence_tuple_list.append((line_split[1].split("SN:")[1], int(line_split[2].split("LN:")[1])))
longest_sequence = sorted(sequence_tuple_list, key=lambda x: x[1], reverse=True)[0][1]
print(sequence_tuple_list[0])
print(longest_sequence)
# We are adding this to the intervals because hg38 has contigs named with embedded colons and a bug in GATK strips off
# the last element after a :, so we add this as a sacrificial element.
hg38_protection_tag = ":1+"
max_line = 0
# initialize the tsv string with the first sequence
tsv_string = sequence_tuple_list[0][0]
temp_size = sequence_tuple_list[0][1]
for sequence_tuple in sequence_tuple_list[1:]:
if temp_size + sequence_tuple[1] <= longest_sequence and max_line <= 1600:
temp_size += sequence_tuple[1]
if not ":" in sequence_tuple[0]:
tsv_string += "\t" + sequence_tuple[0]
else:
tsv_string += "\t" + sequence_tuple[0] + hg38_protection_tag
max_line = max_line + 1
else:
if not ":" in sequence_tuple[0]:
tsv_string += "\n" + sequence_tuple[0]
else:
tsv_string += "\n" + sequence_tuple[0] + hg38_protection_tag
temp_size = sequence_tuple[1]
max_line = 0
# add the unmapped sequences as a separate line to ensure that they are recalibrated as well
tsv_string += '\n' + "unmapped"
sequence_groups_unmapped = tsv_string.split("\n")
for i in range(0, len(sequence_groups_unmapped)):
with open("sequence_grouping_with_unmapped_{0}.txt".format(i), "w") as tsv_file_with_unmapped:
tsv_file_with_unmapped.write(sequence_groups_unmapped[i])
tsv_file_with_unmapped.close()
Any help pls!!