Question

BaseRecalibration step GATK

0

Entering edit mode

3.8 years ago

AR • 0

Hi all,

I am trying to replicate GATK WDL commands on bash for Novaseq whole genome sequencing data of humans. I ran scatter-gather python script provided by GATK. I was able to produce 18 different files as GATK team suggested. But when I am running the BaseRecalibrator step, with genomic intervals option (files generated from scatter-gather), BaseRecalibrator is not accepting these files and giving error: Invalid file and not in proper format. So, I went the other way around and used the WGS.intervals_list provided by GATK reference bundle GRCh38. I am able to run BaseRecalibrator with the WGS.intervals_list and it generated a single recalibration report. So, my confusion is:

Will it effect the overall output of my GATK pipeline if I am not using scatter-gather step and instead directly using WGS.intervals_list provided by GATK reference bundle GRCh38 for BaseRecalibrator? Other than speed (parallelism).
Why I am not able to use the scatter-gather output files for the BaseRecalibrator step?
What if I do not provide genomic interval list in -L option in BaseRecalibrator step? Will the recal_table be different from the one generated when provided with WGS.intervals_list provided by GATK reference bundle GRCh38?

For your reference, I am attaching here the scatter-gather script. Pardon me if my questions are too naive. I just want to understand the basics. I have searched everywhere for this.

> import sys
with open("/Users/AARTI/Desktop/hg38.dict", "r") as ref_dict_file:
    sequence_tuple_list = []
    longest_sequence = 0
    for line in ref_dict_file:
        if line.startswith("@SQ"):
            line_split = line.split("\t")
            # (Sequence_Name, Sequence_Length)
            sequence_tuple_list.append((line_split[1].split("SN:")[1], int(line_split[2].split("LN:")[1])))
    longest_sequence = sorted(sequence_tuple_list, key=lambda x: x[1], reverse=True)[0][1]
    print(sequence_tuple_list[0])
    print(longest_sequence)
# We are adding this to the intervals because hg38 has contigs named with embedded colons and a bug in GATK strips off
# the last element after a :, so we add this as a sacrificial element.
hg38_protection_tag = ":1+"
max_line = 0
# initialize the tsv string with the first sequence
tsv_string = sequence_tuple_list[0][0]
temp_size = sequence_tuple_list[0][1]
for sequence_tuple in sequence_tuple_list[1:]:
    if temp_size + sequence_tuple[1] <= longest_sequence and max_line <= 1600:
        temp_size += sequence_tuple[1]
        if not ":" in sequence_tuple[0]:
            tsv_string += "\t" + sequence_tuple[0]
        else:
            tsv_string += "\t" + sequence_tuple[0] + hg38_protection_tag
        max_line = max_line + 1
    else:
        if not ":" in sequence_tuple[0]:
            tsv_string += "\n" + sequence_tuple[0]
        else:
            tsv_string += "\n" + sequence_tuple[0] + hg38_protection_tag
        temp_size = sequence_tuple[1]
        max_line = 0
# add the unmapped sequences as a separate line to ensure that they are recalibrated as well
tsv_string += '\n' + "unmapped"
sequence_groups_unmapped = tsv_string.split("\n")
for i in range(0, len(sequence_groups_unmapped)):
    with open("sequence_grouping_with_unmapped_{0}.txt".format(i), "w") as tsv_file_with_unmapped:
        tsv_file_with_unmapped.write(sequence_groups_unmapped[i])
        tsv_file_with_unmapped.close()

Any help pls!!

SCATTER-GATHER GATK • 858 views

ADD COMMENT • link 3.8 years ago by AR • 0