BaseRecalibration step GATK
0
0
Entering edit mode
3.8 years ago
AR • 0

Hi all,

I am trying to replicate GATK WDL commands on bash for Novaseq whole genome sequencing data of humans. I ran scatter-gather python script provided by GATK. I was able to produce 18 different files as GATK team suggested. But when I am running the BaseRecalibrator step, with genomic intervals option (files generated from scatter-gather), BaseRecalibrator is not accepting these files and giving error: Invalid file and not in proper format. So, I went the other way around and used the WGS.intervals_list provided by GATK reference bundle GRCh38. I am able to run BaseRecalibrator with the WGS.intervals_list and it generated a single recalibration report. So, my confusion is:

  1. Will it effect the overall output of my GATK pipeline if I am not using scatter-gather step and instead directly using WGS.intervals_list provided by GATK reference bundle GRCh38 for BaseRecalibrator? Other than speed (parallelism).

  2. Why I am not able to use the scatter-gather output files for the BaseRecalibrator step?

  3. What if I do not provide genomic interval list in -L option in BaseRecalibrator step? Will the recal_table be different from the one generated when provided with WGS.intervals_list provided by GATK reference bundle GRCh38?

For your reference, I am attaching here the scatter-gather script. Pardon me if my questions are too naive. I just want to understand the basics. I have searched everywhere for this.

> import sys
with open("/Users/AARTI/Desktop/hg38.dict", "r") as ref_dict_file:
    sequence_tuple_list = []
    longest_sequence = 0
    for line in ref_dict_file:
        if line.startswith("@SQ"):
            line_split = line.split("\t")
            # (Sequence_Name, Sequence_Length)
            sequence_tuple_list.append((line_split[1].split("SN:")[1], int(line_split[2].split("LN:")[1])))
    longest_sequence = sorted(sequence_tuple_list, key=lambda x: x[1], reverse=True)[0][1]
    print(sequence_tuple_list[0])
    print(longest_sequence)
# We are adding this to the intervals because hg38 has contigs named with embedded colons and a bug in GATK strips off
# the last element after a :, so we add this as a sacrificial element.
hg38_protection_tag = ":1+"
max_line = 0
# initialize the tsv string with the first sequence
tsv_string = sequence_tuple_list[0][0]
temp_size = sequence_tuple_list[0][1]
for sequence_tuple in sequence_tuple_list[1:]:
    if temp_size + sequence_tuple[1] <= longest_sequence and max_line <= 1600:
        temp_size += sequence_tuple[1]
        if not ":" in sequence_tuple[0]:
            tsv_string += "\t" + sequence_tuple[0]
        else:
            tsv_string += "\t" + sequence_tuple[0] + hg38_protection_tag
        max_line = max_line + 1
    else:
        if not ":" in sequence_tuple[0]:
            tsv_string += "\n" + sequence_tuple[0]
        else:
            tsv_string += "\n" + sequence_tuple[0] + hg38_protection_tag
        temp_size = sequence_tuple[1]
        max_line = 0
# add the unmapped sequences as a separate line to ensure that they are recalibrated as well
tsv_string += '\n' + "unmapped"
sequence_groups_unmapped = tsv_string.split("\n")
for i in range(0, len(sequence_groups_unmapped)):
    with open("sequence_grouping_with_unmapped_{0}.txt".format(i), "w") as tsv_file_with_unmapped:
        tsv_file_with_unmapped.write(sequence_groups_unmapped[i])
        tsv_file_with_unmapped.close()

Any help pls!!

SCATTER-GATHER GATK • 858 views
ADD COMMENT

Login before adding your answer.

Traffic: 2350 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6