6.5 years ago
omer.asr
Hey all, I have a FASTA file containing two sequences of the same locus, from two different bacterial strains. The sequences are ~5.5 kbp long and differ by ~100 SNPs. I'd like to generate a ROD file from it (one accepted as input by the --knownSites parameter of GATK's base quality score recalibration procedure). How can I do it?
Thanks, Omer
According to the GATK documentation, the --knownSites parameter accepts data in many formats.
Source: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php#--knownSites
Based on the data you've got, I would either aim to derive a VCF of the ~100 SNPs and supply that to --knownSites, or else generate a SAMtools pileup. I would also try SAMtools mpileup, which allows you to generate a single pileup from multiple samples. For either approach, you will of course first have to align your FASTA sequences to a reference bacterial genome (if one is available).
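To make the VCF idea concrete, here is a minimal Python sketch of how the SNPs could be written out as a sites-only VCF. It is not a GATK or SAMtools method, and it rests on several assumptions: the two sequences have already been aligned to each other (gaps written as "-"), the first one is treated as the reference, and the contig name you pass in matches the reference your reads were mapped to. The function names `aligned_snps` and `snps_to_vcf` and the contig name `locus1` are made up for illustration. Indels are skipped, and GATK may additionally want contig header lines and an index for the file.

```python
def aligned_snps(ref_aln, alt_aln):
    """Yield (1-based ungapped reference position, ref base, alt base)
    for every mismatching column of a pairwise alignment."""
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")
    pos = 0
    for r, a in zip(ref_aln.upper(), alt_aln.upper()):
        if r == "-":
            continue  # insertion relative to the reference: no reference coordinate
        pos += 1
        if a == "-" or r == a:
            continue  # deletion or match: not a SNP
        if r in "ACGT" and a in "ACGT":
            yield pos, r, a


def snps_to_vcf(ref_aln, alt_aln, chrom):
    """Return the text of a minimal sites-only VCF listing the SNPs."""
    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for pos, r, a in aligned_snps(ref_aln, alt_aln):
        # ID, QUAL, FILTER and INFO are left as missing values (".")
        lines.append(f"{chrom}\t{pos}\t.\t{r}\t{a}\t.\t.\t.")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy alignment; "locus1" is a placeholder contig name.
    print(snps_to_vcf("ACG-TA", "ATGCTG", "locus1"), end="")
```

Positions are counted on the ungapped first sequence, so the coordinates only make sense if that same sequence (or a genome containing it at known coordinates) is the reference used for read mapping.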