Question: Generate a reference-ordered data (ROD) file from FASTA
omer.asr (Israel) wrote, 2.0 years ago:

Hey all, I have a FASTA file containing two sequences of the same locus from two different bacterial strains. The sequences are ~5.5 kbp long and differ by ~100 SNPs. I'd like to generate from it a ROD file that is acceptable as input to the --knownSites parameter of GATK's base quality score recalibration procedure. How can I do this?

Thanks, Omer


According to the GATK documentation, the --knownSites parameter accepts data in many formats:

--knownSites / -knownSites

A database of known polymorphic sites. This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference, so it is critical that a database of known polymorphic sites (e.g. dbSNP) is given to the tool in order to mask out those sites.

This argument supports reference-ordered data (ROD) files in the following formats: BCF2, BEAGLE, BED, BEDTABLE, EXAMPLEBINARY, RAWHAPMAP, REFSEQ, SAMPILEUP, SAMREAD, TABLE, VCF, VCF3

Source: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php#--knownSites

Based on your data, I would either derive a VCF of the ~100 SNPs and supply that to --knownSites, or generate a SAMtools pileup. SAMtools mpileup is also worth trying, as it can produce a single pileup from multiple samples. Either way, you will first have to align your FASTA sequences to a reference bacterial genome (if one is available).
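As a rough illustration of the VCF route, here is a minimal Python sketch. It rests on several assumptions that are not in the original post: the two sequences are already aligned and of equal length (substitutions only, no indels), the first sequence in the FASTA is the one used as the reference for recalibration, and its FASTA name matches the contig name in that reference. The helper names `parse_fasta` and `snps_to_vcf` are hypothetical, and the output is only a bare-bones sites-only VCF; GATK may need further preparation (such as indexing) before it accepts the file.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def snps_to_vcf(ref_name, ref_seq, alt_seq):
    """Build sites-only VCF text listing every single-base mismatch.

    Assumes ref_seq and alt_seq are pre-aligned and equally long (no indels).
    """
    if len(ref_seq) != len(alt_seq):
        raise ValueError("sequences must be aligned to the same length")
    lines = [
        "##fileformat=VCFv4.2",
        f"##contig=<ID={ref_name},length={len(ref_seq)}>",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # VCF positions are 1-based.
    for pos, (ref_base, alt_base) in enumerate(zip(ref_seq, alt_seq), start=1):
        # Skip ambiguous bases (e.g. N) and report only plain substitutions.
        if ref_base != alt_base and ref_base in "ACGT" and alt_base in "ACGT":
            lines.append(f"{ref_name}\t{pos}\t.\t{ref_base}\t{alt_base}\t.\t.\t.")
    return "\n".join(lines) + "\n"


# Toy input standing in for the real two-strain FASTA file.
demo = ">strainA\nACGTACGT\n>strainB\nACCTACGA\n"
(ref_name, ref_seq), (_, alt_seq) = parse_fasta(demo)
vcf_text = snps_to_vcf(ref_name, ref_seq, alt_seq)
print(vcf_text)
```

If the two sequences differ in length or contain indels, this direct comparison does not apply, and aligning to the reference followed by variant calling, as suggested above, is the safer path.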

written 2.0 years ago by Kevin Blighe