Question: Generate a reference-ordered data (ROD) file from FASTA
19 months ago, omer.asr wrote:

Hey all, I have a FASTA file containing two sequences of the same locus, from two different bacterial strains. The sequences are ~5.5 kbp long and differ by ~100 SNPs. I'd like to generate from it a ROD file that is accepted as input by the --knownSites parameter of GATK's base quality score recalibration procedure. How can I do it?

Thanks, Omer


According to the GATK documentation, the --knownSites parameter accepts data in many formats:

--knownSites / -knownSites

A database of known polymorphic sites. This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference, so it is critical that a database of known polymorphic sites (e.g. dbSNP) is given to the tool in order to mask out those sites.

This argument supports reference-ordered data (ROD) files in the following formats: BCF2, BEAGLE, BED, BEDTABLE, EXAMPLEBINARY, RAWHAPMAP, REFSEQ, SAMPILEUP, SAMREAD, TABLE, VCF, VCF3


Based on the data you have, I would either derive a VCF of the ~100 SNPs and supply that to --knownSites, or generate a SAMtools pileup. SAMtools mpileup is also worth trying, since it can generate a single pileup from multiple samples. Either way, you will first have to align your FASTA sequences to a reference bacterial genome, if one is available.
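To illustrate the VCF route, here is a minimal Python sketch, not a definitive implementation. It assumes the two sequences have already been aligned to each other (equal length, gaps written as `-`), that the first sequence serves as the reference, and that only single-base substitutions are of interest (indels are skipped). The function names `read_fasta`, `aligned_snps` and `to_vcf` are hypothetical helpers made up for this example; they are not part of GATK or SAMtools.

```python
def read_fasta(lines):
    """Parse an iterable of FASTA lines into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def aligned_snps(ref_seq, alt_seq):
    """Return (position, ref_base, alt_base) for each substitution.

    Assumes both sequences come from the same pairwise alignment.
    Positions are 1-based and counted on the ungapped reference.
    """
    if len(ref_seq) != len(alt_seq):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    pos = 0
    for ref_base, alt_base in zip(ref_seq, alt_seq):
        if ref_base == "-":
            continue  # insertion relative to the reference: no reference coordinate
        pos += 1
        if ref_base != alt_base and ref_base in "ACGT" and alt_base in "ACGT":
            snps.append((pos, ref_base, alt_base))
    return snps


def to_vcf(chrom, ref_seq, alt_seq):
    """Build a minimal sites-only VCF (as a string) listing the SNPs."""
    ref_len = len(ref_seq.replace("-", ""))
    out = [
        "##fileformat=VCFv4.2",
        f"##contig=<ID={chrom},length={ref_len}>",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for pos, ref_base, alt_base in aligned_snps(ref_seq, alt_seq):
        out.append(f"{chrom}\t{pos}\t.\t{ref_base}\t{alt_base}\t.\t.\t.")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Toy aligned pair standing in for the two ~5.5 kbp strain sequences.
    demo = [">strainA", "ACGTACGT", ">strainB", "ACCTACGA"]
    (ref_name, ref_seq), (_, alt_seq) = read_fasta(demo)
    print(to_vcf(ref_name, ref_seq, alt_seq), end="")
```

The value written to the CHROM column has to match the contig name in the reference FASTA that your reads were aligned against, and the positions must be in that reference's coordinates, so check both before passing the file to --knownSites.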

written 19 months ago by Kevin Blighe