Question: How to fix mismatch between snp file and reference?
Sharon wrote (17 months ago):

Hi everyone, I am doing RNA-seq variant calling on mouse. I am using the mouse reference, indels, and SNPs from the same source [ftp://ftp-mouse.sanger.ac.uk/]. The indels file (indels.dbSNP142) did not cause any issues with the indel realigner, but the SNPs file (snps.dbSNP142) makes BaseRecalibrator fail. The command and the error:

java -jar ${GATK}/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R ${WHOLEGENOME} \
    -I ${WHERE}/${CURRENT}-realigned.bam \
    -knownSites ${DBSNP} \
    -o ${WHERE}/${CURRENT}.recal_data.table

ERROR MESSAGE: Input files snps.dbSNP142.vcf and reference have incompatible contigs. Error details: The contig order in snps.dbSNP142.vcf and reference is not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
##### ERROR snps.dbSNP142.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, X, Y, MT]
##### ERROR reference contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9, MT, X, Y, JH584295.1, JH584292.1, GL456368.1, GL456396.1, GL456359.1, GL456382.1, GL456392.1, GL456394.1, GL456390.1, GL456387.1, GL456381.1, GL456370.1, GL456372.1, GL456389.1, GL456378.1, GL456360.1, GL456385.1, GL456383.1, GL456213.1, GL456239.1, GL456367.1, GL456366.1, GL456393.1, GL456216.1, GL456379.1, JH584304.1, GL456212.1, JH584302.1, JH584303.1, GL456210.1, GL456219.1, JH584300.1, JH584298.1, JH584294.1, GL456354.1, JH584296.1, JH584297.1, GL456221.1, JH584293.1, GL456350.1, GL456211.1, JH584301.1, GL456233.1, JH584299.1]

I also tried this, following the link in the error message:

java -jar ${PICARD}/picard.jar SortVcf \
        I= ${DBSNP} \
        O= sorted.vcf \
        SEQUENCE_DICTIONARY= GRCm38_68.dict

But then I got:

Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=X,length=171031299,dict_index=19,assembly=null) was found when SAMSequenceRecord(name=MT,length=16299,dict_index=19,assembly=null) was expected.
    at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:126)
    at picard.vcf.SortVcf.doWork(SortVcf.java:95)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:228)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=X,length=171031299,dict_index=19,assembly=null) was found when SAMSequenceRecord(name=MT,length=16299,dict_index=19,assembly=null) was expected.
    at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:170)
    at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:124)
    ... 4 more
make: * [sortvcf] Error 1 sortvc

Any hint?

Thanks


When you did the original alignment you did not include unplaced and unlocalized contigs in your reference. The solution you linked to is only applicable when the sort order is wrong but there are no mismatches. I suppose you could remove lines with the offending references from your SNP reference.

written 17 months ago by genomax
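
One way to check which contig names actually differ between the two files is to compare the names used by the VCF records with the @SQ lines of the reference dictionary. The following is only a minimal sketch: the function name `missing_from_vcf` is made up for illustration, and it assumes an uncompressed VCF and a Picard-style .dict file such as the GRCm38_68.dict mentioned in the question.

```shell
#!/bin/sh
# Hypothetical helper (not part of GATK or Picard).
# Prints every contig that is named in the sequence dictionary ($2)
# but never appears in column 1 of a record in the VCF ($1).
# Assumes the VCF is uncompressed and contains at least one line.
missing_from_vcf() {
    awk -F'\t' '
        # First file (the VCF): remember each contig seen in a record.
        FNR == NR { if ($0 !~ /^#/) seen[$1] = 1; next }
        # Second file (the .dict): report SN: names that were never seen.
        $1 == "@SQ" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/) {
                    name = substr($i, 4)
                    if (!(name in seen)) print name
                }
        }
    ' "$1" "$2"
}

# Example call, using the file names from the question:
# missing_from_vcf snps.dbSNP142.vcf GRCm38_68.dict
```

An empty result would suggest that the two files name the same contigs and differ only in order.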

I did not understand this part: "unplaced and unlocalized"? Also, do you mean I should manually remove those extra entries from the SNP file? [JH584295.1, JH584292.1, GL456368.1, GL456396.1, GL456359.1, GL456382.1, GL456392.1, GL456394.1, GL456390.1, GL456387.1, GL456381.1, GL456370.1, GL456372.1, GL456389.1, GL456378.1, GL456360.1, GL456385.1, GL456383.1, GL456213.1, GL456239.1, GL456367.1, GL456366.1, GL456393.1, GL456216.1, GL456379.1, JH584304.1, GL456212.1, JH584302.1, JH584303.1, GL456210.1, GL456219.1, JH584300.1, JH584298.1, JH584294.1, GL456354.1, JH584296.1, JH584297.1, GL456221.1, JH584293.1, GL456350.1, GL456211.1, JH584301.1, GL456233.1, JH584299.1] Thanks

written 17 months ago by Sharon

As in the human genome, those GL* and JH* contigs are known to be present in the mouse genome, but their precise location is not known. Did you delete the index file before running SortVcf?

written 17 months ago by genomax

No, I did not delete anything.

written 17 months ago by Sharon

@Goutham says this in the post you linked above.

Note that you may need to delete the index file that gets created automatically for your new VCF by the Picard tool. GATK will automatically regenerate an index file for your VCF.

written 17 months ago by genomax
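
The index in that quote is the one stored next to the VCF itself, not anything belonging to the reference. As a small sketch, assuming an uncompressed VCF whose index sits beside it with an added .idx suffix (a bgzipped VCF would normally carry a .tbi index instead), removing it could look like this. The function name `remove_vcf_index` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: delete the index stored next to an uncompressed VCF
# (<file>.vcf.idx) so that the next tool reading the VCF rebuilds it.
# The VCF itself is left untouched; a missing index is not an error.
remove_vcf_index() {
    rm -f -- "$1.idx"
}

# Example call, using the output name from the SortVcf command above:
# remove_vcf_index sorted.vcf
```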

This is what I don't understand: which index do they mean? The index I downloaded with the reference? The index of the SNP file has already been deleted.

written 17 months ago by Sharon
ibelcarri wrote (17 months ago):

I had the exact same problem, and this post helped me: "C: How to sort a VCF file lexicographically by chromosome number?"


Thanks ibelcarri, I just used the post you mentioned, and it works now!

written 17 months ago by Sharon
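
The linked post is not reproduced in this thread, so the following is not its method. It is only a rough sketch of the general idea of putting VCF records into the contig order of the reference dictionary. The function name `reorder_vcf_by_dict` is made up for illustration, and it assumes an uncompressed VCF that is already position-sorted within each contig. Header lines are copied unchanged, so any ##contig lines would keep their old order and may need the same treatment.

```shell
#!/bin/sh
# Hypothetical helper: write a VCF ($1) to stdout with its records grouped
# in the contig order given by the @SQ lines of a sequence dictionary ($2).
# Records keep their original relative order within each contig.
# Records on contigs that are absent from the dictionary are dropped.
reorder_vcf_by_dict() {
    vcf="$1"
    dict="$2"
    tab=$(printf '\t')

    # Header lines first, unchanged.
    grep '^#' "$vcf"

    # Prefix each record with (contig rank in dict, original line number),
    # sort numerically on both keys, then strip the two helper columns.
    awk -F'\t' '
        FNR == NR {
            if ($1 == "@SQ")
                for (i = 2; i <= NF; i++)
                    if ($i ~ /^SN:/) rank[substr($i, 4)] = ++n
            next
        }
        /^#/ { next }
        ($1 in rank) { print rank[$1] "\t" FNR "\t" $0 }
    ' "$dict" "$vcf" | sort -t "$tab" -k1,1n -k2,2n | cut -f3-
}

# Example call, using the file names from the question:
# reorder_vcf_by_dict snps.dbSNP142.vcf GRCm38_68.dict > sorted.vcf
```

Keying the sort on the original line number avoids relying on a stable sort, which not every sort implementation provides.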