Question: How to fix mismatch between snp file and reference?
13 months ago, Sharon wrote:

Hi everyone, I am doing RNA-seq variant calling on mouse. I am using the reference, indels, and SNPs for mouse, all from the same source [ftp://ftp-mouse.sanger.ac.uk/]. The indels file (indels.dbSNP142) did not cause any issues with IndelRealigner, but the SNPs file (snps.dbSNP142) throws the following error with BaseRecalibrator:

java -jar ${GATK}/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R ${WHOLEGENOME} \
    -I ${WHERE}/${CURRENT}-realigned.bam \
    -knownSites ${DBSNP} \
    -o ${WHERE}/${CURRENT}.recal_data.table

ERROR MESSAGE: Input files snps.dbSNP142.vcf and reference have incompatible contigs. Error details: The contig order in snps.dbSNP142.vcf and reference is not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
##### ERROR snps.dbSNP142.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, X, Y, MT]
##### ERROR reference contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9, MT, X, Y, JH584295.1, JH584292.1, GL456368.1, GL456396.1, GL456359.1, GL456382.1, GL456392.1, GL456394.1, GL456390.1, GL456387.1, GL456381.1, GL456370.1, GL456372.1, GL456389.1, GL456378.1, GL456360.1, GL456385.1, GL456383.1, GL456213.1, GL456239.1, GL456367.1, GL456366.1, GL456393.1, GL456216.1, GL456379.1, JH584304.1, GL456212.1, JH584302.1, JH584303.1, GL456210.1, GL456219.1, JH584300.1, JH584298.1, JH584294.1, GL456354.1, JH584296.1, JH584297.1, GL456221.1, JH584293.1, GL456350.1, GL456211.1, JH584301.1, GL456233.1, JH584299.1]

I also tried this, as suggested in the link from the error message:

java -jar ${PICARD}/picard.jar SortVcf \
    I=${DBSNP} \
    O=sorted.vcf \
    SEQUENCE_DICTIONARY=GRCm38_68.dict

But then I got:

Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=X,length=171031299,dict_index=19,assembly=null) was found when SAMSequenceRecord(name=MT,length=16299,dict_index=19,assembly=null) was expected.
    at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:126)
    at picard.vcf.SortVcf.doWork(SortVcf.java:95)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:228)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=X,length=171031299,dict_index=19,assembly=null) was found when SAMSequenceRecord(name=MT,length=16299,dict_index=19,assembly=null) was expected.
    at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:170)
    at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:124)
    ... 4 more
make: * [sortvcf] Error 1 sortvc

Any hint?

Thanks


When you did the original alignment you did not include unplaced and unlocalized contigs in your reference. The solution you linked to is only applicable when the sort order is wrong but there are no mismatches. I suppose you could remove lines with the offending references from your SNP reference.

modified 13 months ago • written 13 months ago by genomax
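To tell whether the problem is contig order, missing contigs, or both, it can help to list the contig names from each file. The following is a minimal sketch, assuming a Picard-style sequence dictionary (@SQ lines with SN: fields) and an uncompressed VCF. The function names dict_contigs, vcf_contigs, and same_relative_order are made up for illustration and are not part of GATK or Picard, and the order test only approximates what the GATK error message describes.

```shell
#!/bin/sh
# Hypothetical helpers (not part of GATK or Picard).

# Print contig names from a sequence dictionary: the SN: field of each @SQ line.
dict_contigs() {
    grep '^@SQ' "$1" | tr '\t' '\n' | grep '^SN:' | cut -c4-
}

# Print contig names from the VCF records, in order of first appearance.
vcf_contigs() {
    grep -v '^#' "$1" | cut -f1 | awk '!seen[$0]++'
}

# Succeed if the contigs used in the VCF appear in the same relative order
# as in the dictionary. Extra contigs in the dictionary are ignored.
same_relative_order() {
    ref_list=$(dict_contigs "$1")
    vcf_list=$(vcf_contigs "$2")
    # Keep only dictionary contigs that also occur in the VCF, in dictionary order.
    shared=$(printf '%s\n' "$ref_list" | grep -Fx -e "$vcf_list")
    [ "$shared" = "$vcf_list" ]
}
```

For example, with the file names from this thread: same_relative_order GRCm38_68.dict snps.dbSNP142.vcf && echo "order OK" || echo "order differs".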

I did not understand this part: "unplaced and unlocalized"? Also, do you mean I should manually remove those extra entries from the SNP file? [JH584295.1, JH584292.1, GL456368.1, GL456396.1, GL456359.1, GL456382.1, GL456392.1, GL456394.1, GL456390.1, GL456387.1, GL456381.1, GL456370.1, GL456372.1, GL456389.1, GL456378.1, GL456360.1, GL456385.1, GL456383.1, GL456213.1, GL456239.1, GL456367.1, GL456366.1, GL456393.1, GL456216.1, GL456379.1, JH584304.1, GL456212.1, JH584302.1, JH584303.1, GL456210.1, GL456219.1, JH584300.1, JH584298.1, JH584294.1, GL456354.1, JH584296.1, JH584297.1, GL456221.1, JH584293.1, GL456350.1, GL456211.1, JH584301.1, GL456233.1, JH584299.1] Thanks

written 13 months ago by Sharon

As with the human genome, those GL* and JH* contigs are known to be present in the mouse genome, but their precise location is not known. Did you delete the index file before running SortVcf?

written 13 months ago by genomax

No, I did not delete anything.

written 13 months ago by Sharon

@Goutham says this in the post you linked above.

Note that you may need to delete the index file that gets created automatically for your new VCF by the Picard tool. GATK will automatically regenerate an index file for your VCF.

written 13 months ago by genomax
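The index the quoted note refers to is the one created next to the new VCF, not the index of the reference. A minimal sketch of removing it, assuming the index is named after the VCF with an extra .idx suffix (worth checking in your own directory before deleting anything); remove_vcf_index is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: delete the index that sits next to a VCF.
remove_vcf_index() {
    vcf=$1
    # Assumed naming: <file>.vcf.idx in the same directory as the VCF.
    # -f keeps this quiet if no index exists; the VCF itself is not touched.
    rm -f -- "${vcf}.idx"
}
```

With the output name used in the SortVcf command above, this would be called as: remove_vcf_index sorted.vcf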

This is what I don't understand: which index do they mean? The index I downloaded with the reference? The index of the SNP file has already been deleted.

modified 13 months ago • written 13 months ago by Sharon
13 months ago, ibelcarri wrote:

I had the exact same problem, and this post helped me: "How to sort a VCF file lexicographically by chromosome number?"

written 13 months ago by ibelcarri
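The linked post is not reproduced in this thread, so the following is only a sketch of one common way to sort a VCF lexicographically by chromosome, assuming an uncompressed VCF with tab-separated records. The helper name sort_vcf_lexicographic is hypothetical. In the C locale, a plain text sort on the first column gives the order 1, 10, 11, ..., 19, 2, ..., 9, MT, X, Y, which is the order of the main chromosomes listed for the reference in the error message.

```shell
#!/bin/sh
# Hypothetical helper: rewrite a VCF with records sorted by contig name
# (byte-wise, as text) and then by position (numerically).
sort_vcf_lexicographic() {
    in_vcf=$1
    out_vcf=$2
    tab=$(printf '\t')
    {
        # Header lines (starting with '#') are copied first, in their original order.
        grep '^#' "$in_vcf"
        # Records: column 1 compared as text, column 2 compared as a number.
        grep -v '^#' "$in_vcf" | LC_ALL=C sort -t "$tab" -k1,1 -k2,2n
    } > "$out_vcf"
}
```

Usage would look like: sort_vcf_lexicographic snps.dbSNP142.vcf sorted.vcf. Note that this sketch leaves any ##contig header lines untouched, so if the header lists contigs in the old order, those lines may need to be reordered or regenerated separately.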

Thanks ibelcarri, I just followed the post you mentioned, and it works now!

written 13 months ago by Sharon