problem with vcf file
7.0 years ago

Hi all,

I have a problem with a VCF file: some lines contain a dash (-) in one or two columns.

I have been trying to remove or replace these dashes, but I could not do it. Could anyone please help with that?

Thanks in advance,

Ahmed

vcf linux SNP • 3.4k views

VCF is a highly structured format. Can you please provide information on:

  • which columns are affected

  • how you generated the VCF (or where you downloaded it from)

  • how you are trying to remove them


Can you please add more details to your question? What is the error message? What are you trying to do?


I am trying to fix a dbSNP file that was downloaded from NCBI. The problem is in columns 4 and 5: towards the end of the file there is a dash (-) in REF or ALT. When I try to validate the variants, I get this error:

##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 13866253: unparsable vcf record with allele -

When I check that line, I find the (-) in column 5; some other lines have the same in column 4.

mbxao2@kraken[R-drive] sed -n '13866253p' dbsnp_sorted.vcf
33  1421396 rs15996913  T   -,C .   .   RSPOS=1421396;GENEINFO=426899:FAIM2;dbSNPBuildID=122;SAO=0;VC=in-del;VLD;VP=050000800005040100000210
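
To see how widespread the problem is before editing anything, a minimal sketch like the following lists every record whose REF or ALT contains a bare dash allele. It assumes an uncompressed, tab-separated VCF; `toy.vcf` and the function name `find_dash_alleles` are made up for illustration and are not the real dbSNP download.

```shell
# Build a tiny, hypothetical VCF to demonstrate on (not real dbSNP data).
printf '##fileformat=VCFv4.0\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy.vcf
printf '33\t100\trs1\tC\tT\t.\t.\tVC=snp\n' >> toy.vcf
printf '33\t200\trs2\tT\t-,C\t.\t.\tVC=in-del\n' >> toy.vcf
printf '33\t300\trs3\t-\tA\t.\t.\tVC=in-del\n' >> toy.vcf

# Print "line number, REF, ALT" for records with a dash allele.
# ALT may hold several comma-separated alleles, so test each one.
find_dash_alleles() {
  awk -F'\t' '
    /^#/ { next }                       # skip header lines
    {
      bad = ($4 == "-")
      n = split($5, alt, ",")
      for (i = 1; i <= n; i++) if (alt[i] == "-") bad = 1
      if (bad) print NR "\t" $4 "\t" $5
    }' "$1"
}

find_dash_alleles toy.vcf
```

On the toy file this reports the two in-del records (file lines 4 and 5). On a gzip-compressed file, the same awk program could read from `gzip -dc file.vcf.gz` instead.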

I have tried the commands below, but they did not work:

sed -e 's/- / /g' test251.vcf > test252.vcf

awk '$4 != "-" && $5  != "-"' 00-All_mod.vcf > indel2.txt
awk '$4 != "-"' {print} 00-All_mod.vcf > trial22.vcf
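
A likely reason these attempts fail: VCF columns are separated by tabs, so `s/- / /` (dash followed by a space) never matches; the ALT value on the line shown earlier is `-,C`, which is not equal to `-`, so the awk test lets it through; and in the last command `{print}` sits outside the quoted awk program, so awk treats it as an input file name. A minimal sketch of a filter that handles comma-separated ALT alleles is below. Note that it drops the offending records rather than repairing them, and that `toy_in.vcf`, `toy_out.vcf` and `drop_dash_alleles` are hypothetical names.

```shell
# Hypothetical toy input (tab-separated); not real dbSNP data.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > toy_in.vcf
printf '33\t100\trs1\tC\tT\t.\t.\tVC=snp\n' >> toy_in.vcf
printf '33\t200\trs2\tT\t-,C\t.\t.\tVC=in-del\n' >> toy_in.vcf
printf '33\t300\trs3\t-\tA\t.\t.\tVC=in-del\n' >> toy_in.vcf

# Keep header lines, and keep a record only if neither REF nor any
# comma-separated ALT allele is a bare dash.
drop_dash_alleles() {
  awk -F'\t' '
    /^#/ { print; next }
    {
      if ($4 == "-") next
      n = split($5, alt, ",")
      for (i = 1; i <= n; i++) if (alt[i] == "-") next
      print
    }' "$1"
}

drop_dash_alleles toy_in.vcf > toy_out.vcf
```

Only the header and the `rs1` record remain in `toy_out.vcf`. Whether dropping these records is acceptable depends on whether the indels matter for the downstream analysis.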

I sense a disturbance in the force, precisely in the section "breakends" of the VCF manual:

https://samtools.github.io/hts-specs/VCFv4.2.pdf

I am not sure whether this will help you, but I would have a look at it; maybe it contains what you are looking for. I hope so, at least!

P.S. Generally I am not a fan of editing heavily formatted files with one-liners as if they were normal text files. Especially with the SAM and VCF formats, you never know everything about them. Every time a new discovery!


Where and how did you download your dbSNP file? I am unable to find even a single reference to 'rs15996913' with a Google search!

7.0 years ago

Those offending lines are all "in-dels", and they are formatted in an old style where the sequence context is not given. To reformat them correctly, you need to look at the corresponding position in the genome and find out which bases flank the variant. Alternatively, you may just filter out all the indels, if they are not critical for your analysis.

zgrep -v in-del vcf_chr_33.vcf.gz > chr33.snp.vcf
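
One caveat with a plain `zgrep -v in-del` is that it tests the whole line, so it would also drop any header line that happens to contain that string. A slightly more careful sketch, assuming the variant class is stored as `VC=in-del` in the INFO column (column 8) as in the record shown in the question, keeps all header lines and tests only that column. The file names here are hypothetical toy files.

```shell
# Hypothetical toy input, gzip-compressed like the per-chromosome file.
{
  printf '##fileformat=VCFv4.0\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '33\t100\trs1\tC\tT\t.\t.\tRSPOS=100;VC=snp;VLD\n'
  printf '33\t200\trs2\tT\t-,C\t.\t.\tRSPOS=200;VC=in-del;VLD\n'
} | gzip -c > toy_chr33.vcf.gz

# Keep headers; drop records whose INFO field carries VC=in-del.
gzip -dc toy_chr33.vcf.gz |
  awk -F'\t' '/^#/ || $8 !~ /(^|;)VC=in-del(;|$)/' > toy_chr33.snp.vcf
```

The result keeps both header lines and the `rs1` record. Like the original one-liner, this removes the in-dels instead of adding the missing context base.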

Thanks Santosh. When I ran your script, the result was very strange. See the error below:

90.1, AADN04023940.1, AADN04024429.1, AADN04024431.1, AADN04024444.1, AADN04024144.1, AADN04024001.1, AADN04014231.1, AADN04024498.1, AADN04024158.1, AADN04007914.1, AADN04024197.1, AADN04024254.1, AADN04024353.1, AADN04024314.1, AADN04024215.1, AADN04012257.1, AADN04024035.1, AADN04002975.1, AADN04024046.1, AADN04005144.1, AADN04023967.1, AADN04024299.1, AADN04024287.1, AADN04024354.1, AADN04024189.1, AADN04024268.1, AADN04005826.1, AADN04014001.1, AADN04023962.1, AADN04024296.1, AADN04005569.1, AADN04024134.1, AADN04023992.1, AADN04024274.1, AADN04005217.1, AADN04024275.1, AADN04007607.1, AADN04023976.1, AADN04024086.1, AADN04023974.1, AADN04024084.1, AADN04024045.1, AADN04024121.1, AADN04023941.1, AADN04023935.1, AADN04008379.1, AADN04024312.1, AADN04023978.1, AADN04024034.1, AADN04024375.1, AADN04024080.1, AADN04024118.1, AADN04024186.1, AADN04024070.1, AADN04024185.1, AADN04018283.1, AADN04023936.1, AADN04024255.1, AADN04024071.1, AADN04024100.1, AADN04024105.1, AADN04017094.1, AADN04020914.1, AADN04023942.1, AADN04024023.1, AADN04010121.1, AADN04024345.1, AADN04024305.1, AADN04024310.1, AADN04024358.1, AADN04024369.1, AADN04006246.1, AADN04023943.1, AADN04009947.1, AADN04024079.1, AADN04024313.1, AADN04024130.1, AADN04024309.1, AADN04023972.1, AADN04024089.1, AADN04024213.1, AADN04024292.1, AADN04024125.1, AADN04017325.1, AADN04024346.1, AADN04024441.1, AADN04024005.1, AADN04024020.1, AADN04024077.1, AADN04024009.1, AADN04024032.1, AADN04024192.1, AADN04024328.1, AADN04024038.1, AADN04023969.1, AADN04024326.1, AADN04024056.1, AADN04013227.1, AADN04024224.1, AADN04024243.1, AADN04024206.1, AADN04024246.1, AADN04024176.1, AADN04016551.1, AADN04024014.1, AADN04024129.1, AADN04024198.1, AADN04023983.1, AADN04024164.1, AADN04024167.1, AADN04023944.1, AADN04024374.1, AADN04020824.1, AADN04024306.1, AADN04024106.1, AADN04024145.1, AADN04024281.1, AADN04024351.1, AADN04020241.1, AADN04024322.1, AADN04024109.1, AADN04024141.1, AADN04024156.1, AADN04024360.1, 
AADN04023958.1, AADN04023959.1, AADN04023960.1, AADN04023961.1, AADN04023985.1, AADN04023986.1, AADN04023989.1, AADN04023990.1, AADN04024101.1, AADN04024102.1, AADN04024103.1, AADN04024104.1, AADN04024110.1, AADN04024112.1, AADN04024139.1, AADN04024140.1, AADN04024272.1, AADN04024273.1, AADN04024290.1, AADN04024291.1, AADN04024300.1, AADN04024301.1, AADN04024303.1, AADN04024304.1, AADN04024338.1, AADN04024339.1, AADN04024343.1, AADN04024344.1, AADN04024352.1, AADN04024355.1, AADN04024356.1, AADN04024361.1, AADN04024364.1, AADN04024365.1, AADN04024376.1, AADN04024377.1, AADN04024350.1, AADN04024181.1, AADN04024207.1, AADN04017424.1, AADN04024052.1, AADN04024147.1, AADN04024124.1, AADN04024237.1, AADN04023953.1, AADN04024044.1, AADN04023979.1, AADN04024219.1, AADN04024252.1, AADN04024119.1, AADN04024030.1, AADN04024049.1, AADN04024230.1, AADN04021209.1, AADN04024085.1, AADN04024262.1, AADN04024278.1, AADN04024000.1, AADN04024163.1, AADN04024263.1, AADN04024383.1, AADN04024228.1, AADN04024279.1, AADN04004651.1, AADN04024036.1, AADN04024209.1, AADN04024241.1, AADN04024212.1, AADN04024126.1, AADN04024155.1, AADN04023973.1, AADN04023981.1, AADN04024136.1, AADN04024217.1, AADN04024083.1, AADN04024072.1, AADN04024349.1, AADN04024076.1, AADN04024216.1, AADN04024251.1, AADN04024067.1, AADN04024316.1, AADN04016302.1, AADN04023971.1, AADN04024053.1, AADN04024233.1, AADN04024245.1, AADN04024261.1, AADN04024266.1, AADN04024152.1, AADN04024203.1, AADN04024229.1, AADN04024244.1, AADN04024123.1, AADN04024063.1, AADN04024327.1, AADN04010267.1, AADN04024091.1, AADN04024253.1, AADN04024039.1, AADN04024295.1, AADN04023987.1, AADN04024027.1, AADN04024293.1, AADN04024297.1, AADN04024081.1, AADN04024061.1, AADN04024068.1, AADN04023977.1, AADN04024367.1, AADN04023963.1, AADN04024107.1, AADN04024235.1, AADN04024382.1, AADN04023957.1, AADN04024127.1, AADN04024308.1, AADN04024173.1, AADN04023954.1, AADN04024078.1, AADN04024220.1, AADN04009592.1, AADN04024239.1, AADN04024298.1, AADN04024004.1, 
AADN04024006.1, AADN04024122.1, AADN04024099.1, AADN04024264.1, AADN04024318.1, AADN04024319.1, AADN04024116.1, AADN04024214.1, AADN04014749.1, AADN04005924.1, AADN04024307.1, AADN04010376.1, AADN04023956.1, AADN04024050.1, AADN04000887.1, AADN04024111.1, AADN04024280.1, AADN04024117.1, AADN04024221.1, AADN04024222.1, AADN04024342.1, AADN04024128.1, AADN04024146.1]
##### ERROR --------------------------------------------------------------------------------

That's weird! Where is that <h5> error coming from? Could you post your exact command line and the files (or part of the file) you are running it on?


That <h5> was a formatting error. I have fixed the formatting.


That makes no sense. Your output should have looked something like this for vcf_chr_33.vcf.gz, as noted in @Santosh's answer (if that is what you used): basically the same file, minus all lines that had VC=in-del.

33      1172178 rs3136901       C       T       .       .       RSPOS=1172178;GENEINFO=107055416:COPZ1;dbSNPBuildID=104;SAO=0;VC=snp;VP=050100000305000000000100
33      1188596 rs3137350       T       C       .       .       RSPOS=1188596;GENEINFO=107055417:LOC107055417;dbSNPBuildID=104;SAO=0;VC=snp;VP=050100000305000000000100
33      1358968 rs3137351       C       A       .       .       RSPOS=1358968;RV;dbSNPBuildID=104;SAO=0;VC=snp;VP=050100000005000000000100
33      135849  rs3137550       G       A       .       .       RSPOS=135849;RV;GENEINFO=426871:METTL7A|426872:TMPRSS12;dbSNPBuildID=104;SAO=0;VC=snp;VLD;VP=050100420005000000000100

I'm now wondering about the contents of your VCF. Could you post all the commands leading to the generation of your VCF (starting from the download step)?

7.0 years ago
wget ftp://ftp.ncbi.nih.gov/snp/organisms/chicken_9031/VCF/00-All.vcf.gz

gunzip 00-All.vcf.gz

cat 00-All.vcf |sed -r 's|SERPINB10 CPOX|SERPINB10_CPOX|; s|SET domain containing 5|SETD5|;' >check_all.vcf

sed -e 's/SET domain containing /SETdomaincontaining/g' check_all.vcf > test252.vcf

bgzip test252.vcf

tabix -p vcf test252.vcf.gz

vcf-sort Gallus_gallus.Gallus_gallus-5.0.dna.chromosome.1.dict  test252.vcf.gz > dbsnp_sorted.vcf.gz

java -d64 -Xmx48g -jar /home/mbxao2/R-drive/tools/GATK/GenomeAnalysisTK.jar -T ValidateVariants -R Gallus_gallus.Gallus_gallus-5.0.dna.chromosome.1.fa -V dbsnp_sorted.vcf.gz --validationTypeToExclude ALL

The error appears every time at the final stage.


You should have added this information against @Santosh's post above. Adding this as an answer throws off the logical flow of this thread. If you can move the content and delete this post that would be great.

It is a bit hard to tell but is this still related/in continuation of the original question you had asked?


It is still related to the question, because from the beginning I asked about the whole file. @Santosh checked the error and found it in chr33, so he posted his answer about chr33.


Yes but your information is not an answer so you shouldn't have posted it as an answer. So you should move the content to the appropriate place.


Then you should edit the original post and add/organize the information in such a way that the question and this entire thread makes sense to someone who will come by this in future. As things stand now, I have lost track of what is happening and others may have the same problem in future.


OK, now I see the complete picture. Unless you say which tool is generating the error, it is difficult to understand what is happening! Looking at your command line, I can see at least one error: you are using chr1 as the reference (-R Gallus_gallus.Gallus_gallus-5.0.dna.chromosome.1.fa), whereas your VCF covers all of the chromosomes. Your reference should contain at least all the chromosomes/contigs that the VCF file has. There might be other errors, but first see if this resolves the issue. If not, paste the complete GATK error output again. I'm quite sure that you have missed some part of the GATK error log.
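
One way to check this point is to compare the contig names used in the VCF with those in the reference index. The sketch below assumes a `.fai` index (as written by `samtools faidx`) whose first column is the contig name; all file names and contents are made-up toy data, not the chicken reference.

```shell
# Hypothetical toy files: a reference index with one contig,
# and a VCF whose records use two contigs.
printf '1\t1000\t3\t60\t61\n' > toy_ref.fa.fai
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > toy_all.vcf
printf '1\t10\trs1\tC\tT\t.\t.\t.\n' >> toy_all.vcf
printf '33\t20\trs2\tG\tA\t.\t.\t.\n' >> toy_all.vcf

# Contigs named in the reference index (column 1 of the .fai).
cut -f1 toy_ref.fa.fai | sort -u > ref_contigs.txt
# Contigs named in the VCF records (column 1, header lines skipped).
grep -v '^#' toy_all.vcf | cut -f1 | sort -u > vcf_contigs.txt

# Lines found only in the second list: contigs the reference lacks.
comm -13 ref_contigs.txt vcf_contigs.txt > missing_contigs.txt
cat missing_contigs.txt
```

Here the toy VCF uses contig `33`, which the toy reference does not have. An empty `missing_contigs.txt` would suggest that the reference covers every contig named in the VCF.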


I have checked the reference: it contains all the chromosomes and its size is 1.3 GB, so that is probably not the problem.


Then please post the whole GATK error output. And please keep the posts organized, as the moderators have asked. That is also to your advantage, because that way other people looking at your post can easily follow the conversation and give some useful advice.
