SNP filtering
1
0
Entering edit mode
8 weeks ago
rheab1230 ▴ 20

Hello everyone,

I am trying to filter .vcf files to contain only single point SNP and remove indel.

I have remove indel from the .vcf file by using the following command:

grep -v \$'\t-\t' in.vcf > no_indels.vcf


I am able to remove indel from .vcf file.

But I also want to remove any SNP corresponding to CAT --> C and C --> CTATGG

I just want my .vcf file to contain only A-T, or T-A or G-C.

I mean just one single point mutation only.

files .vcf SNP • 972 views
1
Entering edit mode
8 weeks ago
Ram 34k

Please don't use grep to do this. Use a reliable VCF parser instead.

bcftools view -v snps should do the trick.

0
Entering edit mode

Okay,

I have use bcftools as mentioned to filter out SNP from .vcf files.

I got the output file. but I also received some message after running the command.

[W::bcf_hdr_check_sanity] GL should be declared as Number=G
[W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse_format] FORMAT 'PP' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'BD' is not defined in the header, assuming Type=String

0
Entering edit mode

These warnings can be ignored most of the time, but it's usually better to fix them. For example, your header seems to be improper. If you're working on uncompressed .vcf files directly, try running

bgzip in.vcf
tabix -p vcf in.vcf.gz


after installing htslib. This will make sure your VCF header is proper.

0
Entering edit mode

Okay, I am able to do it. Thank You. Do you also know how to update id coloumn in .vcf file with rsid in dbSNP?

0
Entering edit mode

Your question on this topic was already answered by Wouter, I believe. Did that solution not work for you? Explore bcftools annotate if you'd like another option.

0
Entering edit mode

No, Its not working. the output file is not as required by predixcan. I am exploring bcftools annotate option But i want to know how should i download dbSNP142.vcf file. bcftools annotate function require dbSNP.vcf file but I have dbSNP.txt file.

0
Entering edit mode

I actually use bcftools to remove indels from .vcf files.as you mentioned. It worked. now my .vcf files contain only SNP. Then i thought of using annovar software to annotate my .vcf file with dbSNP142. but its getting annotated and its updated with rsid. but now indel is getting incorporated in my file

0
Entering edit mode

That does not make sense - if you're trying to annotate an SNPs only input file, the annotation process cannot add new entries (which is what would need to happen for you to see indels again).

0
Entering edit mode

You can annotate using a text file as well. NCBI's FTP site will have the dbSNP VCF file, but you should be able to use a tab delimited file with bcftools annotate

0
Entering edit mode

I am using bcftools. the code is:

#!/bin/bash
bcftools annotate -c ID \
-a 00-All.vcf.gz \
> file1.vcf


the error is:

Failed to open GEUVADIS.chr1.genotype.vcf.gz: not compressed with bgzip


So then i created the file using bgzip.

the code is :

bgzip GEUVADIS.chr1.genotype.vcf


then i run the code again:

the code:

!/bin/bash
bcftools annotate -c ID \ -a 00-All.vcf.gz \ GEUVADIS.chr1.genotype.vcf.gz \ > file1.vcf


Now i am getting this error:

[W::bcf_hdr_check_sanity] GL should be declared as Number=G
Failed to open 00-All.vcf.gz: could not load index

0
Entering edit mode

Please try tabix -p vcf instead of just tabix.

Please see how I've combined and formatted your comments so it's easier to read - you can also use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (text becomes text), or select a chunk of text and use the highlighted button to format it as a code block. If your code has long lines with a single command, break those lines into multiple lines with proper escape sequences so they're easier to read and still run when copy-pasted.

0
Entering edit mode

I now use tabix -p GEUVADIS.chr1.genotype.vcf.gz but still getting error: unrecognized preset 'GEUVADIS.chr1.genotype.vcf.gz'

0
Entering edit mode

Please pay close attention to the command I wrote initially. The correct command is tabix -p vcf GEUVADIS.chr1.genotype.vcf.gz The -p vcf asks tabix to use the vcf preset, the vcf part is not a stand in for the actual vcf file.

0
Entering edit mode

I didn't understand. I am sorry but i am completely new to bioinformatics. Should i use the following command: tabix -p GEUVADIS.chr1.genotype.vcf.gz GEUVADIS.genotype.vcf.gz(all the combined vcf files)?

0
Entering edit mode
tabix -p vcf GEUVADIS.chr1.genotype.vcf.gz


Type that exactly

0
Entering edit mode

I am able to do the above command and not receive any error.

Then when I use annotate function of bcftools its showing error. the command:

bcftools annotate -c ID -a 00-All.vcf.gz GEUVADIS.chr2.genotype.vcf.gz


the error:

[W::bcf_hdr_check_sanity] GL should be declared as Number=G
[W::vcf_parse_format] FORMAT 'PP' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'BD' is not defined in the header, assuming Type=String
Encountered error, cannot proceed. Please check the error output above.

0
Entering edit mode

Those are warnings and should not kill the process, but they indicate there's something wrong with the VCF file. Can you check the VCF headers and see what's going on?

I'm afraid this thread is becoming unmanageable and quite niche, and I won't be able to help you much longer with this.