vcf file SNP position extraction
0
0
Entering edit mode
4.2 years ago
evelyn ▴ 230

Hello,

I extracted a specific SNP from vcf with multiple samples using tabix and then I realized there are some sample names and their respective calls do not have a tab space between them and appear to be in one cell in excel. Is there a way in linux to solve this problem rather than doing it manually? e.g., these two calls have no tab space in vcf and in excel they appear in one cell rather than two cells:

0/0:10:11,0:11:379:0:0:0,-3.31,-34.44 0/0:9:9,0:9:337:0:0:0,-2.70,-30.6

Thank you!

SNP • 704 views
ADD COMMENT
0
Entering edit mode

VCF values are always separated by tabs, so the problem is probably related to copy-pasting from Linux to Windows. You could probably get around it by copying the file with some software like mobaXterm or WinSCP (rather than text copy-paste), or by doing the analysis on Linux (without using Excel). Can you explain what you're trying to achieve exactly?

ADD REPLY
0
Entering edit mode

I tried to view it in Text wrangler and then copy in excel but it's still the same. I want to see if there is any significant difference between reference and alternative calls between samples for one SNP. Traditionally, I copy the SNP information in excel, convert the calls to R (0/0), A (0/1) and H (1/1) and run a t-test.

ADD REPLY
0
Entering edit mode

You could use bcftools query to get what you need. bcftools query -f [%GT\t]\n would get you all the GT information tab separated from the VCF. This result might be easier to work with on Excel.

I don't see how a VCF was generated with mixed white spaces. Something probably went wrong during a copy-paste operation. For example, iTerm2 and TextWrangler (I think) have options to auto-convert tabs to spaces. Use the shell directly and skip any copy-paste operations for the time being.

ADD REPLY
0
Entering edit mode

I used bcftools query and it changed it to C/C C/C C/C C/C format. how can I change calls 0/0 to R, 0/1 to H and 1/1 to A without changing vcf file format. I did it with awk earlier but it changes the file format.

ADD REPLY
0
Entering edit mode

You probably used %TGT and not %GT. The latter would give you FORMAT/GT as-is.

You want the information, the format does not matter. VCF files and Excel don't get along well. Use the tabular output from bcftols query, forget about retaining VCF format.

ADD REPLY
0
Entering edit mode

Ultimately, I wanted the translated version of calls to do the significance test.

ADD REPLY
0
Entering edit mode

If you wish to have your cake and eat it too, you'll need the %REF and %ALT fields, and then you'll need to look for corresponding REF/REF, REF/ALT and ALT/ALT combinations with awk.

ADD REPLY
0
Entering edit mode

Yes, I have REF and ALT fields. I wanted to change all the calls to R, H and A based on their respective REF and ALT fields. I tried awk but it changes to single letter REF and ALT calls only instead of R and A. Moreover, I am not sure how to deal with hetero calls.

e.g., ch1 109 . A C A A A A A A A/C C C C C . A A However, I want it like: ch1 109 . A C R R R R R R H A A A A . R R

ADD REPLY
0
Entering edit mode

Compare each NF>5 field to $3 and $4. If $6 (for example) is $3/$3, replace it with R. Do a similar thing to get H and A.

ADD REPLY
0
Entering edit mode
ADD REPLY

Login before adding your answer.

Traffic: 3067 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6