vcf file SNP position extraction
0
0
Entering edit mode
21 months ago
evelyn ▴ 170

Hello,

I extracted a specific SNP from vcf with multiple samples using tabix and then I realized there are some sample names and their respective calls do not have a tab space between them and appear to be in one cell in excel. Is there a way in linux to solve this problem rather than doing it manually? e.g., these two calls have no tab space in vcf and in excel they appear in one cell rather than two cells:

0/0:10:11,0:11:379:0:0:0,-3.31,-34.44 0/0:9:9,0:9:337:0:0:0,-2.70,-30.6


Thank you!

SNP • 274 views
0
Entering edit mode

VCF values are always separated by tabs, so the problem is probably related to copy-pasting from Linux to Windows. You could probably get around it by copying the file with some software like mobaXterm or WinSCP (rather than text copy-paste), or by doing the analysis on Linux (without using Excel). Can you explain what you're trying to achieve exactly?

0
Entering edit mode

I tried to view it in Text wrangler and then copy in excel but it's still the same. I want to see if there is any significant difference between reference and alternative calls between samples for one SNP. Traditionally, I copy the SNP information in excel, convert the calls to R (0/0), A (0/1) and H (1/1) and run a t-test.

0
Entering edit mode

You could use bcftools query to get what you need. bcftools query -f [%GT\t]\n would get you all the GT information tab separated from the VCF. This result might be easier to work with on Excel.

I don't see how a VCF was generated with mixed white spaces. Something probably went wrong during a copy-paste operation. For example, iTerm2 and TextWrangler (I think) have options to auto-convert tabs to spaces. Use the shell directly and skip any copy-paste operations for the time being.

0
Entering edit mode

I used bcftools query and it changed it to C/C C/C C/C C/C format. how can I change calls 0/0 to R, 0/1 to H and 1/1 to A without changing vcf file format. I did it with awk earlier but it changes the file format.

0
Entering edit mode

You probably used %TGT and not %GT. The latter would give you FORMAT/GT as-is.

You want the information, the format does not matter. VCF files and Excel don't get along well. Use the tabular output from bcftols query, forget about retaining VCF format.

0
Entering edit mode

Ultimately, I wanted the translated version of calls to do the significance test.

0
Entering edit mode

If you wish to have your cake and eat it too, you'll need the %REF and %ALT fields, and then you'll need to look for corresponding REF/REF, REF/ALT and ALT/ALT combinations with awk.

0
Entering edit mode

Yes, I have REF and ALT fields. I wanted to change all the calls to R, H and A based on their respective REF and ALT fields. I tried awk but it changes to single letter REF and ALT calls only instead of R and A. Moreover, I am not sure how to deal with hetero calls.

e.g., ch1 109 . A C A A A A A A A/C C C C C . A A However, I want it like: ch1 109 . A C R R R R R R H A A A A . R R

0
Entering edit mode

Compare each NF>5 field to $3 and $4. If $6 (for example) is $3/\$3, replace it with R. Do a similar thing to get H and A.

0
Entering edit mode