Hi! I'm very new to working with large .vcf files, and I'm trying to split a particular file by strain (sample). There are about 1500 samples in the file, so running the command manually for each one isn't really an option (although it does work when I do that). My problem is with looping the process. This is the loop I've been using:
for sample in `bcftools query -l <FILE_NAME>`; do bcftools view -c1 -Oz -s $sample -o <PATH>/$sample.vcf.gz <FILE_NAME>; done
This iterates through each sample, but for every sample except the final one in the file it produces the following error (copied exactly as it appears in my terminal):
". Use "--force-samples" to ignore this error.exist in header: "<SAMPLE>
I've also tried writing the sample names to a text file and then reading from that file:
while read sample; do bcftools view -c1 -Oz -s $sample -o <PATH>/$sample.vcf.gz <FILE_NAME>; done < samples.txt
Reading through the documentation for view, I'm under the impression that --force-samples won't fix my problem, because it will just skip the samples that bcftools doesn't think exist (which appears to be pretty much all of them). And my rather frenzied Googling of the error has brought up nothing of note.
For completeness' sake, I've also tried using bcftools plugin split <FILE_NAME> -Oz -o <PATH>, although this fails because my department's cluster cannot open that many files at once.
What confuses me is that the command works on individual samples. Even if I put the sample name in a separate variable and use it as $sample, it works. It's only looping that seems to make it fail. So I'm at a loss: I'm guessing there's something wrong with how the sample names are formatted within the command, but my understanding of bash is not good enough to figure out why.
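One way the formatting guess could be checked is sketched below. It assumes (and this is only an assumption) that the sample names carry a hidden trailing carriage return, which would be consistent with the error text looking as if the end of the message had been printed over its beginning. The helper names strip_cr and show_hidden and the sample name strainA are hypothetical, made up for illustration.

```bash
# Hypothetical helper: remove carriage returns from a sample name.
strip_cr() {
    printf '%s' "$1" | tr -d '\r'
}

# Hypothetical helper: dump a name character by character,
# so that a hidden carriage return shows up as \r.
show_hidden() {
    printf '%s' "$1" | od -c
}

# Demo with a made-up sample name ending in a carriage return.
raw=$(printf 'strainA\r')
clean=$(strip_cr "$raw")

show_hidden "$raw"
printf 'raw length: %d, clean length: %d\n' "${#raw}" "${#clean}"

# Assumed use in the loop (not run here; placeholders as in the post):
# bcftools query -l <FILE_NAME> | tr -d '\r' | while read -r sample; do
#     bcftools view -c1 -Oz -s "$sample" -o <PATH>/"$sample".vcf.gz <FILE_NAME>
# done
```

If the character dump of a real sample name shows no \r or other unexpected character, this assumption is wrong and the cause lies elsewhere.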
If anyone knows what might be wrong, or can provide an alternative (and admittedly more up-to-date) answer, I'd be thankful!
(Also, first post, yada yada, so apologies if I've done something wrong, and please let me know so I can fix it!)