Question

Automate the Splitting of a VCF File by Sample (bcftools)

1

Entering edit mode

7 months ago

RogueBiochemist • 0

Hi! I'm very new to working with large .vcf files, and am trying to split up a particular file by strain (sample). There are about 1500 samples in the file, so going through manually isn't really an option (although I have managed to get it to work). My problem has been with trying to loop the process.

Having read through similar posts (here and here, specifically this reply), I attempted to run the following from the working directory:

for sample in `bcftools query -l <FILE_NAME>`; do
    bcftools view -c1 -Oz -s $sample -o <PATH>/$sample.vcf.gz <FILE_NAME> 
done

Which iterates through each sample. But for every sample (excepting the final one in the file), it produces the following error - ". Use "--force-samples" to ignore this error.exist in header: "<SAMPLE>. I've also tried getting the sample names into a text file, and then reading from the text file:

while read sample; do
   bcftools view -c1 -Oz -s $sample -o <PATH>/$sample.vcf.gz <FILE_NAME> 
done < samples.txt

Reading through the documentation for view, I'm under the impression that --force-samples won't fix my problem, because that'll just ignore samples that bcftools doesn't think exists (which appears to be pretty much all of them). And my rather frenzied Googling the error has brought up nothing of note.

For completeness sake, I've also tried using bcftools plugin split <FILE_NAME> -Oz -o <PATH>, although this fails because my department's cluster cannot open that many files at once. What confuses me is that the command works on individual samples - even if I set it at a separate variable to use the $sample, it'll work. It's just looping it that seems to make it fail? So I'm at a loss - I'm guessing there's something wrong with how the sample names are being formatted within the command, but my understanding of bash is not good enough to figure out why.

If anyone knows what might be wrong, or can provide an alternative (and admittedly more up-to-date) answer, I'd be thankful!

(Also, first post, yada yada, so apologies if I've done something wrong, and please let me know so I can fix it!)

bcftools vcf • 1.5k views

ADD COMMENT • link updated 6 months ago by Ram 43k • written 7 months ago by RogueBiochemist • 0

1

Entering edit mode

Just a comment re: your bash example reading from a file. Have you tried while instead of 'with'?

while read sample; do
   bcftools view -c1 -Oz -s $sample -o <PATH>/$sample.vcf.gz <FILE_NAME> 
done < samples.txt

ADD REPLY • link 7 months ago by cfos4698 ★ 1.1k

0

Entering edit mode

Gah, you're right! It's what I put originally, aye - when I wrote this thread I must've had Python on the brain! I'll update the code, thanks (for note, it doesn't change anything)

ADD REPLY • link 7 months ago by RogueBiochemist • 0

1

Entering edit mode

7 months ago

Ram 43k

The manual says you can split samples out at once using the +split plugin. So, bcftools +split -Oz -o <PATH> file.vcf.gz should do the trick. The question now is - how can you use the -i or -e to keep only sites that pass the -c1 filter. I'd say use -i "INFO/AC>=1". So, try:

bcftools +split -i "INFO/AC>=1" -Oz -o <PATH> file.vcf.gz

You can even specify samples to retain and/or rename, combine, etc in a detailed manner using the -S parameter:

   -S, --samples-file FILE         List of samples to keep with up to three columns, one line per output file:

                                       # Create two output files, the first sample is the basename
                                       # of the new file
                                       sample1
                                       sample2,sample3

                                       # Optional second column to rename the samples
                                       sample1           new-name2
                                       sample2,sample3   new-name2,new-name3

                                       # Optional third column to provide output file base name, use dash "-"
                                       # to keep sample names unchanged
                                       sample1           new-name1   file1
                                       sample2,sample3   -           file2

ADD COMMENT • link 7 months ago by Ram 43k

0

Entering edit mode

Thanks! To be fair, keeping -c1 shouldn't be a massive issue, because this only the first step in a pipeline that'll eliminate most of the variants anyways. Unfortunately, +split also doesn't co-operate on my cluster (same with plugin split) - the way it functions is apparently similar enough to the former that the same error occurs

ADD REPLY • link 6 months ago by RogueBiochemist • 0

1

Entering edit mode

I think +split and plugin split are not just similar, they're the same. What do you mean by "doesn't co-operate on my cluster"? What version of bcftools are you using?

ADD REPLY • link 6 months ago by Ram 43k

1

Entering edit mode

7 months ago

Pierre Lindenbaum 161k

My problem has been with trying to loop the process.

ok here is a nextflow based solution,(NOT tested)

workflow {
   ch1 = EXTRACT_SAMPLES(params.vcf);
   EXTRACT_SAMPLE( ch1.output.splitText().map{it.trim()}, params.vcf)
}

process EXTRACT_SAMPLES {
input:
val(vcf)
output:
path("samples.txt"),emit:output
script:
"""
bcftools query -l "${vcf}" > samples.txt
"""
}


process EXTRACT_SAMPLE {
input:
   tuple val(sample),val(vcf)
output:
path("${sample}.bcf"),emit:output
script:
"""
bcftools view --samples "${sample}"  -O u   "${vcf}"|  bcftools view -i 'AC>0' -O b -o '${sample}.bcf'
"""
}

ADD COMMENT • link 7 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you! Thankfully I found a really easy solution (forceful removal of the any control characters), but the help is greatly appreciated!

ADD REPLY • link 6 months ago by RogueBiochemist • 0

score 1 · Accepted Answer · 2023-10-01

1

Entering edit mode

6 months ago

RogueBiochemist • 0

In the end, changing $sample to ${sample//[$'\r\n']} (i.e. replace any carriage return or newline characters with '') worked!

while read sample; do
   bcftools view -c1 -Oz -s ${sample//[$'\r\n']} -o <PATH>/${sample//[$'\r\n']}.vcf.gz <FILE_NAME> 
done < samples.txt

ADD COMMENT • link 6 months ago by RogueBiochemist • 0

1

Entering edit mode

Might be an OS problem, this might bypass the need to manually remove new line characters:

dos2unix samples.txt
while read sample; do
   bcftools view -c1 -Oz -s ${sample//[$'\r\n']} -o <PATH>/${sample}.vcf.gz <FILE_NAME> 
done < samples.txt

ADD REPLY • link 6 months ago by Ram 43k