Automate the Splitting of a VCF File by Sample (bcftools)
3
1
Entering edit mode
7 months ago

Hi! I'm very new to working with large .vcf files, and am trying to split up a particular file by strain (sample). There are about 1500 samples in the file, so going through manually isn't really an option (although I have managed to get it to work). My problem has been with trying to loop the process.

Having read through similar posts (here and here, specifically this reply), I attempted to run the following from the working directory:

for sample in `bcftools query -l <FILE_NAME>`; do
    bcftools view -c1 -Oz -s $sample -o <PATH>/$sample.vcf.gz <FILE_NAME> 
done

Which iterates through each sample. But for every sample (excepting the final one in the file), it produces the following error - ". Use "--force-samples" to ignore this error.exist in header: "<SAMPLE>. I've also tried getting the sample names into a text file, and then reading from the text file:

while read sample; do
   bcftools view -c1 -Oz -s $sample -o <PATH>/$sample.vcf.gz <FILE_NAME> 
done < samples.txt

Reading through the documentation for view, I'm under the impression that --force-samples won't fix my problem, because that'll just ignore samples that bcftools doesn't think exists (which appears to be pretty much all of them). And my rather frenzied Googling the error has brought up nothing of note.

For completeness sake, I've also tried using bcftools plugin split <FILE_NAME> -Oz -o <PATH>, although this fails because my department's cluster cannot open that many files at once. What confuses me is that the command works on individual samples - even if I set it at a separate variable to use the $sample, it'll work. It's just looping it that seems to make it fail? So I'm at a loss - I'm guessing there's something wrong with how the sample names are being formatted within the command, but my understanding of bash is not good enough to figure out why.

If anyone knows what might be wrong, or can provide an alternative (and admittedly more up-to-date) answer, I'd be thankful!

(Also, first post, yada yada, so apologies if I've done something wrong, and please let me know so I can fix it!)

bcftools vcf • 1.5k views
ADD COMMENT
1
Entering edit mode

Just a comment re: your bash example reading from a file. Have you tried while instead of 'with'?

while read sample; do
   bcftools view -c1 -Oz -s $sample -o <PATH>/$sample.vcf.gz <FILE_NAME> 
done < samples.txt
ADD REPLY
0
Entering edit mode

Gah, you're right! It's what I put originally, aye - when I wrote this thread I must've had Python on the brain! I'll update the code, thanks (for note, it doesn't change anything)

ADD REPLY
1
Entering edit mode
6 months ago

In the end, changing $sample to ${sample//[$'\r\n']} (i.e. replace any carriage return or newline characters with '') worked!

while read sample; do
   bcftools view -c1 -Oz -s ${sample//[$'\r\n']} -o <PATH>/${sample//[$'\r\n']}.vcf.gz <FILE_NAME> 
done < samples.txt
ADD COMMENT
1
Entering edit mode

Might be an OS problem, this might bypass the need to manually remove new line characters:

dos2unix samples.txt
while read sample; do
   bcftools view -c1 -Oz -s ${sample//[$'\r\n']} -o <PATH>/${sample}.vcf.gz <FILE_NAME> 
done < samples.txt
ADD REPLY
1
Entering edit mode
7 months ago
Ram 43k

The manual says you can split samples out at once using the +split plugin. So, bcftools +split -Oz -o <PATH> file.vcf.gz should do the trick. The question now is - how can you use the -i or -e to keep only sites that pass the -c1 filter. I'd say use -i "INFO/AC>=1". So, try:

bcftools +split -i "INFO/AC>=1" -Oz -o <PATH> file.vcf.gz

You can even specify samples to retain and/or rename, combine, etc in a detailed manner using the -S parameter:

   -S, --samples-file FILE         List of samples to keep with up to three columns, one line per output file:

                                       # Create two output files, the first sample is the basename
                                       # of the new file
                                       sample1
                                       sample2,sample3

                                       # Optional second column to rename the samples
                                       sample1           new-name2
                                       sample2,sample3   new-name2,new-name3

                                       # Optional third column to provide output file base name, use dash "-"
                                       # to keep sample names unchanged
                                       sample1           new-name1   file1
                                       sample2,sample3   -           file2
ADD COMMENT
0
Entering edit mode

Thanks! To be fair, keeping -c1 shouldn't be a massive issue, because this only the first step in a pipeline that'll eliminate most of the variants anyways. Unfortunately, +split also doesn't co-operate on my cluster (same with plugin split) - the way it functions is apparently similar enough to the former that the same error occurs

ADD REPLY
1
Entering edit mode

I think +split and plugin split are not just similar, they're the same. What do you mean by "doesn't co-operate on my cluster"? What version of bcftools are you using?

ADD REPLY
1
Entering edit mode
7 months ago

My problem has been with trying to loop the process.

ok here is a nextflow based solution,(NOT tested)

workflow {
   ch1 = EXTRACT_SAMPLES(params.vcf);
   EXTRACT_SAMPLE( ch1.output.splitText().map{it.trim()}, params.vcf)
}

process EXTRACT_SAMPLES {
input:
val(vcf)
output:
path("samples.txt"),emit:output
script:
"""
bcftools query -l "${vcf}" > samples.txt
"""
}


process EXTRACT_SAMPLE {
input:
   tuple val(sample),val(vcf)
output:
path("${sample}.bcf"),emit:output
script:
"""
bcftools view --samples "${sample}"  -O u   "${vcf}"|  bcftools view -i 'AC>0' -O b -o '${sample}.bcf'
"""
}
ADD COMMENT
0
Entering edit mode

Thank you! Thankfully I found a really easy solution (forceful removal of the any control characters), but the help is greatly appreciated!

ADD REPLY

Login before adding your answer.

Traffic: 1764 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6