How to keep only the samples in a list in a VCF.gz file?
3.4 years ago
DanielC ▴ 140

Dear Friends,

I have a list of 8000 samples in a file "samples.txt":

samples.txt:
TCGA..barcode..
TCGA..barcode..
.
.


I am using bcftools to keep only these samples in the vcf.gz file, which has 10000 samples. In other words, I want to keep the 8000 samples listed in "samples.txt" and remove the remaining 2000. I ran:

bcftools -S samples.txt vcf.gz -o filtered-vcf.vcf

It gives me this error:

[E::main] unrecognized command -S

Could you please suggest what the issue could be here, and how I can do the above? Thanks much.

vcf samples bcftools • 4.2k views
3.4 years ago

The subcommand 'view' is missing:

bcftools view -S samples.txt  -o filtered-vcf.vcf  vcf.gz
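If the subset should stay compressed and indexed, a variant along these lines may be useful. This is only a sketch: `subset_samples` is a hypothetical wrapper name and the file names are placeholders. `-Oz` asks bcftools for bgzip-compressed VCF output, and `bcftools index -t` builds a tabix index for it.

```shell
# Hypothetical wrapper: subset_samples LIST INPUT.vcf.gz OUTPUT.vcf.gz
# Keeps only the samples named in LIST (one name per line), writes
# bgzip-compressed VCF, then builds a tabix (.tbi) index for the result.
subset_samples() {
    bcftools view -S "$1" -Oz -o "$3" "$2" &&
    bcftools index -t "$3"
}

# Usage (assumes bcftools is installed; file names are placeholders):
#   subset_samples samples.txt input.vcf.gz filtered.vcf.gz
```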


Thanks much Pierre! I ran this; however, it showed an error:

Error: subset called for sample that does not exist in header "TCGA..."

If I am right, this means that the "TCGA.." sample listed in "samples.txt" is not present in the vcf.gz file? So I used "--force-samples" to skip such samples, and it runs now.
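Before relying on --force-samples, it can help to see exactly which names from the list are absent from the VCF header. The sketch below compares two plain-text name lists; `missing_samples` is a hypothetical helper name and the file names are placeholders. The VCF side of the comparison would come from `bcftools query -l`, which prints the sample names in the header.

```shell
# Hypothetical helper: print names listed in $1 that are absent from $2.
# Both arguments are plain-text files with one sample name per line.
missing_samples() (
    LC_ALL=C; export LC_ALL          # same byte-wise ordering for sort and comm
    tmp=$(mktemp) || exit 1
    sort -u "$2" > "$tmp"
    # comm -23 prints lines found only in the first (sorted) input
    sort -u "$1" | comm -23 - "$tmp"
    rm -f "$tmp"
)

# Usage (assumes bcftools is installed; file names are placeholders):
#   bcftools query -l input.vcf.gz > vcf_samples.txt
#   missing_samples samples.txt vcf_samples.txt
```

An empty result would mean every listed sample is in the header, so --force-samples should not be needed.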


I used a similar command, bcftools view -S samplelist.txt input_file.vcf.gz -o newfiltered.vcf.gz, to subset samples from a compressed VCF file, but got the message [w::bcf_sr_add_reader] No BGZF EOF markers; file 'input_file.vcf.gz' may be truncated. I had to abort the run since I don't understand what this message means. Could someone help me, please?

3.0 years ago
DanielC ▴ 140

Hi mab658,

This issue usually arises when the VCF file was not completely uploaded to or downloaded from the source, for example because of a network interruption or some other technical problem. Try downloading the whole file again and rerun the command; it should then work.
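The message refers to the empty 28-byte block that bgzip writes at the very end of a BGZF file. A quick way to check for it before and after re-downloading is to compare the last 28 bytes of the file with that marker. This is a sketch only: `has_bgzf_eof` is a hypothetical helper name, and a passing check does not prove that the rest of the file is intact.

```shell
# Hypothetical helper: succeeds if FILE ends with the 28-byte BGZF EOF marker.
has_bgzf_eof() (
    # Marker bytes as given in the SAM/BAM specification, spaces stripped
    expected=$(printf '%s' "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 00 00 00 00 00 00 00 00" | tr -d ' ')
    # Hex-dump the last 28 bytes of the file and strip spaces and newlines
    actual=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
    [ "$actual" = "$expected" ]
)

# Usage (file name is a placeholder):
#   has_bgzf_eof input_file.vcf.gz && echo "EOF marker present" || echo "EOF marker missing"
```

If the source publishes a checksum for the file, comparing it is a stronger test of a complete download.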

Deepak

3 months ago
HL • 0

What if I have a sample.txt file with 10 000 samples and a vcf.gz that I know has fewer samples than the txt file? How could I see which of the txt file samples are in my VCF, and how many of them there are? The VCF might also contain samples that are not in the txt file.

samples.txt: 11111 22222 33333 44444 55555 66666 77777

Samples from vcf.gz: 11111 22222 34567 66666 56789

The output file that I want is: 11111 22222 66666

And then of course I can get the count easily with wc -l.

Running bcftools view -S sample.txt vcf.gz --force-samples > outfile gives a file that also seems to contain some samples from the sample file that were not in the vcf.gz in the first place. The output that I got: 11111 22222 33333 44444 55555 66666 77777


Actually it does work correctly, so there is no problem anymore. I just hadn't noticed that those few samples appear later in the vcf.gz file, so it does keep only the overlap, as expected.
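The overlap can also be listed without writing a new VCF, which makes this kind of check easier. A minimal sketch, assuming both sides are available as plain-text lists with one name per line; `common_samples` is a hypothetical helper name and the file names are placeholders.

```shell
# Hypothetical helper: print the names from $2 that also appear in $1.
# grep -F treats each name as a fixed string, -x requires a whole-line
# match, and -f reads the names to look for from the file $1.
common_samples() {
    grep -Fxf "$1" "$2"
}

# Usage (assumes bcftools is installed; file names are placeholders):
#   bcftools query -l input.vcf.gz > vcf_samples.txt
#   common_samples sample.txt vcf_samples.txt > overlap.txt
#   wc -l < overlap.txt
```

The names come out in the order they have in the second file, that is, in VCF header order in the usage above.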