How to filter a VCF file with a list of CHR or contig IDs?
4
0
Entering edit mode
3.8 years ago
acoles ▴ 10

I need to subset/filter a SNP vcf file by a long list of non-sequential contig IDs, which appear in the CHR column. My VCF file contains 13,971 contigs currently, and I want to retain a specific set of 7,748 contigs and everything associated with those contigs (headers, all variants and genotype information etc.).

My contig list looks like:

dDocent_Contig_1

dDocent_Contig_100

dDocent_Contig_10000 etc.

I am considering the following script:

vcftools --vcf TotalRawSNPs.vcf --chr dDocent_Contig_1 --chr dDocent_Contig_100 (etc...) --recode --recode-INFO-all --out FinalRawSNPs

where I list every contig ID individually with a --chr flag before. For this --chr flag, I cannot feed it a text file of contig IDs to keep, which would be ideal. If I list all contigs individually, it'll create a massive script in the command line.

I’ve seen options for filtering by a list of individuals, but not any clear option for filtering by CHR/contig IDs only. Is there a more efficient way to filter my vcf file by CHR/contig?

The solution does not have to be strictly vcftools related. I'm open to exploring suitable awk/mawk/grep (etc.) options as well.

vcftools vcf chromosome contigs awk • 9.9k views
ADD COMMENT
2
Entering edit mode
3.8 years ago
drkennetz ▴ 560

If you had one contig per line (I'll call it contigs.txt) you could do a simple bash script using what you posted above:

cat contigs.txt | while read CONTIG; do
    vcftools --vcf TotalRawSNPs.vcf --chr $CONTIG --recode --recode-INFO-all --out $CONTIG.out
done

cat *.out >> FinalRawSNPs.vcf
rm -rf *.out

This would run the script for each contig in the list. I'm sure there are other ways to do this but this is a simple enough way that should be fairly fast.

ADD COMMENT
1
Entering edit mode

I have a question regarding this solution:

after concatenating all *.out into one file (FinalRawSNPs.vcf), is it really a vcf? Especially, the headers from *.out would be all over the lines. Maybe, one has to use bcftools merge like in this answer

ADD REPLY
0
Entering edit mode

Oh, that's a good point I didn't really think of! I think rather than catting you'd need to join them back together with some tool that understands the vcf format, like bcftools.

Thanks!

ADD REPLY
1
Entering edit mode

Hi, thank you both for the very helpful comments! This bash script worked out well. Since I was using vcf files, after creating the CHR output files, I bgzipped then indexed them all before using bcftools concat to combine all of the CHR files back together. Created some messy long text list of Contig IDs towards the top of the final concatenated VCF file, but all the information, headers and INFO flags were retained just fine. Thanks!

ADD REPLY
1
Entering edit mode
2.5 years ago
vkkodali_ncbi ★ 3.7k

Using bcftools, this can be done as follows:

## -r, --regions <region>  restrict to comma-separated list of regions
bcftools view -r 1,12,16 <in.vcf.gz> 

## -R, --regions-file <file>  restrict to regions listed in a file
bcftools view -R <regions.txt> <in.vcf.gz>
ADD COMMENT
0
Entering edit mode

I tried using bcftools

But the following problem persists:

For the first suggestion using -r, the list of over 10,000 region is too much for the program to process

For the second suggestion using -R, I get the following error:

[E::bcf_sr_regions_init] Could not parse 1-th line of file /path/to/region/file, using the columns 1,2[,-1]
Failed to read the regions: /path/to/region/file

Consists of just a list of scaffolds:

Scaffold1
Scaffold2
Scaffold3

I suspect that a 2nd column indicating the positions will be required, as specified in the manual: The columns of the tab-delimited file can contain either positions (two-column format) or intervals (three-column format): CHROM, POS, and, optionally, END, where positions are 1-based and inclusive.

However, I do not have such information and am interested in keeping all positions on interested scaffolds.

ADD REPLY
1
Entering edit mode

I believe the expectation is that the second and the third columns have the start and end positions for the range. In this instance, since you want the entire scaffold, they will be 1 and the length of the scaffold, respectively. You have a few choices to create this file:

  1. Manually -- most likely the least plausible/recommended choice
  2. If you have the genome FASTA, you can generate the FASTA index using something like samtools faidx <genome.fa> and use the first two columns of the resulting fai index file -- the first column is seq-id and the second column is the total length. You will need to use something like awk to make it look like a regions file.
  3. If you have a BAM file with the alignments you can extract the header for this information. the @SQ lines' second column is the sequence name with a SN: prefix and the third column is the length with LN: prefix. You can use samtools to extract the header only and then use awk to munge it to produce a 'regions' file.
ADD REPLY
0
Entering edit mode

So I had tried the 2nd suggestion, but columns 2 and 3 do not indicate start and end of the scaffold. And using column 2 as the end of each scaffold results in some weird results as well. The only way that I got it to work was to use the following, in a tab delimited file, and used as input with the -R flag.

scaffold1 1 99999999
scaffold2 1 99999999
scaffold3 1 99999999

This way all loci will be included for the specified scaffolds, assuming 99999999 exceeds the max scaffold length in the reference.

Regardless, thank you so much for the tips and advice, it helped point me in the right direction!

ADD REPLY
1
Entering edit mode

Sorry, I was wrong about the contents of the fai files -- the second column is the total length of the sequence. You can ignore the rest of the columns. I will edit my previous comment to fix the error.

ADD REPLY
0
Entering edit mode

Hi, I tried the second suggestion by vkkodali as well to remove small contigs (<1000 bps) in my .vcf file. The code worked, but the output had more contigs than the input (~30 times). I used wc -l to check after excluding the header lines. The commands I used is bcftools filter -R <region.txt> <input.vcf.gz> -Oz -o <output.vcf.gz>. The format of the <region.txt> is as below with seq-id from reference_genome.fai file:

>     scaffold1|size1212793 1   99999999
>     scaffold2|size1477633 1   99999999
>     scaffold3|size982680  1   99999999

Would be grateful if anyone can help!

ADD REPLY
0
Entering edit mode
2.5 years ago
Lilnet • 0

I want to do something similar, where I have 11280 scaffolds, but I want to exclude loci from 120 scaffolds.

However I am interested in finding a more direct method in which we get just 1 output from vcftools instead of extracting the loci from each contig and then putting it back into a single file.

My idea will be to use --not-chr and successively pipe the output through the 120 iterations to remove them and produce a file that contains loci from the 11280 - 120 scaffolds. But I am no computing whiz to know how to do that and if anyone is able to point me in the right direction.

Would appreciate if anyone has any suggestion!

ADD COMMENT
0
Entering edit mode
2.5 years ago
sbstevenlee ▴ 480

Late to the party, but here is another solution using Python API and the pyvcf submodule I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['dDocent_Contig_1', 'dDocent_Contig_2', 'dDocent_Contig_3', 'dDocent_Contig_4'],
...     'POS': [100, 100, 100, 100],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'C', 'C'],
...     'ALT': ['A', 'C', 'A', 'T'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'A': ['0/1', '1/1', '0/1', '0/1']
... }
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf')
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
              CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  dDocent_Contig_1  100  .   G   A    .      .    .     GT  0/1
1  dDocent_Contig_2  100  .   T   C    .      .    .     GT  1/1
2  dDocent_Contig_3  100  .   C   A    .      .    .     GT  0/1
3  dDocent_Contig_4  100  .   C   T    .      .    .     GT  0/1
>>> # Get the contig list from a file (option #1)
>>> # contigs = []
>>> # with open('contig.list') as f:
>>> #     for line in f:
>>> #         contigs.append(line.strip())
>>>         
>>> # Get the contig list from a file (option #2)
>>> # from fuc import common
>>> # contigs = common.convert_file2list('contig.list')
>>> 
>>> contigs = ['dDocent_Contig_1', 'dDocent_Contig_3']
>>> vf.df = vf.df[vf.df.CHROM.isin(contigs)]
>>> vf.df
              CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A
0  dDocent_Contig_1  100  .   G   A    .      .    .     GT  0/1
1  dDocent_Contig_3  100  .   C   A    .      .    .     GT  0/1
>>> # vf.to_file('out.vcf')
ADD COMMENT

Login before adding your answer.

Traffic: 1985 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6