Question

Extracting information from VCF file for many specific positions in specific chromosomes


11 months ago

mohsamir2016 ▴ 30

Dear all, I have an Excel file that I created from a VCF file of SNPs common to 6 samples. The Excel file contains only the chromosomes and positions of the SNPs (see example Table 1). Now I would like to obtain the other information (e.g. alleles, genotype, depth, etc.) from the VCF files of the 6 samples (i.e. the ones that contain these positions).
I tried using an awk command like the following, for position 23432 on chr. 1, on the 6 files:

awk -F " " '$1=="1" && $2=="23432"' file1.vcf
awk -F " " '$1=="1" && $2=="23432"' file2.vcf
awk -F " " '$1=="1" && $2=="23432"' file3.vcf
awk -F " " '$1=="1" && $2=="23432"' file4.vcf
awk -F " " '$1=="1" && $2=="23432"' file5.vcf
awk -F " " '$1=="1" && $2=="23432"' file6.vcf

The issue is that I have thousands of SNP positions, so I need an automated way to do this.

Could you advise on that?

Thanks

SNP RNA GATK seq • 714 views

updated 11 months ago by Pierre Lindenbaum 161k • written 11 months ago by mohsamir2016 ▴ 30

score 0 · Answer 1 · 2023-05-24


11 months ago

Pierre Lindenbaum 161k

See the option --regions-file of bcftools view.



I looked into bcftools view -R, but I could not understand it from the documentation. Could you please give me some example code that I can run to test the results?

Thanks

Reply • 11 months ago by mohsamir2016 ▴ 30


What don't you understand from the documentation?

Regions can be specified either on command line or in a VCF, BED, or tab-delimited file (the default). The columns of the tab-delimited file can contain either positions (two-column format: CHROM, POS) or intervals (three-column format: CHROM, BEG, END), but not both. Positions are 1-based and inclusive. The columns of the tab-delimited BED file are also CHROM, POS and END (trailing columns are ignored), but coordinates are 0-based, half-open. To indicate that a file be treated as BED rather than the 1-based tab-delimited file, the file must have the ".bed" or ".bed.gz" suffix (case-insensitive). Uncompressed files are stored in memory, while bgzip-compressed and tabix-indexed region files are streamed. Note that sequence names must match exactly, "chr20" is not the same as "20". Also note that chromosome ordering in FILE will be respected, the VCF will be processed in the order in which chromosomes first appear in FILE. However, within chromosomes, the VCF will always be processed in ascending genomic coordinate order no matter what order they appear in FILE. Note that overlapping regions in FILE can result in duplicated out of order positions in the output. This option requires indexed VCF/BCF files.

Reply • 11 months ago by Pierre Lindenbaum 161k
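
For readers who want something concrete to run, here is a minimal sketch of one way to use the two-column (CHROM, POS) regions file described in the documentation excerpt. It is not taken from the thread. It assumes the Excel sheet has been exported as a CSV named positions.csv with a header row and the chromosome and position in the first two columns, that the six VCFs are coordinate-sorted and named file1.vcf to file6.vcf as in the question, and that bgzip and bcftools are installed. The names positions.csv, positions.tsv and file*.subset.vcf are placeholders chosen for illustration, and the sample rows written at the top are made up apart from 1:23432.

```shell
#!/bin/sh
# Hypothetical input: the Excel sheet exported as CSV (header + CHROM,POS).
# These rows are illustrative only; 1:23432 is the position from the question.
printf 'CHROM,POS\n1,23432\n1,50100\n2,1200\n' > positions.csv

# Build the tab-delimited regions file (two columns: CHROM, POS; 1-based).
# Drop the header row and any Windows line endings left by Excel.
tail -n +2 positions.csv | tr -d '\r' | tr ',' '\t' > positions.tsv

# Query every VCF with the same regions file.
# --regions-file (-R) requires a compressed, indexed VCF/BCF.
for n in 1 2 3 4 5 6; do
    vcf="file${n}.vcf"
    [ -f "$vcf" ] || continue             # skip files that are not present
    bgzip -c "$vcf" > "$vcf.gz"           # compress, keeping the original
    bcftools index -t "$vcf.gz"           # write a tabix (.tbi) index
    bcftools view -R positions.tsv -o "file${n}.subset.vcf" "$vcf.gz"
done
```

Chromosome names in positions.tsv must match the VCF exactly ("1" is not the same as "chr1"). The loop skips any VCF that is missing, so the CSV-to-TSV conversion can be checked on its own before the real files are in place.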