Question

Grep Position List

0

Entering edit mode

6.9 years ago

BAGeno ▴ 190

Hi,

I have VCF file and list of position file. I want to grep those records from VCF which have position corresponding to position present in position file. For this purpose I used this command.

grep -wfF Position.txt myVCF > insec.vcf

here is the list of position.

Here is the result

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  LP6005441-DNA_G11
chr1    10492   rs55998931      C       T       182.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=2.356;ClippingRankSum=0;DB;DP=4
chr1    10583   rs58108140      G       A       170.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-3.006;ClippingRankSum=0;DB;DP=
chr1    12783   rs62635284      G       A       522.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=4.292;ClippingRankSum=0;DB;DP=2
chr1    13116   rs62635286      T       G       125.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-0.301;ClippingRankSum=0;DB;DP=
chr1    13118   rs62028691      A       G       125.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-1.009;ClippingRankSum=0;DB;DP=
chr1    13273   rs531730856     G       C       675.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-2.577;ClippingRankSum=0;DB;DP=
chr1    13302   rs180734498     C       T       135.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=1.88;ClippingRankSum=0;DB;DP=41
chr1    13417   rs777038595     C       CGAGA   606.73  PASS   AC=1;AF=0.5;AN=2;BaseQRankSum=-2.053;ClippingRankSum=0;DB;DP=
chr1    13896   rs201696125     C       A       195.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-1.19;ClippingRankSum=0;DB;DP=5
chr1    14699   rs62635298      C       G       160.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=1.278;ClippingRankSum=0;DB;DP=3
chr1    14907   rs6682375       A       G       922.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-0.114;ClippingRankSum=0;DB;DP=

But it is giving some extra records as well like record containing 13415,13896. These positions are not present in list of positions. What shall I do to overcome this problem?

grep VCF • 5.2k views

ADD COMMENT • link updated 6.9 years ago by Maxime Lamontagne ★ 2.3k • written 6.9 years ago by BAGeno ▴ 190

0

Entering edit mode

Hi,

Use same command (without -w) by including <tab>YOUR_POSITION<tab> in 'Position.txt' file. This might work.

Example to add tabs:

sed -i 's/^/\t/' Position.txt

sed -i 's/$/\t/' Position.txt

ADD REPLY • link 6.9 years ago by EagleEye 7.5k

0

Entering edit mode

It worked :) Thank you

ADD REPLY • link 6.9 years ago by BAGeno ▴ 190

score 0 · Answer 1 · 2017-05-24

0

Entering edit mode

6.9 years ago

Istvan Albert 100k

Grep is not all that well suited for filtering VCF files, or in general, column-oriented files where you'd want to apply the filter on just one column. Perhaps you could extend the pattern to include the leading and trailing tabs.

See bcftools for a whole range of options for filtering that initially may look a little more complicated yet will serve you well as soon as you want to apply more diverse filters:

https://samtools.github.io/bcftools/bcftools.html

ADD COMMENT • link 6.9 years ago by Istvan Albert 100k

0

Entering edit mode

I had also used bcftools isec but it gave empty intersection files. I want to compare two VCF. These files my have same position but different rsIDs on that position.

ADD REPLY • link 6.9 years ago by BAGeno ▴ 190

0

Entering edit mode

Have you tried vcf-compare from vcftools?

ADD REPLY • link 6.9 years ago by dschika ▴ 320

0

Entering edit mode

On the theme of filtering column-oriented files, if you are prepared to code stuff up yourself, consider awk within the shell, or modules like csv within python etc.

ADD REPLY • link 6.9 years ago by Joe 21k

score 0 · Answer 2 · 2017-05-24

0

Entering edit mode

6.9 years ago

dschika ▴ 320

What about vcftools:

--positions

Include or exclude a set of sites on the basis of a list of positions in a file. Each line of the input file should contain a (tab-separated) chromosome and position. The file can have comment lines that start with a "#", they will be ignored.

ADD COMMENT • link 6.9 years ago by dschika ▴ 320

0

Entering edit mode

I tried this command but it did not work.

 vcftools --vcf myNew.vcf --positions Chr.pos.tsv --out Filtered_File

ADD REPLY • link 6.9 years ago by BAGeno ▴ 190

0

Entering edit mode

Try adding:

--recode

ADD REPLY • link 6.9 years ago by dschika ▴ 320

0

Entering edit mode

This is giving VCF file with just header.

ADD REPLY • link 6.9 years ago by BAGeno ▴ 190

0

Entering edit mode

Hm...could you please double check:

How does Chr.pos.tsv look like? Have you included chromosomes? Are chromosome and position tab-separated?

How does myNew.vcf look like? Is there more than just the header-section? Are there chromosome-position pairs matching your Chr.pos.tsv file?

ADD REPLY • link 6.9 years ago by dschika ▴ 320

0

Entering edit mode

This is what myNew.vcf look like.

##bcftools_filterCommand=filter -i DP>=20 myNew.raw30.vcf
##bcftools_filterCommand=filter -i GQ>=30 New.DP20.vcf
##liftOverProgram=CrossMap(https://sourceforge.net/projects/crossmap/)
##liftOverTime=April01,2017
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  LP6005441-DNA_G11
chr1    10492   rs55998931      C       T       182.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=2.356;ClippingRankSum=0;DB;DP=42;ExcessHet=3.0103;FS=0;MLEAC=1;
chr1    10583   rs58108140      G       A       170.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-3.006;ClippingRankSum=0;DB;DP=37;ExcessHet=3.0103;FS=0;MLEAC=1
chr1    12783   rs62635284      G       A       522.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=4.292;ClippingRankSum=0;DB;DP=29;ExcessHet=3.0103;FS=0;MLEAC=1;
chr1    13116   rs62635286      T       G       125.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-0.301;ClippingRankSum=0;DB;DP=28;ExcessHet=3.0103;FS=8.671;MLE
chr1    13118   rs62028691      A       G       125.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-1.009;ClippingRankSum=0;DB;DP=27;ExcessHet=3.0103;FS=5.521;MLE
chr1    13273   rs531730856     G       C       675.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-2.577;ClippingRankSum=0;DB;DP=46;ExcessHet=3.0103;FS=6.947;MLE

This is what Chr.pos.tsv look like.

##reference=file:///groups/reich/reference-genomes/hs37d5/unzipped/hs37d5.fa
##bcftools_viewVersion=1.2+htslib-1.2.1
##bcftools_viewCommand=view -h /n/data1/hms/genetics/reich/1000Genomes/cteam_remap/A-samples/LP6005441-DNA_G11/annotvcf/LP6005441-DNA_G11.annotated.vcf.bgz
##liftOverProgram=CrossMap(https://sourceforge.net/projects/crossmap/)
##liftOverFile=/data1/Sarah/database/hg19ToHg38.over.chain.gz
##new_reference_genome=/data1/Sarah/database/new_hg38.fa
##liftOverTime=May22,2017
#CHROM  POS
1       10248
1       10321
1       10327
1       10492
1       10583
1       12783
1       13116
1       13118
1       13273

ADD REPLY • link 6.9 years ago by BAGeno ▴ 190

0

Entering edit mode

Well, "chr1" is not equal "1" ...

ADD REPLY • link 6.9 years ago by dschika ▴ 320

WouterDeCoster · Answer 3 · 2017-05-24

0

Entering edit mode

6.9 years ago

Maxime Lamontagne ★ 2.3k

Why not use R with the merge command?

data1 <- read.table("VCF File", header=F, sep="\t")

data2 <- read.table("Position File", header=F, sep="\t") 
merge.results <- merge(data2, data1, by.x="V1", by.y="V2", all=F)

write.table(merge.results, "VCF Subset", col.names=F, row.names=F, quote=F, sep="\t")

ADD COMMENT • link updated 6.9 years ago by WouterDeCoster 47k • written 6.9 years ago by Maxime Lamontagne ★ 2.3k

0

Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k