Question: Grep Position List
0
gravatar for BAGeno
3.2 years ago by
BAGeno170
BAGeno170 wrote:

Hi,

I have VCF file and list of position file. I want to grep those records from VCF which have position corresponding to position present in position file. For this purpose I used this command.

grep -wfF Position.txt myVCF > insec.vcf

here is the list of position.

`POS
10248
10321
10327
10492
10583
12783
13116
13118
13273
13302
14354
14542
14907
14930
15029
15118
15190
15208
15211
15274
`

Here is the result

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  LP6005441-DNA_G11
chr1    10492   rs55998931      C       T       182.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=2.356;ClippingRankSum=0;DB;DP=4
chr1    10583   rs58108140      G       A       170.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-3.006;ClippingRankSum=0;DB;DP=
chr1    12783   rs62635284      G       A       522.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=4.292;ClippingRankSum=0;DB;DP=2
chr1    13116   rs62635286      T       G       125.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-0.301;ClippingRankSum=0;DB;DP=
chr1    13118   rs62028691      A       G       125.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-1.009;ClippingRankSum=0;DB;DP=
chr1    13273   rs531730856     G       C       675.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-2.577;ClippingRankSum=0;DB;DP=
chr1    13302   rs180734498     C       T       135.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=1.88;ClippingRankSum=0;DB;DP=41
chr1    13417   rs777038595     C       CGAGA   606.73  PASS   AC=1;AF=0.5;AN=2;BaseQRankSum=-2.053;ClippingRankSum=0;DB;DP=
chr1    13896   rs201696125     C       A       195.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-1.19;ClippingRankSum=0;DB;DP=5
chr1    14699   rs62635298      C       G       160.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=1.278;ClippingRankSum=0;DB;DP=3
chr1    14907   rs6682375       A       G       922.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-0.114;ClippingRankSum=0;DB;DP=

But it is giving some extra records as well like record containing 13415,13896. These positions are not present in list of positions. What shall I do to overcome this problem?

grep vcf • 2.2k views
ADD COMMENTlink modified 3.2 years ago by Maxime Lamontagne2.2k • written 3.2 years ago by BAGeno170

Hi,

Use same command (without -w) by including <tab>YOUR_POSITION<tab> in 'Position.txt' file. This might work.

Example to add tabs:

sed -i 's/^/\t/' Position.txt

sed -i 's/$/\t/' Position.txt
ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by EagleEye6.6k

It worked :) Thank you

ADD REPLYlink written 3.2 years ago by BAGeno170
0
gravatar for Istvan Albert
3.2 years ago by
Istvan Albert ♦♦ 84k
University Park, USA
Istvan Albert ♦♦ 84k wrote:

Grep is not all that well suited for filtering VCF files, or in general, column-oriented files where you'd want to apply the filter on just one column. Perhaps you could extend the pattern to include the leading and trailing tabs.

See bcftools for a whole range of options for filtering that initially may look a little more complicated yet will serve you well as soon as you want to apply more diverse filters:

https://samtools.github.io/bcftools/bcftools.html

ADD COMMENTlink written 3.2 years ago by Istvan Albert ♦♦ 84k

I had also used bcftools isec but it gave empty intersection files. I want to compare two VCF. These files my have same position but different rsIDs on that position.

ADD REPLYlink written 3.2 years ago by BAGeno170

Have you tried vcf-compare from vcftools?

ADD REPLYlink written 3.2 years ago by dschika300

On the theme of filtering column-oriented files, if you are prepared to code stuff up yourself, consider awk within the shell, or modules like csv within python etc.

ADD REPLYlink written 3.2 years ago by Joe17k
0
gravatar for dschika
3.2 years ago by
dschika300
European Union
dschika300 wrote:

What about vcftools:

--positions

Include or exclude a set of sites on the basis of a list of positions in a file. Each line of the input file should contain a (tab-separated) chromosome and position. The file can have comment lines that start with a "#", they will be ignored.

ADD COMMENTlink written 3.2 years ago by dschika300

I tried this command but it did not work.

 vcftools --vcf myNew.vcf --positions Chr.pos.tsv --out Filtered_File
ADD REPLYlink written 3.2 years ago by BAGeno170

Try adding:

--recode
ADD REPLYlink written 3.2 years ago by dschika300

This is giving VCF file with just header.

ADD REPLYlink written 3.2 years ago by BAGeno170

Hm...could you please double check:

How does Chr.pos.tsv look like? Have you included chromosomes? Are chromosome and position tab-separated?

How does myNew.vcf look like? Is there more than just the header-section? Are there chromosome-position pairs matching your Chr.pos.tsv file?

ADD REPLYlink written 3.2 years ago by dschika300

This is what myNew.vcf look like.

##bcftools_filterCommand=filter -i DP>=20 myNew.raw30.vcf
##bcftools_filterCommand=filter -i GQ>=30 New.DP20.vcf
##liftOverProgram=CrossMap(https://sourceforge.net/projects/crossmap/)
##liftOverTime=April01,2017
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  LP6005441-DNA_G11
chr1    10492   rs55998931      C       T       182.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=2.356;ClippingRankSum=0;DB;DP=42;ExcessHet=3.0103;FS=0;MLEAC=1;
chr1    10583   rs58108140      G       A       170.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-3.006;ClippingRankSum=0;DB;DP=37;ExcessHet=3.0103;FS=0;MLEAC=1
chr1    12783   rs62635284      G       A       522.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=4.292;ClippingRankSum=0;DB;DP=29;ExcessHet=3.0103;FS=0;MLEAC=1;
chr1    13116   rs62635286      T       G       125.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-0.301;ClippingRankSum=0;DB;DP=28;ExcessHet=3.0103;FS=8.671;MLE
chr1    13118   rs62028691      A       G       125.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-1.009;ClippingRankSum=0;DB;DP=27;ExcessHet=3.0103;FS=5.521;MLE
chr1    13273   rs531730856     G       C       675.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-2.577;ClippingRankSum=0;DB;DP=46;ExcessHet=3.0103;FS=6.947;MLE

This is what Chr.pos.tsv look like.

##reference=file:///groups/reich/reference-genomes/hs37d5/unzipped/hs37d5.fa
##bcftools_viewVersion=1.2+htslib-1.2.1
##bcftools_viewCommand=view -h /n/data1/hms/genetics/reich/1000Genomes/cteam_remap/A-samples/LP6005441-DNA_G11/annotvcf/LP6005441-DNA_G11.annotated.vcf.bgz
##liftOverProgram=CrossMap(https://sourceforge.net/projects/crossmap/)
##liftOverFile=/data1/Sarah/database/hg19ToHg38.over.chain.gz
##new_reference_genome=/data1/Sarah/database/new_hg38.fa
##liftOverTime=May22,2017
#CHROM  POS
1       10248
1       10321
1       10327
1       10492
1       10583
1       12783
1       13116
1       13118
1       13273
ADD REPLYlink written 3.2 years ago by BAGeno170

Well, "chr1" is not equal "1" ...

ADD REPLYlink written 3.2 years ago by dschika300
0
gravatar for Maxime Lamontagne
3.2 years ago by
Québec
Maxime Lamontagne2.2k wrote:

Why not use R with the merge command?

data1 <- read.table("VCF File", header=F, sep="\t")

data2 <- read.table("Position File", header=F, sep="\t") 
merge.results <- merge(data2, data1, by.x="V1", by.y="V2", all=F)

write.table(merge.results, "VCF Subset", col.names=F, row.names=F, quote=F, sep="\t")
ADD COMMENTlink modified 3.2 years ago by WouterDeCoster44k • written 3.2 years ago by Maxime Lamontagne2.2k

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 3.2 years ago by WouterDeCoster44k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 694 users visited in the last hour