Extracting snps from a range of coordinates
4
2
Entering edit mode
4.1 years ago
S AR ▴ 80

I have a list of genes. and i want to extract the snps and indels from my VCF file (that i generated using GATK pipeline ) from genes coordinates on . The list of genes coordinates:

Gene Name   Accession_no.   Start_Position   End_Position   Strand
Rv0194         NC_000962.3       226878         230462            +


I was looking bedtools but it is asking for .bed format of genes nd as well .bed of bam files. how to do it ? or any other options/tools/scripts?

Like i tried tabix:

bgzip ERR038736_UnifiedGenotyper_variants_raw_snp.vcf
tabix ERR038736_UnifiedGenotyper_variants_raw_snp.vcf.gz
tabix ERR038736_UnifiedGenotyper_variants_raw_snp.vcf.gz AL123456.3:226878-230462 > Rv0194


and this gave me the variants like this:

AL123456.3      227098  .       T       C       6730.77 .       AC=2;AF=1.00;AN=2;DP=172;Dels=0.
AL123456.3      228069  .       G       A       7132.77 .       AC=2;AF=1.00;AN=2;BaseQRankSum=-
AL123456.3      228168  .       G       C       6682.77 .       AC=2;AF=1.00;AN=2;DP=171;Dels=0.


But this is not a vcf file and i can only extract it one at a time. I want to extract all variants against a list of coordinates and store it in a vcf output.

Can anyone help me it this?

vcf SUB-SET SNP gene coordinates • 2.5k views
4
Entering edit mode
4.1 years ago

Hello,

you can convert this list into a valid bedfile by:

$cut -f2-4 genes.txt|tail -n+2 > genes.bed  You can than take this bed file with tabix $ tabix input.vcf.gz -R genes.bed


fin swimmer

0
Entering edit mode

can you please explain what -f2-4 is doing? and what -n+2 is doing

0
Entering edit mode

Sure :)

• cut -f2-4 select the columns 2 to 4 which contain the Accession_no., start and end position
• tail -n+2 prints all from the second line onward. This is necessary to get rid of the header line.

fin swimmer

0
Entering edit mode

Oh great it superb thanks ..

1
Entering edit mode

Hello angelshiza,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work. This will aid others who have similar problems in the future.

0
Entering edit mode

it doesnot give the output in vcf format and i want the vcf file in the end

1
Entering edit mode

Is this a comment to my answer?

tabix will output a vcf. If you like to output the header data as well type:

$tabix -h input.vcf.gz -R genes.bed  fin swimmer ADD REPLY 0 Entering edit mode Oh that is great thank you! ADD REPLY 3 Entering edit mode 4.1 years ago BEDOPS: $ vcf2bed --insertions < snps.vcf > insertions.bed
$vcf2bed --deletions < snps.vcf > deletions.bed$ vcf2bed --snvs < snps.vcf > snvs.bed


0
Entering edit mode
4.1 years ago

Check bedtools intersect angelshiza

0
Entering edit mode
4.1 years ago
Tm ★ 1.1k

If you have the gtf/gff3 file of gene annotation along with variant's .vcf file then I will suggest you to annotate your variants using snpeff as it will not only provide information on which SNPs belongs to which gene but it will also determine the effect of the variants.

0
Entering edit mode

No i dont have gff/gft. I have multiple genes with coordinate can i make gff/gft of many genes in one file ?? Can you share if you know.

Thank you Sar

0
Entering edit mode

How you have identified the genes? means have you used any specific tool for gene prediction?

0
Entering edit mode

i have the gene list with coordinates:

Gene Name   Accession_no.   Start_Position   End_Position   Strand
Rv0194         NC_000962.3       226878         230462            +
Rv3695         NC_000962.3       226888         230452            +
Rv1144         NC_000962.3       66878         98462            +