Modify reference fasta using bed file
Entering edit mode
3.9 years ago
Rubal ▴ 350

Hello Everyone,

I would like to modify a reference genome fasta file using a list of position in bed file format. The bed file contains positions like this:

chr1    17716   C       G
chr1    17925   A       G
chr1    18115   C       T

So for example I would like to change the position at chr1 17716 from C to G in the fasta file. I have seen some python packages and a GATK tool for modifying a fasta file using a list of VCF positions but was wondering if there was any tool that would do this using a bed file as input rather than a VCF. Or would it be better to convert the bed file to VCF format by adding a VCF header (I'm not sure if this would work?).

Thanks in advance for your help.

fasta bed genome • 1.5k views
Entering edit mode
3.9 years ago
AK ★ 2.2k

If SNPs only:

$ cat example.fa
$ cat example.bed
chr1    2   C   G
chr1    4   C   G
chr1    6   C   G
chr2    1   A   T
chr2    3   A   T
chr2    5   A   T

$ seqkit fx2tab example.fa >
$ awk 'FILENAME=="" {fa[$1]=$2; next} {fa[$1]=substr(fa[$1], 1, $2-1) $4 substr(fa[$1], $2+1, length(fa[$1])-$2)} END {for (id in fa){print ">" id "\n" fa[id]}}' example.bed
Entering edit mode

this works well thank you! I am trying to run it changing over a million positions in a 2gb fasta file. Do you recommend any change in approach for this? I'm a little concerned as it produces no intermediate output that I can't tell how long it will take. For example would sorting the example.bed speed it up or does it search through the file for each position without sorting making a difference. The only improvement I can think of would be to sort and split the fasta and bed files into smaller pieces, run independently and then merge at the end.


Login before adding your answer.

Traffic: 2664 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6