I tried bedtools getfasta and i get the errors that chromosome was not found in fasta file but I have triple checked it there is no blank space the chromosome name in bed file is exactly the same as in fasta file. I would like to know is there any alternatives other than using bedtools getfasta in order to extract the sequence.
Question: How to use Bed file to extract sequence from FASTA file?
2
allyson1115ar • 20 wrote:
ADD COMMENT
• link
•
modified 5.6 years ago
by
Maximilian Haeussler • 1.4k
•
written
5.7 years ago by
allyson1115ar • 20
1
Matt Shirley ♦ 9.5k wrote:
You might try the "faidx" command from https://github.com/mdshw5/pyfaidx. There is a --default-seq parameter that allows filling in nonexistent sequence, as well as a --lazy parameter that disables bounds checking. You can pass a bed file using --bed.
1
Maximilian Haeussler • 1.4k wrote:
Use a command like this:
twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit -bed=input.bed test.fa
or for a single region:
twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit test.fa -seq=chr21 -start=1 -end=10000
Requires the UCSC tool twoBitToFa, available from http://hgdownload.cse.ucsc.edu/admin/exe/
If you're not on the hg19 genome, you have to index your .fa file first with faToTwoBit
Please log in to add an answer.
Use of this site constitutes acceptance of our User
Agreement
and Privacy
Policy.
Powered by Biostar
version 2.3.0
Traffic: 1921 users visited in the last hour
samtools faidx extracts subsequence from indexed reference sequences (http://samtools.sourceforge.net/samtools.shtml)
Also use search feature to look for similar posts on this forum.
I tried. But can't solve it.
Solved well. Thanks.
There is one problem in this tool. The result is not accurate. The sequence extracted is not the same as in bed file coordinates
It's more likely that you made an error than that samtools faidx did...
This is a part of the output i get from samtools faidx:
>chr1:179757197-179758470:1251-1255
TGAGT
>chr1:201237463-201238874:41-45
>chN
>chr1:201237463-201238874:62-80
238874
TGCCACAGCTGN
>chr1:201237463-201238874:62-81
238874
You appear to have used a heavily mistyped command (e.g., chN, also the ranges are non-sensical). Post the exact command that you used and mention where you got the genome.