Getting Sequence Based On Chromosome No And Coordinates From Whole Genome Fasta File
12.1 years ago
Anurag ▴ 20

I have the chromosome number (or FASTA header), the coordinates, and their orientation. For example:

Chr:start:end:strand
chr8:1100023:1100050:+

(The list is quite long, about 15,000 coordinates.)

I also have the whole genome sequence in FASTA format in another file, for example:

>(fasta)chr8
sequence

Is there any tool, program, or utility available to extract the sequences from the genome sequence file on the basis of these coordinates?

Any help will be highly appreciated.

Thanks in advance

sequence • 14k views
12.1 years ago

Samtools can do this.

# First, index your FASTA file (you only have to do this once)
samtools faidx reference.fa

# Then extract the sequences you want (region format is chr:start-end, 1-based)
samtools faidx reference.fa chr8:1100023-1100050
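With about 15,000 coordinates you would not type each region by hand. A minimal sketch, assuming the input lines look exactly like the `chr:start:end:strand` example in the question: it rewrites each line as a samtools region string and keeps the two strands apart, because a region string carries no strand. The function name `to_regions` is hypothetical, not part of samtools.

```python
def to_regions(lines):
    """Group chr:start:end:strand lines into samtools region strings by strand."""
    regions = {"+": [], "-": []}
    for line in lines:
        line = line.strip()
        # Skip blank lines and an optional "Chr:start:end:strand" header row.
        if not line or line.lower().startswith("chr:start"):
            continue
        # rsplit keeps any extra ':' inside the sequence name intact.
        chrom, start, end, strand = line.rsplit(":", 3)
        regions[strand].append(f"{chrom}:{start}-{end}")
    return regions


example = ["Chr:start:end:strand", "chr8:1100023:1100050:+", "chr8:200:260:-"]
result = to_regions(example)
for strand, region_list in result.items():
    print(strand, region_list)
```

Each list can be written to a file, one region per line. Recent samtools releases document a `--region-file` option for `faidx` to read such a file and a `--reverse-complement` option for the minus-strand set; check `samtools faidx` help on your installed version before relying on either.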
12.1 years ago
Eric Fournier ★ 1.4k

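You could also convert your coordinates into the standard BED format, then use BEDtools' getfasta command to extract the sequences. BED starts are 0-based and the strand belongs in column 6, so, assuming your coordinates are 1-based and inclusive, subtract 1 from the start. The -s option then reverse-complements the minus-strand entries.

awk -F: 'BEGIN{OFS="\t"} {print $1, $2-1, $3, $0, 0, $4}' Your_coordinate_file > Coordinates.bed
bedtools getfasta -s -fi Your_fasta_File.fa -bed Coordinates.bed -fo Output_Sequences.fa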
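If the coordinate list may contain malformed lines, the conversion to BED can be done with some validation. This is only a sketch: it assumes the input is 1-based and inclusive (the question does not say), and `to_bed6` is a hypothetical helper name.

```python
def to_bed6(line):
    """Convert one chr:start:end:strand line to a tab-separated BED6 line.

    Assumption: input coordinates are 1-based and inclusive, so only the
    start is shifted to match BED's 0-based, half-open convention.
    """
    chrom, start, end, strand = line.strip().rsplit(":", 3)
    start, end = int(start), int(end)
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand {strand!r} in {line!r}")
    if not 1 <= start <= end:
        raise ValueError(f"bad interval {start}-{end} in {line!r}")
    name = f"{chrom}:{start}-{end}({strand})"
    # BED6 columns: chrom, start, end, name, score, strand
    return "\t".join([chrom, str(start - 1), str(end), name, "0", strand])


bed_line = to_bed6("chr8:1100023:1100050:+")
print(bed_line)
```

The interval length is unchanged by the conversion: 1100050 - 1100022 gives the same 28 bases as the inclusive range 1100023 to 1100050.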
12.1 years ago
Ido Tamir 5.2k

Look at extractseq from EMBOSS.

You could do something like this in bash (not tested; it assumes a FASTA file containing only chr8):

  extractseq chr8.fasta -reg $(awk 'BEGIN{FS=":" ; ORS=","}{print $2 ".." $3 }' inputfile) stdout -separate
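If none of these tools is installed, the same extraction can be sketched in plain Python. The assumptions here are mine, not the poster's: coordinates are 1-based and inclusive, minus-strand entries should be reverse-complemented, and the whole genome fits in memory. `read_fasta` and `extract` are hypothetical helper names.

```python
import io

# Complement table for the reverse strand; other characters pass through unchanged.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(handle):
    """Return {name: sequence}; the name is the header text up to the first space."""
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def extract(seqs, coord):
    """Return the subsequence for one chr:start:end:strand string."""
    chrom, start, end, strand = coord.strip().rsplit(":", 3)
    sub = seqs[chrom][int(start) - 1:int(end)]  # 1-based inclusive -> Python slice
    if strand == "-":
        sub = sub.translate(COMPLEMENT)[::-1]   # reverse complement
    return sub


# Toy genome standing in for the real FASTA file.
genome = read_fasta(io.StringIO(">chr8 toy example\nACGTAC\nGGTTAA\n"))
plus = extract(genome, "chr8:2:5:+")
minus = extract(genome, "chr8:2:5:-")
print(plus, minus)
```

For a real genome, pass an open file instead of the `io.StringIO` object. An indexed tool such as samtools or bedtools will be faster and lighter on memory for large genomes.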