Extract User Defined Region From An Fasta File
5
11
Entering edit mode
10.4 years ago
Palu ▴ 290

I have a file with multi fasta format sequence

>lyrata FEN1
KEDIEKYSKRTVKVTKQHNDDCKRLLRLMGVPVVEATSEAEAQCAALCKSGKVYGVASED
MDSLTFGAPKFLRHLMDPSSRKIPVMEFEVAKILEELQLTMDQFIDLCILSGCDYCDSIR
GIGGQTALKLIRQHGSIETILENLNKERYQIPEEWPYNEARKLFKEPDVITDEEQLDIKW
TSPDEEGIVQFLVNENGFNIDRVTKAIEKIKTAKNKSSQGRLESFFKPVANSSVPAKRKG
NLICLVVSEVIPQYTGSQGTKCLLRQYKPAISNKVSSLHAFGWNSSKNIVSIRLFSLSWG
FNLSEIPESTTKGAANKKTKGAGGRKKK

>Arabidopsis thaliana FEN1
KEDIEKYSKRTVKVTKQHNDDCKRLLRLMGVPVVEATSEAEAQCAALCKSGKVYGVASED
MDSLTFGAPKFLRHLMDPSSRKIPVMEFEVAKILEELQLTMDQFIDLCILSGCDYCDSIR
GIGGQTALKLIRQHGSIETILENLNKERYQIPEEWPYNEARKLFKEPDVITDEEQLDIKW
TSPDEEGIVQFLVNENGFNIDRVTKAIEKIKTAKNKSSQGRLESFFKPVANSSVPAKRKG
NLICLVASEVIPNILDHRVPNACSGSINQPSAIKSPLCMLLDGTPPITLLALGCSLYPGL
PTFRFTWLEEIPESTTKGAANKKTKGAGGRKKK

>Brachypodium distachyon - II FEN1
MGIKGLTKLLAEHAPRAAAQRRVEDYRGRVIAIDASLSIYQFLVVVGRKGTEVLTNEAEG
LTVDCYARFVFDGEPPDLKKRELAKRSLRRDDASEDLNRAIEVGDEDSIEKFSKRTVKIT
KKHNDDCKKLLRLMGVPVVEAPGEAEAQCASLCKNHKAYAVASEDMDSLTFGSLRFLRHI
TDLSFKRSPVTEFEVPKVLEELGLTMDQFIDLCILSGCDYCENIKGIGGQRALKLIRQHG
CIEEVVQNLNNRFTVPEDWPYQEVRTLFKEPNVCTEIPDFQWTSVDKEGIVNFLAIENSF
SSDRVEKAVQKIKAARDRYSPGRVKLLTPVANLSGSITKKEPECVLGSSGQGLKSWSALQ


and i want to extract sequence of defined region. these parameters are present in other file

lyrata            1      108
lyrata             147    234
Arabidopsis          1    108
Arabidopsis        147    234
Brachypodium         1     93
Brachypodium       132    219


I have asked similar question before and got an awk solution but it gives error. Please help me. i am a window user.

sequence fasta • 52k views
1
Entering edit mode

HI, thanks for answering my previous and this question.. I found a very good solution of my problem ans love to share with ther have same problem. visit this page. it's working good

0
Entering edit mode

This is not a discussion forum, It's Q&A. Youve posted this as an 'answer' which doesn't make sense. To reply to another answer use the comment button underneath it, like I am here.

26
Entering edit mode
10.4 years ago
Fred ▴ 760

You can also use the samtools package. Samtools allows you to index a fasta file. Here is an excerpt from the manual:

samtools faidx ref.fasta [region1[...]] : Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format.

Once you indexed your reference, you can query it.

For example:

samtools faidx reference.fasta
samtools faidx reference.fasta lyrata:1-108


Samtools is designed to handle nucleotides, so I am not sure that it will work with protein sequences.

To do it (in perl) for several positions (given in a file formated like in your question):

open(POSITIONS,"positions.txt");
while(<POSITIONS>){
chomp;
my ($seqName,$begin,$end) = split(/\s/); open(SAMTOOLS,"/path/to/samtools faidx /path/to/reference.fasta$seqName:$begin-$end |");
while(my $line = <SAMTOOLS>){ print$line;
}
close(SAMTOOLS);
}
close(POSITIONS);

1
Entering edit mode

can it take batch query present in a file??

0
Entering edit mode

you can always use a small perl script with a loop to read in your query one by one and split out the output one by one

0
Entering edit mode

I edited my post to take into account your comment, and thus do it for several coordinates.

0
Entering edit mode

Hi, can u please provide me a script for extracting region as follow from a file:

>gi|546867469
TAGCTACATCAGTTGAAGATGGGGTTGCTGGGTAGCTGTCAGGCATAGGAGTCCCATCTTTCTCGGAAAG
AATGACAATAAGGTTGTTAAAGGGCAATCAGTCAAGGACAGGGGGAGCCACTCAAATCAGCACATGTGGC
ATGACAGTGATGCAGCATTCCCCTCAACCACCAATTGCTCCCAGGAATAGCACCTGGCTGGTCTGGCATC
TCGGGATGAGGGCTAAGGCACAGTGCCATTAGTGCCAAGAACCAAAGATCTGAGTCTTTGCAGACCCTAA
AAATTAAATCCTTTCTGTGCCCCGGGTGGAGGGGGAGAGTTCAAGTCACTTTATAATATTTGCCTCATAT
ACATATCACCTGTCCCCTTCCACCTCCTTTTTCTTGCAGTCCTAATTATCCAAGTAGCATCTACAACTCT


region to extract from 150 to 160 20-20 up/down stream

Kindly help me to extract this region. with perl script or awk command.

Thanku.

0
Entering edit mode

Hi,

I have a genome sequence of different chromosome file and a region file. I need to fetch sequence according region position.

kindly help if you got solution already using perl or any other.

file 1:

5087 5287
20254 23084
28710 28811
30468 30566


File 2:

>Pt 1154478:1
ATGGGCGAACGACGGGAATTGAACCCGCGATGGTGAATTCACAATCCACTGCCTTAATCC
ACTTGGCTACATCCGCCCCTACGCTACTATCTATTCTTTTTTGTATTGTCTAAAAAAAAA
AAAAAATACAAATTTCAATAAAAAATAAAAAAAGGTAGCAAATTCCACCTTATTTTTTTT
CTAATAAAAAATATATAGTAATTTTTTATTATTTATTATTATTATTTATTATTAATATAA
TAAATAAAGTAAAATATGATACTCTATAAAAATTTGCTCATTTTTATAGAAAAAAACGAG
TAATATAAGCCCTCTTTCTTATTTAAAGAAGGCTTATATTGCTCGTTTTTTACTAAACTA
GATCTAGACTAACACTAACGAATTATCCATTTGTAGATGGAGCCTCAACAGCAGCTAGGT
CTAGAGGGAAGTTGTGAGCATTACGTTCATGCATAACTTCCATACCAAGGTTAGCACGGT
TAATAATATCAGCCCAAGTATTAATAACACGTCCTTGACTATCAACTACTGATTGGTTGA

0
Entering edit mode

Samtools is not a great solution for a Windows user. You might use pyfaidx, since it works well on Windows.

faidx --delimiter=' ' --bed regions.bed input.fasta


The above command will split your fasta identifiers on spaces, so that you can use the regions you provided to query just the species names.

0
Entering edit mode

0
Entering edit mode

If you"ve installed pyfaidx using pip install --user pyfaidx then the script will be at $HOME/.local/bin/faidx and if installed using sudo pip install pyfaidx the script will be at /usr/local/bin/faidx ADD REPLY 0 Entering edit mode Can you do this with Rsamtools or another Bioconductor packages? ADD REPLY 6 Entering edit mode 10.3 years ago Christian ★ 3.0k BioPerl is your friend: use Bio::DB::Fasta; my$db = Bio::DB::Fasta->new('myfile.fasta');
my $seq =$db->seq('lyrata', 1 => 108);

2
Entering edit mode
4.3 years ago
tuomastik ▴ 20

It's also possible to use bedtools getfasta, which:

extracts sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file.

1
Entering edit mode
10.4 years ago
Vitis ★ 2.5k

the command-line blast package will do. first use formatdb with -o option to build a database of your fasta, then use fastacmd with -s -L options to get the regions.

0
Entering edit mode

could u give full command example for that...

thanks

0
Entering edit mode

-o option showing unknown argument error

1
Entering edit mode
6.3 years ago
daniel ▴ 30

Just thought I'd give my C++ library, Bioio, a mention here. Although primarily a C++ API for FASTA/Q I/O, I also include a small command line application demo fasta.cpp which can be used to retrieve indexed FASTA subsequences, e.g.

./fasta human_g1k_v37.fasta X:10000-20000


Like samtools, the result is output to stdout, but unlike samtools only the sequence itself is output, which may be more useful if piping to other applications etc.