Question: Extract User Defined Region From An Fasta File
9
gravatar for Palu
7.7 years ago by
Palu240
Palu240 wrote:

I have a file with multi fasta format sequence

>lyrata FEN1
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKPPELKRQELAKRYSKRADATADLTGAIEAGN
KEDIEKYSKRTVKVTKQHNDDCKRLLRLMGVPVVEATSEAEAQCAALCKSGKVYGVASED
MDSLTFGAPKFLRHLMDPSSRKIPVMEFEVAKILEELQLTMDQFIDLCILSGCDYCDSIR
GIGGQTALKLIRQHGSIETILENLNKERYQIPEEWPYNEARKLFKEPDVITDEEQLDIKW
TSPDEEGIVQFLVNENGFNIDRVTKAIEKIKTAKNKSSQGRLESFFKPVANSSVPAKRKG
NLICLVVSEVIPQYTGSQGTKCLLRQYKPAISNKVSSLHAFGWNSSKNIVSIRLFSLSWG
FNLSEIPESTTKGAANKKTKGAGGRKKK

>Arabidopsis thaliana FEN1
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKPPELKRQELAKRYSKRADATADLTGAIEAGN
KEDIEKYSKRTVKVTKQHNDDCKRLLRLMGVPVVEATSEAEAQCAALCKSGKVYGVASED
MDSLTFGAPKFLRHLMDPSSRKIPVMEFEVAKILEELQLTMDQFIDLCILSGCDYCDSIR
GIGGQTALKLIRQHGSIETILENLNKERYQIPEEWPYNEARKLFKEPDVITDEEQLDIKW
TSPDEEGIVQFLVNENGFNIDRVTKAIEKIKTAKNKSSQGRLESFFKPVANSSVPAKRKG
NLICLVASEVIPNILDHRVPNACSGSINQPSAIKSPLCMLLDGTPPITLLALGCSLYPGL
PTFRFTWLEEIPESTTKGAANKKTKGAGGRKKK

>Brachypodium distachyon - II FEN1
MGIKGLTKLLAEHAPRAAAQRRVEDYRGRVIAIDASLSIYQFLVVVGRKGTEVLTNEAEG
LTVDCYARFVFDGEPPDLKKRELAKRSLRRDDASEDLNRAIEVGDEDSIEKFSKRTVKIT
KKHNDDCKKLLRLMGVPVVEAPGEAEAQCASLCKNHKAYAVASEDMDSLTFGSLRFLRHI
TDLSFKRSPVTEFEVPKVLEELGLTMDQFIDLCILSGCDYCENIKGIGGQRALKLIRQHG
CIEEVVQNLNNRFTVPEDWPYQEVRTLFKEPNVCTEIPDFQWTSVDKEGIVNFLAIENSF
SSDRVEKAVQKIKAARDRYSPGRVKLLTPVANLSGSITKKEPECVLGSSGQGLKSWSALQ
VCRSSSSDFRYRSYYSSKPLVLGRQSGFHGIPHAFSLI

and i want to extract sequence of defined region. these parameters are present in other file

lyrata            1      108 
lyrata             147    234 
Arabidopsis          1    108 
Arabidopsis        147    234 
Brachypodium         1     93 
Brachypodium       132    219

I have asked similar question before and got an awk solution but it gives error. Please help me. i am a window user.

fasta sequence • 35k views
ADD COMMENTlink modified 21 months ago by tuomastik10 • written 7.7 years ago by Palu240
1

HI, thanks for answering my previous and this question.. I found a very good solution of my problem ans love to share with ther have same problem. visit this page. it's working good

ADD REPLYlink written 7.7 years ago by Palu240

This is not a discussion forum, It's Q&A. Youve posted this as an 'answer' which doesn't make sense. To reply to another answer use the comment button underneath it, like I am here.

ADD REPLYlink written 7.7 years ago by Karl330
21
gravatar for Fred
7.7 years ago by
Fred700
Paris, France
Fred700 wrote:

You can also use the samtools package. Samtools allows you to index a fasta file. Here is an excerpt from the manual:

samtools faidx ref.fasta [region1 [...]] : Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are speficified, the subsequences will be retrieved and printed to stdout in the FASTA format.

Once you indexed your reference, you can query it.

For example:

samtools faidx reference.fasta
samtools faidx reference.fasta lyrata:1-108

Samtools is designed to handle nucleotides, so I am not sure that it will work with protein sequences.

To do it (in perl) for several positions (given in a file formated like in your question):

open(POSITIONS,"positions.txt");
while(<POSITIONS>){
    chomp;
    my ($seqName,$begin,$end) = split(/\s/);
    open(SAMTOOLS,"/path/to/samtools faidx /path/to/reference.fasta $seqName:$begin-$end |");
    while(my $line = <SAMTOOLS>){
    print $line;
    }
    close(SAMTOOLS);
}
close(POSITIONS);
ADD COMMENTlink modified 7.7 years ago • written 7.7 years ago by Fred700
1

can it take batch query present in a file??

ADD REPLYlink written 7.7 years ago by Palu240

you can always use a small perl script with a loop to read in your query one by one and split out the output one by one

ADD REPLYlink written 7.7 years ago by Vitis2.1k

I edited my post to take into account your comment, and thus do it for several coordinates.

ADD REPLYlink written 7.7 years ago by Fred700

hi, can u please provide me a script for extracting region as follow from a file :

>gi|546867469
TAGCTACATCAGTTGAAGATGGGGTTGCTGGGTAGCTGTCAGGCATAGGAGTCCCATCTTTCTCGGAAAG
AATGACAATAAGGTTGTTAAAGGGCAATCAGTCAAGGACAGGGGGAGCCACTCAAATCAGCACATGTGGC
ATGACAGTGATGCAGCATTCCCCTCAACCACCAATTGCTCCCAGGAATAGCACCTGGCTGGTCTGGCATC
TCGGGATGAGGGCTAAGGCACAGTGCCATTAGTGCCAAGAACCAAAGATCTGAGTCTTTGCAGACCCTAA
AAATTAAATCCTTTCTGTGCCCCGGGTGGAGGGGGAGAGTTCAAGTCACTTTATAATATTTGCCTCATAT
ACATATCACCTGTCCCCTTCCACCTCCTTTTTCTTGCAGTCCTAATTATCCAAGTAGCATCTACAACTCT


region to extract from
150 to 160  20-20 up/down stream

kindly help me to extract this region. with perl script or awk command. 

Thanku.

ADD REPLYlink written 3.7 years ago by harpreetmanku0410

Hi,

I have a genome sequence of different chromosome file and a region file. I need to fetch sequence according region position.

kindly help if you got solution already using perl or any other.

file 1:

5087 5287
20254 23084
28710 28811
30468 30566

File 2:

>Pt 1154478:1
ATGGGCGAACGACGGGAATTGAACCCGCGATGGTGAATTCACAATCCACTGCCTTAATCC
ACTTGGCTACATCCGCCCCTACGCTACTATCTATTCTTTTTTGTATTGTCTAAAAAAAAA
AAAAAATACAAATTTCAATAAAAAATAAAAAAAGGTAGCAAATTCCACCTTATTTTTTTT
CTAATAAAAAATATATAGTAATTTTTTATTATTTATTATTATTATTTATTATTAATATAA
TAAATAAAGTAAAATATGATACTCTATAAAAATTTGCTCATTTTTATAGAAAAAAACGAG
TAATATAAGCCCTCTTTCTTATTTAAAGAAGGCTTATATTGCTCGTTTTTTACTAAACTA
GATCTAGACTAACACTAACGAATTATCCATTTGTAGATGGAGCCTCAACAGCAGCTAGGT
CTAGAGGGAAGTTGTGAGCATTACGTTCATGCATAACTTCCATACCAAGGTTAGCACGGT
TAATAATATCAGCCCAAGTATTAATAACACGTCCTTGACTATCAACTACTGATTGGTTGA

 

 

ADD REPLYlink written 3.5 years ago by Manoj30

    Please help me. i am a window user.

Samtools is not a great solution for a Windows user. You might use pyfaidx, since it works well on Windows.

faidx --delimiter=' ' --bed regions.bed input.fasta

The above command will split your fasta identifiers on spaces, so that you can use the regions you provided to query just the species names.

ADD REPLYlink written 3.6 years ago by Matt Shirley8.9k

pyfaidx showing command not found

ADD REPLYlink written 3.6 years ago by Manoj30

If you're installed pyfaidx using pip install --user pyfaidx then the script will be at $HOME/.local/bin/faidx and if installed using sudo pip install pyfaidx the script will be at /usr/local/bin/faidx

ADD REPLYlink written 3.6 years ago by Matt Shirley8.9k

Can you do this with Rsamtools or another Bioconductor packages?

ADD REPLYlink written 16 months ago by cristian220
5
gravatar for Christian
7.7 years ago by
Christian2.8k
Cambridge, US
Christian2.8k wrote:

BioPerl is your friend:

use Bio::DB::Fasta;
my $db = Bio::DB::Fasta->new('myfile.fasta');
my $seq = $db->seq('lyrata', 1 => 108);

see also http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/Fasta.html

ADD COMMENTlink modified 7.7 years ago by Casey Bergman18k • written 7.7 years ago by Christian2.8k
1
gravatar for Vitis
7.7 years ago by
Vitis2.1k
New York
Vitis2.1k wrote:

the command-line blast package will do. first use formatdb with -o option to build a database of your fasta, then use fastacmd with -s -L options to get the regions.

ADD COMMENTlink written 7.7 years ago by Vitis2.1k

could u give full command example for that...

thanks

ADD REPLYlink written 3.6 years ago by Manoj30

-o option showing unknown argument error

ADD REPLYlink written 3.6 years ago by Manoj30
1
gravatar for daniel
3.6 years ago by
daniel30
United Kingdom
daniel30 wrote:

Just thought I'd give my C++ library, Bioio, a mention here. Although primarily a C++ API for FASTA/Q I/O, I also include a small command line application demo fasta.cpp which can be used to retrieve indexed FASTA subsequences, e.g.

./fasta human_g1k_v37.fasta X:10000-20000

Like samtools, the result is output to stdout, but unlike samtools only the sequence itself is output, which may be more useful if piping to other applications etc.

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by daniel30
1
gravatar for tuomastik
21 months ago by
tuomastik10
Finland
tuomastik10 wrote:

It's also possible to use bedtools getfasta, which:

extracts sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file.

ADD COMMENTlink written 21 months ago by tuomastik10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1513 users visited in the last hour