Question: Extract a specific subsequence from a fasta file
3.7 years ago, PAn (United States) wrote:

Hi,

I have a 14 bp sequence, and I need to extract it together with the 10 bp flanking it on each side from a set of FASTA sequences. Is there a tool that can trim each read down to the region around a specific sequence string?

The substring is GCACGAAGTTTTGC, and a sample read from the FASTA file looks like this:

>58_1_uni0133|DBLa
TGTAAGCTTAGTCACAAATTCCATACTAATATAACATATGAATACGAGAGAGATCCTTGT
CATGGAAGAAAAGAAAATCGTTTTGATGAAAATGAGGAATTTGAATGTGGAACTAAAATA
CGTGATTATAATAAAAAAGATTCTGGTACAGCATGTGCACCATTCAGAAGACAAAATATG
TGTGATAAAAATTTAGAATATTTGATCAATAAAAACACAGAAAATACTGATGATTTGTTA
GGAAATGTATTGGTTACAGCAAAATATGAAGGTGAATCTATTGTTGCGAAGCATCCACAT
AAAGACAATTCACAAGTATGTACTGCACTTGCACGAAGTTTTGCAGATATAGGAGATATT
GTAAGAGGAAGAGATATGTTTTTACCTAATAAGGATGATAAAGTACAAAAAGGACTACAA
GTAGTTTTCGAGAAAATAAATAATGGATTGAAGAAAATAGGAATTAATGCTTATAATGAT
GGATCTGGAAATTATTCTAAATTAAGAGAAGTTTGGTGGAATGTGAATAGAGACCAGGTA
TGGAGAGCTATAACATGTTCAGCACCAGGTGATGTTAATTATTTTAGAAAAATTTCAGGA
GACACTAGGACCTTTGAAAA

The FASTA file has around 170 reads of variable length, 600 to 800 bp. I looked for existing tools but had no success. Is there a better alternative than writing my own code?

Thanks! Ankita

3.7 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

using awk:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta | awk -F '\t' '{x=index($2,"GCACGAAGTTTTGC");if(!x) next; B=(x<=10?1:x-10);E=x+10+14;printf("%s\n%s\n",$1,substr($2,B,E-B));}'

>58_1_uni0133|DBLa
TACTGCACTTGCACGAAGTTTTGCAGATATAGGA
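
For readers who prefer a script to a one-liner, the same idea can be sketched in Python: join the wrapped sequence lines of each record, find the first occurrence of the motif, and cut a window of 10 bp on each side, clamped to the ends of the read. This is a minimal sketch rather than a translation of the awk command; the function names `read_fasta` and `extract_with_flanks` are made up for illustration, and the demo input is only a two-line fragment of the sample read from the question.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_with_flanks(seq, motif, flank=10):
    """Return the motif plus up to `flank` bases on each side, or None."""
    start = seq.find(motif)  # 0-based index of the first occurrence
    if start < 0:
        return None
    left = max(0, start - flank)  # clamp at the start of the read
    right = min(len(seq), start + len(motif) + flank)  # clamp at the end
    return seq[left:right]


# Demo on a fragment (two wrapped lines) of the read shown in the question.
example = """>58_1_uni0133|DBLa
GGAAATGTATTGGTTACAGCAAAATATGAAGGTGAATCTATTGTTGCGAAGCATCCACAT
AAAGACAATTCACAAGTATGTACTGCACTTGCACGAAGTTTTGCAGATATAGGAGATATT
"""

for header, seq in read_fasta(example.splitlines()):
    window = extract_with_flanks(seq, "GCACGAAGTTTTGC")
    if window is not None:
        print(header)
        print(window)
```

To run it on a real file, the `example.splitlines()` argument could be replaced with an open file handle. Only the first occurrence of the motif in each read is reported, as in the awk version.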

This works perfectly!

Reply written 3.7 years ago by PAn
3.7 years ago, IP (Denmark/University of Copenhagen) wrote:

You would need to code a little, but this is how I would proceed.

First, identify the locations of your sequence in the FASTA file using a script from an older post, "Finding a sequence in a fasta file", which is available here: https://github.com/dariober/bioinformatics-cafe/tree/master/fastaRegexFinder

This outputs a BED file with the locations of the sequence in your FASTA file. Then, with a little programming, you can subtract 10 from the start and add 10 to the end of each location in the BED file, and finally use getfasta from bedtools to extract the sequences you want:

http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html
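
The padding step can be sketched in Python as follows. This is only an illustration under stated assumptions: it locates the motif with a plain string search instead of fastaRegexFinder, the function name `padded_bed_intervals` and the record name `read1` are hypothetical, and the intervals use BED conventions (0-based start, end-exclusive), clamped so they never run past either end of a read.

```python
def padded_bed_intervals(records, motif, flank=10):
    """Yield (name, start, end) for each motif hit, padded by `flank` bases.

    Coordinates follow BED conventions: 0-based start, end-exclusive.
    """
    for name, seq in records:
        pos = seq.find(motif)
        while pos != -1:
            start = max(0, pos - flank)  # do not go below 0
            end = min(len(seq), pos + len(motif) + flank)  # nor past the read
            yield name, start, end
            pos = seq.find(motif, pos + 1)  # look for further occurrences


# Hypothetical toy record: 15 bases, the motif, then only 5 trailing bases.
motif = "GCACGAAGTTTTGC"
records = [("read1", "A" * 15 + motif + "C" * 5)]

for name, start, end in padded_bed_intervals(records, motif):
    print(f"{name}\t{start}\t{end}")
```

Written to a file, lines like these could then be passed to bedtools with something like `bedtools getfasta -fi input.fasta -bed padded.bed`, provided the first column matches the sequence names as bedtools reads them from the FASTA headers.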


Okay, @Pierre Lindenbaum's answer is going to be much faster than mine.

Reply written 3.7 years ago by IP