Question: how to extract particular regions of a sequence from fasta file if its range for each sequence from fasta file is given in another txt file for all the sequences contains in fasta file ?
1
gravatar for ak93sharma
23 months ago by
ak93sharma10
ak93sharma10 wrote:

Hi folks, I have a fasta file ( 716_seq.fasta ) with 716 sequences I want a specific region which is different in all 716 sequences and I have the range in a different txt file (range.txt ) for each 716 sequence. I am able to do this for only for a single range for all the sequences by providing its range in code. I want it takes second range for second sequence and 3rd range for 3rd sequence, and so on. Please help.

716_seq.fasta file like this:

>header_seq1_>xyz
ATGACATGACGATC
ATCGTGACGTACGT
ATCGA
>header_seq2_>mno
ATCGGCGGTATTTA
ACGGTGGA
>header_seq3_>pqr
ATCGATCAGTACGA
ACGATGACGAT

range.txt file like this:

4 -6
7- 8
2- 5

the possible outcome should be like this:

>header_seq1_>xyz
ACA
>header_seq2_>mno
GG
>header_seq3_>pqr
TCGA

i write a code which follows as

grep -v ">" pv.fasta |sed ':a;N;$!ba;s/\n//g' | cut -c34655-36384

but it gives same region ( takes same range 34655 to 36384) for all the sequeces and output follows as :

TGAAAACCTTATATTCCCTGAGGAGGTTCTACCCCGTGGAAACGCTCTTTAATGGAACTTTAGCTTTAGCTGGTCGTGACCAAGAAACCACCGGTTTCGCTTGGTGGGCCGGGAATGCCCGACTTATCAATTTATCTGGTAAACTACTTGGGGCTCATGTAGCCCATGCCGGATTAATCGTATTCTGGGCCGGAGCAATGACCCTATTTGAAGTGGCTCATTTCCTACCAGAGAAACCCATGTATGAACAAGGCTTGATTTTACTTCCGCACCTAGCGACTCTAGGTTGGGGGGTAGGTCCTGGTGGGGAAGTTATAGATACCTTTCCATACTTTGTATCTGGAGTACTTCACCTAATTTCCTCCGCAGTATTGGGCTTTGGCGGTATTTATCATGCACTTCTGGGCCCCGAGACTCTTGAAGAATCTTTTCCATTCTTCGGTTATGTATGGAAAGATAGAAATAAAATGACCACAATTTTAGGTATTCACTTAGTCTTGTTAGGTATAGGTGCTTTTCTTCTAGTATTCAAGGCTCTTTATTTTGGAGGCGTATATGATACCTGGGCTCCGGGCGGGGGGGATGTAAGAAAAATAACCAACTTAACCCTCAGTCCAGGCGTTATATTTGGTTATTTACTAAAATCGCCCTTTGGAGGAGAAGGCTGGATTGTTAGTGTGGATGATTTGGAAGATATAATTGGAGGGCATGTATGGTTAGGCTCCATTTGTATACTTGGTGGAGTTTGGCATATCTTAACCAAACCTTTTGCATGGGCTCGCCGTGCACTTGTATGGTCTGGTGAGGCTTACTTGTCTTATAGTTTAGGGGCTTTATCTGTCTTTGGCTTCATTGCTTGTTGCTTTGTATGGTTCAACAACACCGCCTATCCTAGTGAGTTTTACGGGCCAACTGGGCCAGAAGCTTCTCAAGCTCAAGCATTTACTTTTCTAGTTAGAGACCAACGTCTTGGGGCTAATGTAGGATCTGCTCAAGGACCTACTGGTTTAGGTAAATATCTAATGCGCTCTCCGACTGGAGAAGTGATTTTTGGAGGAGAAACTATGCGCTTTTGGGATCTGCGCGCTCCTTGGTTAGAACCTCTAAGGGGTCCAAATGGTTTGGACTTGAGTAGGCTGAAAAAAGACATACAACCTTGGCAAGAACGGCGTTCCGCAGAATATATGACTCATGCTCCTTTAGGTTCTTTAAATTCCGTGGGTGGCGTAGCTACCGAGATCAATGCAGTCAACTATGTCTCTCCTAGAAGTTGGTTAGCTACTTCGCATTTTGTTCTCGGGTTCTTCCTATTCGTAGGTCATCTGTGGCACGCGGGAAGGGCTCGTGCAGCTGCAGCAGGATTTGAAAAAGGAATCGATCGCGATTTTGAACCTGTTCTTTCCATGACCCCTCTTAACTGAGACAGGCGATCAGATGTTTGACATAGGAATCTCCAACATACAATACATATTGGGACCGGGTCATACTTAAAAAGTATTCGTTATTCCTTATCTTTTTTTTTTCAATCTATATCTAAATCGAATCTATTTTTTCTGGCTCGGCTATTCCACCTAGCCGAGCCATTCCGCCTTTTGGCCGGGCAAAACCGATAAAGAAATCTATTCGTCGAGCAAAAAAAGGAGAGAGAGGGATTCGAACCCTCGATAGTTCTTTGTTTAGAACTATACCGGTTTTCAAGACCGGAGCTATCAACCGCTCGGCCATCTCTC
ADD COMMENTlink modified 23 months ago by Paul1.2k • written 23 months ago by ak93sharma10
3

This looks like one of the problems from your classroom assignments.

What do you mean by "I am able to do this for only a single sequence by providing its range its takes huge time if I extract particular region one by one" ? Are you doing it manually ? What else did you try? Did you try to write a program, if yes, share the code and the exact error message. Or, did you try to google?

Biostars is not meant for providing ready answers.

ADD REPLYlink written 23 months ago by Vijay Lakhujani3.4k
1

This could easily be done using a (Bio)python script. Let me know if you can't figure it out.

ADD REPLYlink written 23 months ago by WouterDeCoster35k

i have already written a code which gives me same region from all sequence , my code follows as

grep -v ">" pv.fasta |sed ':a;N;$!ba;s/\n//g' | cut -c34655-36384

and gives output :

TGAAAACCTTATATTCCCTGAGGAGGTTCTACCCCGTGGAAACGCTCTTTAATGGAACTTTAGCTTTAGCTGGTCGTGACCAAGAAACCACCGGTTTCGCTTGGTGGGCCGGGAATGCCCGACTTATCAATTTATCTGGTAAACTACTTGGGGCTCATGTAGCCCATGCCGGATTAATCGTATTCTGGGCCGGAGCAATGACCCTATTTGAAGTGGCTCATTTCCTACCAGAGAAACCCATGTATGAACAAGGCTTGATTTTACTTCCGCACCTAGCGACTCTAGGTTGGGGGGTAGGTCCTGGTGGGGAAGTTATAGATACCTTTCCATACTTTGTATCTGGAGTACTTCACCTAATTTCCTCCGCAGTATTGGGCTTTGGCGGTATTTATCATGCACTTCTGGGCCCCGAGACTCTTGAAGAATCTTTTCCATTCTTCGGTTATGTATGGAAAGATAGAAATAAAATGACCACAATTTTAGGTATTCACTTAGTCTTGTTAGGTATAGGTGCTTTTCTTCTAGTATTCAAGGCTCTTTATTTTGGAGGCGTATATGATACCTGGGCTCCGGGCGGGGGGGATGTAAGAAAAATAACCAACTTAACCCTCAGTCCAGGCGTTATATTTGGTTATTTACTAAAATCGCCCTTTGGAGGAGAAGGCTGGATTGTTAGTGTGGATGATTTGGAAGATATAATTGGAGGGCATGTATGGTTAGGCTCCATTTGTATACTTGGTGGAGTTTGGCATATCTTAACCAAACCTTTTGCATGGGCTCGCCGTGCACTTGTATGGTCTGGTGAGGCTTACTTGTCTTATAGTTTAGGGGCTTTATCTGTCTTTGGCTTCATTGCTTGTTGCTTTGTATGGTTCAACAACACCGCCTATCCTAGTGAGTTTTACGGGCCAACTGGGCCAGAAGCTTCTCAAGCTCAAGCATTTACTTTTCTAGTTAGAGACCAACGTCTTGGGGCTAATGTAGGATCTGCTCAAGGACCTACTGGTTTAGGTAAATATCTAATGCGCTCTCCGACTGGAGAAGTGATTTTTGGAGGAGAAACTATGCGCTTTTGGGATCTGCGCGCTCCTTGGTTAGAACCTCTAAGGGGTCCAAATGGTTTGGACTTGAGTAGGCTGAAAAAAGACATACAACCTTGGCAAGAACGGCGTTCCGCAGAATATATGACTCATGCTCCTTTAGGTTCTTTAAATTCCGTGGGTGGCGTAGCTACCGAGATCAATGCAGTCAACTATGTCTCTCCTAGAAGTTGGTTAGCTACTTCGCATTTTGTTCTCGGGTTCTTCCTATTCGTAGGTCATCTGTGGCACGCGGGAAGGGCTCGTGCAGCTGCAGCAGGATTTGAAAAAGGAATCGATCGCGATTTTGAACCTGTTCTTTCCATGACCCCTCTTAACTGAGACAGGCGATCAGATGTTTGACATAGGAATCTCCAACATACAATACATATTGGGACCGGGTCATACTTAAAAAGTATTCGTTATTCCTTATCTTTTTTTTTTCAATCTATATCTAAATCGAATCTATTTTTTCTGGCTCGGCTATTCCACCTAGCCGAGCCATTCCGCCTTTTGGCCGGGCAAAACCGATAAAGAAATCTATTCGTCGAGCAAAAAAAGGAGAGAGAGGGATTCGAACCCTCGATAGTTCTTTGTTTAGAACTATACCGGTTTTCAAGACCGGAGCTATCAACCGCTCGGCCATCTCTC

ADD REPLYlink modified 23 months ago • written 23 months ago by ak93sharma10
1

samtools faidx is your friend (as are biopython and pyfasta).

ADD REPLYlink written 23 months ago by Devon Ryan86k

For basic tasks like this, there are already a couple of tools present which are open source. For your task, checkout, bedtools getfasta here.

ADD REPLYlink written 23 months ago by Vijay Lakhujani3.4k
1

bedtools getfasta isn't going to help OP. He needs a different interval from each fasta entry.

ADD REPLYlink modified 23 months ago • written 23 months ago by WouterDeCoster35k
4
gravatar for Paul
23 months ago by
Paul1.2k
European Union
Paul1.2k wrote:

Hi,

what about to try some pure bash scripting. Lets assume your range is called range.txt a fasta file is fasta.fa. Put all files in one directory and make solution.sh executable (chmod +x solution.sh):

cat solution.sh

#!/bin/bash
lines=$(awk 'END {print NR}' range.txt)
for ((a=1; a<= $lines ; a++))
 do 
 number=$(awk -v lines=$a 'NR == lines' range.txt)
 grep -v ">" fasta.fa | awk -v lines=$a 'NR == lines' | cut -c$number
done;

run it like

./solution.sh

This would be probably slow for very huge fasta files.

Cheers.

ADD COMMENTlink modified 19 months ago by RamRS19k • written 23 months ago by Paul1.2k

@Paul , Thank you so much, its works well. I made a sample file before running it to a huge fasta file containing 716 genomes.

sample fasta and range file.

cat fasta.fa
>header1
ABCDEFGHIJGGGGGGGGGGG
>header2
KLMNOPQRSTUVWCCCCCCC


cat range.txt 
4-6
2-5

I made code like this

cat fasta.fa |sed -n 2~2p |sed '1q;d'|cut -c4-6
output:
DEF
cat fasta.fa |sed -n 2~2p |sed '2q;d'|cut -c2-5
output:
LMNO

I want to pass variable $L and $R in such way in my code that desired result will come, please help me out.

  R in `cat range.txt`
       L in {1..2}


cat fasta.fa |sed -n 2~2p |sed '$Lq;d'|cut -c$R
ADD REPLYlink modified 23 months ago • written 23 months ago by ak93sharma10

You are welcome. I tried it on the huge fasta file and it seems to me it works fast. Did my solution works?

ADD REPLYlink modified 23 months ago • written 23 months ago by Paul1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 824 users visited in the last hour