python script for deleting basepair in fasta format
3
0
Entering edit mode
8.5 years ago

I have a multiple fasta format sequence in a notepad, I want to delete rest of the base above 500 bp, in each sequence. what script I can use.

thanks

sequence python • 2.7k views
1
Entering edit mode

A simple sed command might do the trick if your fasta sequences are on one line (i.e. not wrapped):

sed --regexp-extended '/^>/! s/^(.{1,500}).*/\1/' input.fasta > output.fasta
2
Entering edit mode

Or using awk (I like awk!):

awk '{print /^>/ ? $0 : substr($1, 1, 500)}' input.fasta > output.fasta
0
Entering edit mode

Thanks for the quick reply.

Can I use this command on CentOS Linux terminal directly or do I need to download any other package?

0
Entering edit mode

These commands should work on most Linux or Unix computers without installing any additional package.

0
Entering edit mode

Hi Frederic,

I have one more problem, Hope you can help me with that too

I have a single contig of around 5000 bp, which looks like

>contig 00001
ACTTTCAAAAAAATCAGCGTAAAAAACATACTAATTTGGGCAAATTCCCACCTGTTTTTAGGGACATTTT
TCTTTGAATTAGAGCCTCAGCAGCTCGTCATTGCTGAATTTTCTTGAAGTTGCCTCGTTTCTAGGCGCTT
ACGCTGGTCGTAGCCGGTTTCTGACTACGAATGATTTAAAAATTTAGTTTGTCATCTATGCTTCGTACAT
AGATAAAGATAATAATGCCTTTCCATCTATTTTAGATTGAGAGAGGCTTTTTCTGTCATGATTAAATTAG
GAAAGGAGTGAGCACTACTTTGGAAAATATTGAAGAATTATGGAGCGCTACATTAAAAAAAATTGAAGAA
AAATTGAGTAAACCTAGCTTTGATACGTGGCTTAAAAATACAAAAGCAGAAGCATTAGAAAAAGATACAC
TAATTATTTCTGCTCCAAATGAATTCGCAAGAGATTGGTTAGAAAATCAATACACGAACTTAATTTCTCA
GATGCTACTTGAAGTTACAGGATCTGAATTAAATACAAAATTTATCATTCCTGATTCATTAGAAGAAATA
GAAGAGCAAAAGCCTATGCCAAAACCAAAGCAGTCTACAGATACAGGTGACTCACCAAAATCGATGCTTA
ACTCAAAATATACTTTTGATACATTTGTCATTGGAGCTGGTAATCGTTTCGCTCATGCTGCTTCATTAGC


I want to have a multiple contigs of 100 bp, like

>contig 00001
ACTTTCAAAAAAATCAGCGTAAAAAACATACTAATTTGGGCAAATTCCCACCTGTTTTTAGGGACATTTTTCTTTGAATTAGAGCCTCAGCAGCTCGTCATTGCTGAATTTTCTTGAAGTTGCCTCGTTTCTAGGCGCTT
>contig 00002
ACGCTGGTCGTAGCCGGTTTCTGACTACGAATGATTTAAAAATTTAGTTTGTCATCTATGCTTCGTACATAGATAAAGATAATAATGCCTTTCCATCTATTTTAGATTGAGAGAGGCTTTTTCTGTCATGATTAAATTAG
>contig 00003
GAAAGGAGTGAGCACTACTTTGGAAAATATTGAAGAATTATGGAGCGCTACATTAAAAAAAATTGAAGAAAAATTGAGTAAACCTAGCTTTGATACGTGGCTTAAAAATACAAAAGCAGAAGCATTAGAAAAAGATACAC
>contig 00004
TAATTATTTCTGCTCCAAATGAATTCGCAAGAGATTGGTTAGAAAATCAATACACGAACTTAATTTCTCAGATGCTACTTGAAGTTACAGGATCTGAATTAAATACAAAATTTATCATTCCTGATTCATTAGAAGAAATA
>contig 00005
GAAGAGCAAAAGCCTATGCCAAAACCAAAGCAGTCTACAGATACAGGTGACTCACCAAAATCGATGCTTAACTCAAAATATACTTTTGATACATTTGTCATTGGAGCTGGTAATCGTTTCGCTCATGCTGCTTCATTAGC


and so on...

What command should I use?

Best!

0
Entering edit mode
0
Entering edit mode

thanks Frederic, it works... thanks alot.

1
Entering edit mode

Cross posted (to Linkedin???)

1
Entering edit mode
8.5 years ago
Cytosine ▴ 460

Don't know of any scripts that do that, but it's really easy to write up one yourself. Try this one with using BioPython...

from Bio import SeqIO
import sys

fastafile, length = sys.argv[1], int(sys.argv[2])
entries = SeqIO.parse(fastafile, "fasta")

for entry in entries:
print ">"+entry.name
print entry.seq[:length]


Save it to a file (e.g. parse_fasta.py) and then run it in a terminal (or cmd if you're on windows) with:

python parse_fasta.py <input.fasta> <sequence_lengths> > <output.fasta>

0
Entering edit mode

Or slightly more elegantly using SeqIO.write and a generator expression,

from Bio import SeqIO
import sys
fastafile, length = sys.argv[1], int(sys.argv[2])
trimmed = (entry[:length] for entry in SeqIO.parse(fastafile, "fasta"))
SeqIO.write(trimmed, sys.stdout, "fasta")

0
Entering edit mode

I am not a bioinformatician, so I don't know where to use this command, I am using CentOS Linux system, if I open terminal window, can I directly type the above script or I need to download any package before using the script.

Thanks.

1
Entering edit mode

Running this is as simple as saving the script to a file (like parse_fasta.py from my post) and running the python interpreter over that script (with the commands I mentioned at the end of my post).

If you have a fresh installation, you'll probably be missing the Biopython library. In that case you can install it by following the steps mentioned here: http://centosn00b.blogspot.be/2012/05/biopython-tip.html

Remember to prepend "sudo" to the commands if you do not have administrator privileges.

1
Entering edit mode

Good information, but I would caution against advising beginning users to "make things work" but adding sudo.

0
Entering edit mode
8.5 years ago

pip install pyfaidx

python

And then paste the script, changing your file name.