python script for deleting basepair in fasta format
3
0
Entering edit mode
8.5 years ago

I have a multiple fasta format sequence in a notepad, I want to delete rest of the base above 500 bp, in each sequence. what script I can use.

thanks

sequence python • 2.7k views
ADD COMMENT
1
Entering edit mode

A simple sed command might do the trick if your fasta sequences are on one line (i.e. not wrapped):

sed --regexp-extended '/^>/! s/^(.{1,500}).*/\1/' input.fasta > output.fasta
ADD REPLY
2
Entering edit mode

Or using awk (I like awk!):

awk '{print /^>/ ? $0 : substr($1, 1, 500)}' input.fasta > output.fasta
ADD REPLY
0
Entering edit mode

Thanks for the quick reply.

Can I use this command on CentOS Linux terminal directly or do I need to download any other package?

ADD REPLY
0
Entering edit mode

These commands should work on most Linux or Unix computers without installing any additional package.

ADD REPLY
0
Entering edit mode

Hi Frederic,

I have one more problem, Hope you can help me with that too

I have a single contig of around 5000 bp, which looks like

>contig 00001
ACTTTCAAAAAAATCAGCGTAAAAAACATACTAATTTGGGCAAATTCCCACCTGTTTTTAGGGACATTTT
TCTTTGAATTAGAGCCTCAGCAGCTCGTCATTGCTGAATTTTCTTGAAGTTGCCTCGTTTCTAGGCGCTT
ACGCTGGTCGTAGCCGGTTTCTGACTACGAATGATTTAAAAATTTAGTTTGTCATCTATGCTTCGTACAT
AGATAAAGATAATAATGCCTTTCCATCTATTTTAGATTGAGAGAGGCTTTTTCTGTCATGATTAAATTAG
GAAAGGAGTGAGCACTACTTTGGAAAATATTGAAGAATTATGGAGCGCTACATTAAAAAAAATTGAAGAA
AAATTGAGTAAACCTAGCTTTGATACGTGGCTTAAAAATACAAAAGCAGAAGCATTAGAAAAAGATACAC
TAATTATTTCTGCTCCAAATGAATTCGCAAGAGATTGGTTAGAAAATCAATACACGAACTTAATTTCTCA
GATGCTACTTGAAGTTACAGGATCTGAATTAAATACAAAATTTATCATTCCTGATTCATTAGAAGAAATA
GAAGAGCAAAAGCCTATGCCAAAACCAAAGCAGTCTACAGATACAGGTGACTCACCAAAATCGATGCTTA
ACTCAAAATATACTTTTGATACATTTGTCATTGGAGCTGGTAATCGTTTCGCTCATGCTGCTTCATTAGC

I want to have a multiple contigs of 100 bp, like

>contig 00001
ACTTTCAAAAAAATCAGCGTAAAAAACATACTAATTTGGGCAAATTCCCACCTGTTTTTAGGGACATTTTTCTTTGAATTAGAGCCTCAGCAGCTCGTCATTGCTGAATTTTCTTGAAGTTGCCTCGTTTCTAGGCGCTT
>contig 00002
ACGCTGGTCGTAGCCGGTTTCTGACTACGAATGATTTAAAAATTTAGTTTGTCATCTATGCTTCGTACATAGATAAAGATAATAATGCCTTTCCATCTATTTTAGATTGAGAGAGGCTTTTTCTGTCATGATTAAATTAG
>contig 00003
GAAAGGAGTGAGCACTACTTTGGAAAATATTGAAGAATTATGGAGCGCTACATTAAAAAAAATTGAAGAAAAATTGAGTAAACCTAGCTTTGATACGTGGCTTAAAAATACAAAAGCAGAAGCATTAGAAAAAGATACAC
>contig 00004
TAATTATTTCTGCTCCAAATGAATTCGCAAGAGATTGGTTAGAAAATCAATACACGAACTTAATTTCTCAGATGCTACTTGAAGTTACAGGATCTGAATTAAATACAAAATTTATCATTCCTGATTCATTAGAAGAAATA
>contig 00005
GAAGAGCAAAAGCCTATGCCAAAACCAAAGCAGTCTACAGATACAGGTGACTCACCAAAATCGATGCTTAACTCAAAATATACTTTTGATACATTTGTCATTGGAGCTGGTAATCGTTTCGCTCATGCTGCTTCATTAGC

and so on...

What command should I use?

Best!

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

thanks Frederic, it works... thanks alot.

ADD REPLY
1
Entering edit mode

Cross posted (to Linkedin???)

ADD REPLY
1
Entering edit mode
8.5 years ago
Cytosine ▴ 460

Don't know of any scripts that do that, but it's really easy to write up one yourself. Try this one with using BioPython...

from Bio import SeqIO
import sys

fastafile, length = sys.argv[1], int(sys.argv[2])
entries = SeqIO.parse(fastafile, "fasta")

for entry in entries:
    print ">"+entry.name
    print entry.seq[:length]

Save it to a file (e.g. parse_fasta.py) and then run it in a terminal (or cmd if you're on windows) with:

python parse_fasta.py <input.fasta> <sequence_lengths> > <output.fasta>
ADD COMMENT
0
Entering edit mode

Or slightly more elegantly using SeqIO.write and a generator expression,

from Bio import SeqIO
import sys
fastafile, length = sys.argv[1], int(sys.argv[2])
trimmed = (entry[:length] for entry in SeqIO.parse(fastafile, "fasta"))
SeqIO.write(trimmed, sys.stdout, "fasta")
ADD REPLY
0
Entering edit mode

I am not a bioinformatician, so I don't know where to use this command, I am using CentOS Linux system, if I open terminal window, can I directly type the above script or I need to download any package before using the script.

Thanks.

ADD REPLY
1
Entering edit mode

Running this is as simple as saving the script to a file (like parse_fasta.py from my post) and running the python interpreter over that script (with the commands I mentioned at the end of my post).

If you have a fresh installation, you'll probably be missing the Biopython library. In that case you can install it by following the steps mentioned here: http://centosn00b.blogspot.be/2012/05/biopython-tip.html

Remember to prepend "sudo" to the commands if you do not have administrator privileges.

ADD REPLY
1
Entering edit mode

Good information, but I would caution against advising beginning users to "make things work" but adding sudo.

ADD REPLY
0
Entering edit mode
8.5 years ago

pip install pyfaidx

python

And then paste the script, changing your file name.

ADD COMMENT

Login before adding your answer.

Traffic: 754 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6