Question

Extracting portions of multiple sequences from one FASTA file

0

Entering edit mode

2.2 years ago

julienw • 0

I have a FASTA file in this format:

>ENSMUSG00000035352|ENSMUSG00000035352.4|ENSMUST00000000194|ENSMUST00000000194.4
AGAGACACTGGTTCCTGACTCCTCTAGCTTTCATTTCGAAGTCTTTGACCTCAACATGAA
GATTTCCACACTTCTATGCCTCCTGCTCATAGCTACCACCATCAGTCCTCAGGTATTGGC
TGGACCAGGTAAGGCTTCTCCTTCCCCGATCTCTCAGCACATTCCAGAGCTGCTGCATGG
TCTGCAGTTTTGAAGAGACTCGGGTTACAAGAGGAAATAAAAAGAGTCATTGTACTTAGT
GACATAGTGAACCACAACCAAGTGTTCAGAGCACTGAGTCCAGGCCCTGTGATGTTTCGT
GACACCAGCTCTGGGAAAAAGATGTTGAATGCAGCCTCTACAGAAACTGCCATCTCCCTG
TTTGGAGAGGTGACCCAGAAGTGTCTCTGTAGTGAGAAGGACATCTCAGCAATAGAGGAG
TAAAAGATATGATGGCTTGAAGAAGAGTTTCAGGGTCATAGTCCAGAGTGCCTTCAAGAG
CAGCAGCCACATGTAGATACTAGAGATTCTTCTTAAACTCTAAGCCCACAGCAGCAAGGT
GGCAAACAGCAAGTTCTTAGAACTTCTTTTCCTGTGTCAGAAATTGATGGGATTTTTCCA
TATGGAATTAACAGCAAGTACATATTCTATAATATTCTGTGACCAGGCTCTAGATACAGA
AGTTGGGAGCCTTAACTCTTAGGTATAGTCCAGCATTCTTCCTTCCCTTGTGAGAGCACC
CTGCCATGTCTCCTAACTGTCTCTCTCCTTGCAAAATTATTTTCAGATGCGGTGAGCACC
CCAGTCACGTG
.
.
.

This format repeats itself throughout the file, which contains hundreds of such sequences. I would like to extract specific information from each sequence and write it to two new files. First, I want to get just the Sequence ID (from the example >ENSMUSG00000035352) and the first 200 base-pairs for each sequence, and write that to a new file. Second, I want to do the exact same thing, but only for the last 200 base-pairs of each sequence. The format would thus look like this:

First 200-file:

>ENSMUSG00000035352
AGAGACACTGGTTCCTGACTCCTCTAGCTTTCATTTCGAAGTCTTTGACCTCAACATGAA
GATTTCCACACTTCTATGCCTCCTGCTCATAGCTACCACCATCAGTCCTCAGGTATTGGC
TGGACCAGGTAA
>ENSMUSGxxxxxxxxxxxx
~~first 200 bp's of next sequence
.
.
.

Last 200-file:

>ENSMUSG00000035352
AGTTGGGAGCCTTAACTCTTAGGTATAGTCCAGCATTCTTCCTTCCCTTGTGAGAGCACC
CTGCCATGTCTCCTAACTGTCTCTCTCCTTGCAAAATTATTTTCAGATGCGGTGAGCACC
CCAGTCACGTG
>ENSMUSGxxxxxxxxxxxx
~~last 200 bp's of next sequence
.
.
.

Does anyone know of a way to achieve this? I feel that it might be doable using regular expressions in Python, but I am not sure how to write out a regex that could handle these specifications for each sequence since there are so many in the same file. If there are any pipelines or tools that could carry out this work other than Python regex's, I would be happy to hear about them. Thank you for your time and help!

sequencing python regex DNA fasta • 814 views

ADD COMMENT • link updated 2.2 years ago by cpad0112 21k • written 2.2 years ago by julienw • 0

score 4 · Answer 1 · 2022-02-03

There are several ways. Easiest are:

with seqkit:

$ seqkit subseq -r 1:200 <input.fa> -o <output.fa> - For first two hundred
$ seqkit subseq -r -200:-1 <input.fa> -o <output.fa> - For last two hundred

with sed (if fasta sequences are in single line):

$ sed -r '/>/! s/^(.{200}).*/\1/' <input.fa> - For fist two hundred
$ sed -r '/>/! s/.*(.{200})$/\1/' <input.fa> - For last two hundred

with colrm (base distribution function on most of the *buntu) (works if headers are not more than 200 characters and sequences in single line:

$ colrm 201 < input.fa - For first two hundred
$ rev input.fa| colrm 201 | rev - For last two hundred

score 2 · Answer 2 · 2022-02-03

A variant of this question (changing fasta headers) gets asked at least once a week. If you search Biostars for the words I have in parenthesis, you will find a solution easily.

As to including only the first or last 200 residues, that may be somewhat unique. Still, it is simply a matter of reading a FASTA file and only saving a portion of the sequence. The code below needs BioPython, and will read the file specified as argument #1 and write the same file as argument #2. Note that variable name contains FASTA header up to the first space character, which can be manipulated to your liking before writing into a file. Likewise, the seq variable contains a sequence portion, which can also be manipulated to your liking before writing into a file. By adding 2-4 lines, you should be able to customize this so it does what you want.

import sys
from Bio import SeqIO

FastaFile = open(sys.argv[1], 'r')
FastaOutFile = open(sys.argv[2], 'w')

for seqs in SeqIO.parse(FastaFile, 'fasta'):
    name = seqs.id
    seq = seqs.seq
    FastaOutFile.write('>' + str(name) + '\n')
    FastaOutFile.write(str(seq) + '\n')

FastaFile.close()
FastaOutFile.close()