I have a FASTA file in this format:
>ENSMUSG00000035352|ENSMUSG00000035352.4|ENSMUST00000000194|ENSMUST00000000194.4
AGAGACACTGGTTCCTGACTCCTCTAGCTTTCATTTCGAAGTCTTTGACCTCAACATGAA
GATTTCCACACTTCTATGCCTCCTGCTCATAGCTACCACCATCAGTCCTCAGGTATTGGC
TGGACCAGGTAAGGCTTCTCCTTCCCCGATCTCTCAGCACATTCCAGAGCTGCTGCATGG
TCTGCAGTTTTGAAGAGACTCGGGTTACAAGAGGAAATAAAAAGAGTCATTGTACTTAGT
GACATAGTGAACCACAACCAAGTGTTCAGAGCACTGAGTCCAGGCCCTGTGATGTTTCGT
GACACCAGCTCTGGGAAAAAGATGTTGAATGCAGCCTCTACAGAAACTGCCATCTCCCTG
TTTGGAGAGGTGACCCAGAAGTGTCTCTGTAGTGAGAAGGACATCTCAGCAATAGAGGAG
TAAAAGATATGATGGCTTGAAGAAGAGTTTCAGGGTCATAGTCCAGAGTGCCTTCAAGAG
CAGCAGCCACATGTAGATACTAGAGATTCTTCTTAAACTCTAAGCCCACAGCAGCAAGGT
GGCAAACAGCAAGTTCTTAGAACTTCTTTTCCTGTGTCAGAAATTGATGGGATTTTTCCA
TATGGAATTAACAGCAAGTACATATTCTATAATATTCTGTGACCAGGCTCTAGATACAGA
AGTTGGGAGCCTTAACTCTTAGGTATAGTCCAGCATTCTTCCTTCCCTTGTGAGAGCACC
CTGCCATGTCTCCTAACTGTCTCTCTCCTTGCAAAATTATTTTCAGATGCGGTGAGCACC
CCAGTCACGTG
.
.
.
This format repeats throughout the file, which contains hundreds of such sequences. I would like to extract specific information from each sequence and write it to two new files. First, I want just the sequence ID (>ENSMUSG00000035352 in the example above, i.e. the header up to the first "|") and the first 200 base pairs of each sequence, written to a new file. Second, I want the same thing, but with the last 200 base pairs of each sequence instead. The output would thus look like this:
First 200-file:
>ENSMUSG00000035352
AGAGACACTGGTTCCTGACTCCTCTAGCTTTCATTTCGAAGTCTTTGACCTCAACATGAA
GATTTCCACACTTCTATGCCTCCTGCTCATAGCTACCACCATCAGTCCTCAGGTATTGGC
TGGACCAGGTAA
>ENSMUSGxxxxxxxxxxxx
~~first 200 bp of next sequence
.
.
.
Last 200-file:
>ENSMUSG00000035352
AGTTGGGAGCCTTAACTCTTAGGTATAGTCCAGCATTCTTCCTTCCCTTGTGAGAGCACC
CTGCCATGTCTCCTAACTGTCTCTCTCCTTGCAAAATTATTTTCAGATGCGGTGAGCACC
CCAGTCACGTG
>ENSMUSGxxxxxxxxxxxx
~~last 200 bp of next sequence
.
.
.
Does anyone know of a way to achieve this? I suspect it is doable with regular expressions in Python, but I am not sure how to write a regex that handles these requirements for every sequence, since there are so many in the same file. If there are any pipelines or tools other than Python regexes that could do this, I would be happy to hear about them. Thank you for your time and help!
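For illustration, the task described above could probably be sketched in plain Python with string slicing rather than regular expressions. This is only a rough sketch under some assumptions: every header starts with ">" and uses "|" to separate its fields, each sequence fits in memory, and the output should be wrapped at 60 characters per line like the input. The function names (read_fasta, format_record, split_ends) are hypothetical and not from any library.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def format_record(header, seq, width=60):
    """Return one FASTA record, keeping only the ID before the first '|'."""
    seq_id = header.split("|", 1)[0]
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{seq_id}\n{body}\n"


def split_ends(in_path, first_path, last_path, n=200):
    """Write the first and the last n bases of every record to two files."""
    with open(in_path) as src, \
            open(first_path, "w") as first, \
            open(last_path, "w") as last:
        for header, seq in read_fasta(src):
            # Slicing keeps the whole sequence when it is shorter than n.
            first.write(format_record(header, seq[:n]))
            last.write(format_record(header, seq[-n:]))
```

With placeholder file names, a call such as split_ends("input.fasta", "first200.fasta", "last200.fasta") would produce the two files. Sequences shorter than 200 bases end up in both files unchanged, which may or may not be the desired behaviour.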