search text in one file and then replace with text from another file
4
0
Entering edit mode
4.1 years ago
mforthman ▴ 40

I have two DNA sequence files, one generated by a company based off a data file I sent them. The company's file has different sequence headers for some sequences than the data file I sent, and it's important that all of it conforms to the format I sent for certain programs to use it. Is there a way that I can search a part or all of the sequence header in the company files and then replace the entire line with the corresponding header from my original data file? Additional notes: 1) the company reverse complemented some of the sequences, and this was necessary. Thus, I do not want to alter the sequences from the company file, just get the headers looking like those from the original one. 2) Those sequences that were reverse complemented have an _rc appended to the end of the headers.

>uce-265_p7-|design:hemiptera-v1,designer:faircloth_rc
TGAGTATTCAATATTCCCTGCGCAATATTCAATGGACATACATGGCTATGTTCTTGTTTATTCTATTACATCACTTAAGTCATTCGAAGTTGTGCAGGTCATTTATGAAAAGTTGCTCGA


Original file

>uce-265_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-265,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold391,probes-global-start:41871,probes-global-end:41991,probes-local-start:0,probes-local-end:120
TCGAGCAACTTTTCATAAATGACCTGCACAACTTCGAATGACTTAAGTGATGTAATAGAATAAACAAGAACATAGCCATGTATGTCCATTGAATATTGCGCAGGGAATATTGAATACTCA


The company's headers should look like the original. My initial thought is an if/then loop with grep, but I'm having trouble imaging how this would work in this case.

search replace multiple files • 1.7k views
0
Entering edit mode

Correct, sequences were reverse complemented by the company as well in many cases.

0
Entering edit mode

What was the reason for reverse complementing the sequences?

0
Entering edit mode

prevent cross-hybridization potential

0
Entering edit mode

Not sure what you are referring to?

0
Entering edit mode

Posted a new solution to my initial answer, after you gave extra information - thanks!

2
Entering edit mode
4.1 years ago

One way to do this may be to attempt (try) to parse on the pipe character, and then just print out the header and sequence as-is if there is no split:

#!/usr/bin/env python

import sys

original_fn = sys.argv[1]
company_fn = sys.argv[2]

map = {}

with open(original_fn, "r") as original_fh:
for line in original_fh:
if line.startswith('>'):
try:
(k, v) = line.strip().split('|')
# remove trailing space from key
k = k[:-1]
map[k] = v
except ValueError as err:
k = line.strip()
map[k] = None

with open(company_fn, "r") as company_fh:
for line in company_fh:
if line.startswith('>'):
try:
(k, v) = line.strip().split('|')
# remove trailing character from key
k = k[:-1]
except ValueError as err:
k = line.strip()
if k not in map:
sys.stdout.write("%s\n" % (k))
else:
sys.stdout.write("%s |%s\n" % (k, map[k]))
else:
sys.stdout.write("%s" % (line))

0
Entering edit mode

PERFECT! Thank you and everyone else for all of your help.

0
Entering edit mode

How might this script be modified to deal with other headers that are formatted differently in the company.fasta file? Specifically, I'm targeting the following example headers now:

>Clavigralla_tomentosicollis_gi_512427358_gb_GAJX01007276.1_0_rc
TGAGAAGCTCTCGCAGTCACACTGGCCCACGACCACTCTCAGATATCACCTATATTGATGAAGACCCTGTGAAAACTCCATCGCCACACAATAACTCCCTAAACAATCCACGACTGAATC
>Anasa_tristis_comp3229_c0_seq1_0_rc
AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT
>Anoplocnemis_curvipes_gi_512406097_gb_GAJV01010223.1_70
ATCCTCTGTCTGACTTGAAATGGAAAGAAGTGAAACGTATGGCCCTTCATGAAATGGTTGAATATGTCACCACATACCATGATGTTATTACTGAAGTCATTTATCCTGAAGCTGTTAATA
>Alydus_pilosus_comp24037_c0_seq1_0
TTAAAAGAATTCAGAGTTGAGCAGTGCTCCTTGTTTCTCCAGCACAAATGTACACAACATCGTCCTTTCACCTGCTTTCATTGGCATTTTATGAATCAGAGACGTCGGAGGCCTGTAAGA


Some similarly formatted headers also include _A_ or _B_ before the last digit in the sequence names. I don't want these headers to be targeted. Example of such headers:

>Boisea_trivittata_comp15501_c0_seq2_A_39
CTGCGGTAACTGGTTCAGGGCCCCAGCCGCCCGCTCAGCCACCTACTTCAACGTCAGCAGACCCTGAGAAAAGGAAGCTCATCCAACAACAGCTCGTCCTCCTCCTGCATGCCCACAAGT
>Boisea_trivittata_comp15501_c0_seq2_A_78
CACCTACTTCAACGTCAGCAGACCCTGAGAAAAGGAAGCTCATCCAACAACAGCTCGTCCTCCTCCTGCATGCCCACAAGTGTCAAAGAAGAGAAACACAATCCAACGGTGAAAATATAC


The corresponding headers in the original.fasta file have the following example formats (these don't match with above examples):

>OFAS000119-RA-EXON01 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000119-RA-EXON01,probes-probe:,probes-source:Alydus_pilosus_comp16418_c0_seq1
>OFAS000280-RA-EXON05 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000280-RA-EXON05,probes-probe:,probes-source:Boisea_trivittata_comp12981_c0_seq1
>OFAS000975-RA-EXON05 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS000975-RA-EXON05,probes-probe:,probes-source:Anoplocnemis_curvipes_gi_512414378_gb_GAJV01001942.1
>OFAS001601-RA-EXON05 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS001601-RA-EXON05,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512430460_gb_GAJX01004174.1

0
Entering edit mode

You could use a regex (regular expression) that tests for _A_n or _B_n at the end of the header string, where n is any number made of one or more digits.

If the regex test finds a match, then we don't process that header.

The pattern of your headers look like they could be matched with the expression _[AB]_[0-9]+$. To break down this regex: 1) We start with an _ underscore 2) [AB] is a match with either A or B 3) Another _ underscore 4) [0-9]+ is a match with one or more (+) digits 0 through 9 5) $ anchors the match to the end of the search string — in this case, the end of the company header

Here's some untested code:

#!/usr/bin/env python

import sys
import re

pattern = '_[AB]_[0-9]+$' header_that_should_not_match = 'Clavigralla_tomentosicollis_gi_512427358_gb_GAJX01007276.1_0_rc' header_that_should_match = 'Boisea_trivittata_comp15501_c0_seq2_A_78' if re.search(pattern, header_that_should_not_match): sys.stderr.write("1 - Something went wrong!\n") else: sys.stdout.write("1 - Correct!\n") if re.search(pattern, header_that_should_match): sys.stdout.write("2 - Correct!\n") else: sys.stderr.write("2 - Something went wrong!\n")  I can't really copy and paste what I did before, because your headers have changed. But basically you only need to add the import re statement, the pattern, and incorporate the if re.search(...) test into the block of code where you are testing company header matches. If the test passes, then skip over reformatting the header and just print the line, e.g. something like: ... with open(company_fn, "r") as company_fh: for line in company_fh: if line.startswith('>') and not re.search(pattern, line.strip()): # reformat header here # ... else: sys.stdout.write("%s" % (line)) ...  ADD REPLY 0 Entering edit mode Thanks for the info. I've modified the regex pattern to also skip over other headers that shouldn't be modified, e.g., like the ones in the original posting (they are all in one file). If I tested correctly, the new pattern is (uce | ENSOFAS | _[AB]_[0-9]+$). Regarding the last part of your reply, I'm not sure I understand what is suppose to go here; I could be very wrong, but it looks like I'm suppose to insert code that formats the company.fasta file headers in a specific way where you say # reformat header here. Instead, I'm wanting to replace them with the ones in the original.fasta file. Also, the code block before the with open(company_fn, "r") as company_fh: seems to split up the headers in the original.fasta file with the | as a delimiter. My guess is that this was to help set up the search/replace between the strings before the | between files. For the new set of headers, the only thing that matches between the files is the taxon/seq IDs. I guess my second question is would I still use line.strip().split() with multiple punctuation delimiters (and how I would do that), and somehow indicate the delimited taxon name as the function to match to between company.fasta and original.fasta (I'm guessing this ultimately is something like the map[k] = v.

I was playing around and thought this might be close, but when I test it, it doens't modify the company.fasta file. Sorry if none of what I attempted makes any sense:

#!/usr/bin/env python

import sys
import re

original_fn = sys.argv[1]
company_fn = sys.argv[2]

pattern1 = '(uce | ENSOFAS | _[AB]_[0-9]+$| _[AB]_[0-9]_rc+$)'
pattern2 = '(_[0-9]+_rc$| _[0-9]+$)'
pattern3 = 'probes-source:'

map = {}

with open(original_fn, "r") as original_fh:
for line in original_fh:
if line.startswith('>'):
try:
(k, v) = line.strip().split(pattern3)
# remove trailing space from key
#v = v[:-1]
map[v] = k
except ValueError as err:
v = line.strip()
map[v] = None

with open(company_fn, "r") as company_fh:
for line in company_fh:
if line.startswith('>') and not re.search(pattern1, line.strip()):
try:
(k, v) = line.strip().split(pattern2)
# remove trailing character from key
#v = v[:-1]
except ValueError as err:
k = line.strip()
if k not in map:
sys.stdout.write("%s\n" % (k))
else:
sys.stdout.write("%s |%s\n" % (k, map[v]))
else:
sys.stdout.write("%s" % (line))

0
Entering edit mode

Your headers and tests have changed since the original question, and I'm a bit lost. Could you please post your (pre-regex, post-header-changes) script and I could look at how you might modify that to add a regex pattern search test?

0
Entering edit mode

Correct, I kept the original question simple in original post. In reality, the company.fasta file had many differently formatted headers. I used the script you had provided to process just the uce headers, which worked. I already had the ENSOFAS headers. This leaves the others to be formatted. To show this, here are the kinds of headers that are present in company.fasta after using your script for processing uce headers:

>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_103_rc
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>Anasa_tristis_comp3229_c0_seq1_0_rc
AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT
>ENSOFAS009761_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS009761,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig5129
TTAAGAATCTCGAGAAAACCCCTCAGGATGATGAATTACTTGAAATATATGCTCTCTATAAACAAGCAACTGTAGGAGACTGTGACACAAGTAAGCCTGGGATGTTTGATTTCAAAGGGA1
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_B_0
TAACGTCCTAGGTTAGGTTTCTGTTTACCAGCTAAAATCTTGAGGGCTGTAGACTTTCCAATGCCATTAGTTCCAACCAGACCTAAAACTTCTCCTGGTCTTGGAATTGGAAGTCTGTGG


I'm working with a different original.fasta, which have the following format:

>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT


You reply earlier today helps to skip those headers in the company.fasta file that have an _[AB]_[0-9]+$. I took this and inserted into your working script, as you had suggested. However, I also need to make sure it is skipping the uce and ENSOFAS headers and any _[AB]_[0-9] headers that also ended in _rc, so I added these to the regex pattern for this purpose: #!/usr/bin/env python import sys import re original_fn = sys.argv[1] company_fn = sys.argv[2] pattern = (uce | ENSOFAS | _[AB]_[0-9]+$ | _[AB]_[0-9]_rc+$) map = {} with open(original_fn, "r") as original_fh: #remainder of original code...  The code I previously posted was just an example of me playing around in hopes of stumbling upon the answer or having an epiphany, but I'm obviously not getting there. In the end, the modified company file should look like: >OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1 CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT >OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1 AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT >ENSOFAS009761_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS009761,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig5129 TTAAGAATCTCGAGAAAACCCCTCAGGATGATGAATTACTTGAAATATATGCTCTCTATAAACAAGCAACTGTAGGAGACTGTGACACAAGTAAGCCTGGGATGTTTGATTTCAAAGGGA1 >uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120 AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT >Anasa_tristis_comp8051_c0_seq1_A_0 ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA >Anasa_tristis_comp8051_c0_seq1_B_0 TAACGTCCTAGGTTAGGTTTCTGTTTACCAGCTAAAATCTTGAGGGCTGTAGACTTTCCAATGCCATTAGTTCCAACCAGACCTAAAACTTCTCCTGGTCTTGGAATTGGAAGTCTGTGG  ADD REPLY 1 Entering edit mode I'll try to take a look in a little while, but this pattern is probably not what you want: _[AB]_[0-9]_rc+$


This would find matches to

..._A_2_rcccccc
..._B_9_rcccc


But not:

..._A_6_rc
..._B_2829_rc


_[AB]_[0-9]+_rc$ In this second pattern, by moving the + after the digits, you can match a number with one or more digits. ADD REPLY 1 Entering edit mode You are correct, that was my mistake. Thanks for catching that! And I appreciate you taking a further look at this later. ADD REPLY 0 Entering edit mode I've still been tinkering with the script and feel as though I might be getting closer to the solution: #!/usr/bin/env python import sys import re original_fn = sys.argv[1] company_fn = sys.argv[2] pattern = '(uce.+$|ENSOFAS.+$|[AB]_[0-9]+$)'

map = {}

with open(original_fn, "r") as original_fh:
for line in original_fh:
if line.startswith('>'):
try:
(k, v) = line.strip().rsplit(':',1)
# remove trailing space from key
#k = k[:-1]
map[k] = v
#print k
#print v
#print map[k]
except ValueError as err:
k = line.strip()
map[k] = None

with open(company_fn, "r") as company_fh:
for line in company_fh:
if line.startswith('>') and not re.search(pattern, line.strip()):
try:
line=line.strip('>')
(v, k) = line.strip().rsplit('_',1)
# remove trailing character from key
#k = k[:-1]
#print k
#print v
except ValueError as err:
k = line.strip()
if v not in map:
sys.stdout.write("%s\n" % (v))
else:
sys.stdout.write("%s |%s\n" % (v, map[k]))
else:
sys.stdout.write("%s" % (line))


Currently, this will modify the targeted headers, but what it does is delete the > (because I line.striped it in order to get the taxon/seq IDs as the key) and the last underscore and anything beyond it. It doesn't replace the header with the corresponding original.fasta header yet.

0
Entering edit mode
4.1 years ago
5heikki 10k

HEADER<TAB>SEQUENCE


format and use join.

edit. this was assuming that the sequences were the same. I guess that's not the case.

0
Entering edit mode

Correct, sequences were reverse complemented by the company as well in many cases.

0
Entering edit mode
4.1 years ago

Read the sequence header of the original file into a hash table (also called an associative array or dictionary), and process the company's result based on that table.

Untested, but I think this should work or at least get you close. This assumes your inputs look like your example, and that your FASTA files are single-line (i.e. header on one line, sequence on the next line, not split across lines).

#!/usr/bin/env python

import sys

original_fn = sys.argv[1]
company_fn = sys.argv[2]

map = {}

with open(original_fn, "r") as original_fh:
for line in original_fh:
if line.startswith('>'):
(k, v) = line.strip().split('|')
# remove trailing space from key
k = k[:-1]
map[k] = v

with open(company_fn, "r") as company_fh:
skip = False
for line in company_fh:
if line.startswith('>'):
(k, v) = line.strip().split('|')
# remove trailing character from key
k = k[:-1]
if k not in map:
skip = True
else:
sys.stdout.write("%s |%s\n" % (k, map[k]))
else:
if not skip:
sys.stdout.write("%s" % (line))
skip = False


Then you'd run it something like so:

then
grep -B1 -w -e "${line}" original.fasta ; fi } done  # If the company has indeed reverse-complemented the sequences #!/bin/bash cat company.fasta | while read L; do echo$L; read L; echo "$L" | rev | tr "ATGC" "TACG" ; done > company_RevComp.fasta paste company_RevComp.fasta | while read line do { if [[ "${line}" != ">"* ]] ;
then
grep -B1 -w -e "${line}" original.fasta ; fi } done  NB - the reverse-complement command is Pierre's: fasta - reverse complement sequence ADD COMMENT 0 Entering edit mode The idea is to not alter the sequences in the company's file itself, but just the headers of these sequences, some of which have _rc at the end (to indicate it is reverse complement form what I sent), while others do not. ADD REPLY 0 Entering edit mode Absolutely no problem mate. I presume that you're working with immunological clonal sequences. Try this [original.fasta is your original data; company.fasta is the company's data]: #Version 1 #!/bin/bash cat original.fasta | while read L; do echo$L; read L; echo "$L" | rev | tr "ATGC" "TACG" ; done > original_RevComp.fasta ; paste company.fasta | while read line do { if [[ "${line}" != ">"* ]] ;
then
grep -B1 -w -e "${line}" original.fasta original_RevComp.fasta | sed 's/^[a-zA-Z0-9_]*.fasta[-:]//g' ; fi } done ####################################### ####################################### #Version 2 #!/bin/bash cat original.fasta | while read L; do echo$L; read L; echo "$L" | rev | tr "ATGC" "TACG" ; done > original_RevComp.fasta ; paste company.fasta | while read line do { if [[ "${line}" != ">"* ]] ;
then
grep -B1 -w -e "${line}" original.fasta ; grep -B1 -w -e "${line}" original_RevComp.fasta ;
fi
}
done

0
Entering edit mode

I've tried executing Version 1 it has done something interesting. It produced the original_RevComp.fasta, but it has outputted the following:

>uce-5419_p14 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-5419,probes-probe:14,probes-source:pacven1,probes-global-chromo:Scaffold2635,probes-global-start:22761,probes-global-end:22881,probes-local-start:40,probes-local-end:160
> TCAGATGTTGGGAGACCCATTTCTTTTTGTCGTTGGTCATACATCATTTTCTCGACAAGA
> CGAACAGAAAAAACAAAGTGTTCTGCAAAAGTTTATGGAGAGCCATCCCGAAATGGACTT
> 021:dne-lacol-seborp,0:trats-lacol-seborp,092491:dne-labolg-seborp,071491:trats-labolg-seborp,987265LC:omorhc-labolg-seborp,1corpohr:ecruos-seborp,51:eborp-seborp,9145-ecu:sucol-seborp,htolcriaf:rengised,1v-aretpimeh:ngised|
> 51p_9145-ecu>
> AAATCAATCCGGAACCTTCAAAACTCTCAGATCTTGACGGAGAAACCCGTGGGATGGTTG
> GTTTTTCATCCGATGTTGGTAGTCCTAGTTCTTTTTGTCTCTGATCATACATCATCTTTT
> >uce-5419_p16 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-5419,probes-probe:16,probes-source:rhoproc1,probes-global-chromo:GL562789,probes-global-start:194210,probes-global-end:194330,probes-local-start:40,probes-local-end:160
> AGTCCTAGTTCTTTTTGTCTCTGATCATACATCATCTTTTCAACCATCCCACGGGTTTCT
> ACCAACATCGGATGAAAAACAAAAAAGAGATATGCTGGCAAAGTATGTGCTTTAAGTTTT


In every other line, it outputs the header backwards. For the ones in which the header is backwards, the sequence isn't reverse complemented.

The script then seems stuck after completing original_RevComp.fasta with no output or changes observed to the company file.

*Note: just spotted the potential issue, original file had the 120 bps split to 60 bp in two lines instead of having it all on one.

0
Entering edit mode

Yes, I was worried that some of your sequences were going to be split onto multiple lines.

Execute this on original.fasta before running my script, and save it as a new input FASTA:

awk '/^>/{print "\n"$0}; !/^>/{printf$0}' original.fasta | sed '/^$/d'  That will put all sequences onto single lines. ADD REPLY 0 Entering edit mode I did get the sequences onto single lines, separate from the headers and this fixed the problem. Now, the commands are printing the replaced headers to screen. I tried inserting -i in the sed command, but this gives an error -i may not be used with stdin. I was under the impression that -i modifies the file rather than producing new output or printing to screen, but I guess I'm wrong? What's going to screen is definitely correct, just need it to modify the file. ADD REPLY 0 Entering edit mode I cannot see your screen, so, can't see exactly what's happening. The better version of my code to run is probably Version 2. The code above should change the header's in the company's file with the original header(?). With sed, you cannot use -i in redirect command. -i is used when you wan to edit a file in place without redirection. e.g. sed -i 's/a/b/g' file.text will convert all 'a' to 'b' and modify the file in place ADD REPLY 0 Entering edit mode @Kevin is likely using GNU sed. mforthman : Are you using macOS? Default sed in macOS is different. You could install the GNU version easily though. ADD REPLY 0 Entering edit mode Yes, I use Ubuntu 14.04. To OP, I would save a new input file for my scripts such as: awk '/^>/{print "\n"$0}; !/^>/{printf $0}' original.fasta | sed '/^$/d' > original.new.fasta


...and then modify the filenames in my script by changing original.fasta to original.new.fasta

0
Entering edit mode

Yes, I'm using macOS. I have and am using gnu-sed version. I have been able to get the sequences all on one line and generate original_RevComp.fasta without issues. It's the next step that is not modifying the company.fasta file (or any file). Both version 1 and 2, the headers are printed to Terminal's screen rather than changing them in the company.fasta file. I can't copy and paste from Terminal into the file because the company.fasta file also includes other (but comparably fewer) sequences that I need to reformat the original headers before replacing the company headers with it (this is from combining different datasets from different sources). These other sequences don't start with 'uce-" or follow any similar format as the ones I'm currently targeting. An example of what gets printed to Terminal's screen:

>uce-5416_p12 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-5416,probes-probe:12,probes-source:oncfas1,probes-global-chromo:Scaffold1485,probes-global-start:20990,probes-global-end:21110,probes-local-start:40,probes-local-end:160
AAAATGTTTTGTGAAGATTAAGGTGCGATACATCTGACCCATAATGACCTGGGCAGCAAAGCCCTACTTGAGCATTCATATTTGCTCCATCCAAATAAACCTTGATAAAAATATCAAATA
>uce-5419_p11 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-5419,probes-probe:11,probes-source:oncfas1,probes-global-chromo:Scaffold51,probes-global-start:350828,probes-global-end:350948,probes-local-start:0,probes-local-end:120
TAAGATCAATCCAGAGCCATCCAAGCTGTCAGACCTCGATGGAGAAACTCGTGGGTTAGTGGAAAAGATGATGTATGATCAAAGACAACGAGAACTGGGCCTTCCAACCTCTGATGAACA
>uce-5419_p12 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-5419,probes-probe:12,probes-source:oncfas1,probes-global-chromo:Scaffold51,probes-global-start:350868,probes-global-end:350988,probes-local-start:40,probes-local-end:160
GGAGAAACTCGTGGGTTAGTGGAAAAGATGATGTATGATCAAAGACAACGAGAACTGGGCCTTCCAACCTCTGATGAACAAAAGAAAAAGGACATGCTACAGAAGTAAGTTTACAGAAAA
Owners-MacBook-Air:MyBaits_prelim_probe_kit_order Forthman$ ADD REPLY 0 Entering edit mode Yes, I programmed it to output to screen. I recommend running the script (either version 1 or version 2) and re-direct the output to a new FASTA file, like this: ./script.sh > company.new.fasta  ADD REPLY 0 Entering edit mode Here is an updated solution that does everything (puts sequence data onto single lines and also does everything else). Start with original.fasta (sequences on multiple lines if you wish) and the company.fasta. Keep extensions as fasta and not fa #Version 1 #!/bin/bash #Bring sequences on single lines in the input file awk '/^>/{print "\n"$0}; !/^>/{printf $0}' original.fasta | sed '/^$/d' > original.new.fasta

#Then convert to reverse complement
cat original.new.fasta | while read L; do  echo $L; read L; echo "$L" | rev | tr "ATGC" "TACG" ; done > original_RevComp.fasta ;

paste company.fasta | while read line
do
{
if [[ "${line}" != ">"* ]] ; then grep -B1 -w -e "${line}" original.new.fasta original_RevComp.fasta | sed 's/^[a-zA-Z0-9_\\.]*.fasta[-:]//g' ;
fi
}
done

#######################################
#######################################

#Version 2
#!/bin/bash

#Bring sequences on single lines in the input file
awk '/^>/{print "\n"$0}; !/^>/{printf$0}' original.fasta | sed '/^$/d' > original.new.fasta #Then convert to reverse complement cat original.new.fasta | while read L; do echo$L; read L; echo "$L" | rev | tr "ATGC" "TACG" ; done > original_RevComp.fasta ; paste company.fasta | while read line do { if [[ "${line}" != ">"* ]] ;
then
grep -B1 -w -e "${line}" original.new.fasta ; grep -B1 -w -e "${line}" original_RevComp.fasta ;
fi
}
done