Converting Files Newline With Specific Pattern
2
0
Entering edit mode
11.1 years ago
Assa Yeroslaviz ★ 1.8k

Hi,

I would like to know how to remove the newline from a certain part of my file, but not all of it.

I am piping the result of my program into sed in order to convert the file to a specific format. The input file looks like that:

>sctg_0002_0001  length=2745
TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCT
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
TTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
>sctg_0003_0001  length=2175
CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
...

This is what I need to do:

  • convert the ">" symbol into the string "SEQUENCE_ID="

  • remove everything after the double spaces in the header, e.g. " length=2745"

  • add part of the next header into the actual one - this should looks like that sctg00020001e00030001b

    • 0002_0001 is part of the first header
    • 0003_0001 is part of the second header.
  • delete the newline symbols from the sequence itself as if to make the fastA in one single line.

  • add the string "SEQUENCE_TEMPLATE=" to the sequence line.

  • add the symbol "=" after each sequence line

This is what I have done so far

perl convert_FastA.pl ScaffoldContigs.fasta | sed -e '/^>/  s/>/SEQUENCE_ID=/'  | sed -e ':a;N;$!ba;/^SEQUENCE_ID=/ !  s/\n//'

the results of the first part looks like the sample above. rst sed command replace the ">" with the pattern needed.

At the end it should look like that:

SEQUENCE_ID=sctg_0002_0001e_0003_0001b
SEQUENCE_TEMPLATE=CCCCCTCCCGTACCGGTTTGCGCTA...
=
SEQUENCE_ID=sctg_0003_0001e_0001_0001b
SEQUENCE_TEMPLATE=CAACAACCACTCTTAGCGCTGCTTG... 
=

I tried to delete the newline with sed, but it didn't work how I imgined it. It delete either all of them or none. Besides I couldn't find any way to "save" the next line in order to put it in the header of the sequence before that.

I would appreciate any help I can get.

Thanks, Assa

awk perl conversion programming primer • 3.5k views
ADD COMMENT
1
Entering edit mode

I don't completely understand the part where you are adding the next header into the current one. So you just want to add the next header in the current header separated by the letter 'e' and append a letter 'b' to the end?

It might be better to just write a script for something complicated like this.

ADD REPLY
1
Entering edit mode

The title and tags associated with your post do not cover the load at all. For future reference, a better use of tags and titles helps to get the right people looking at your question :)

ADD REPLY
0
Entering edit mode

Your specs say, add part of the next header into the actual one. What do you do when there is no next header, i.e., you're processing header n?

ADD REPLY
0
Entering edit mode

the last one takes the second part of the first header, but I can probably do it manually, if not available by script.

ADD REPLY
2
Entering edit mode
11.1 years ago
Kenosis ★ 1.3k

Here's a Perl option that processes your fasta records in one pass and uses Bio::SeqIO:

use strict;
use warnings;
use v5.10;
use Bio::SeqIO;

my $in = Bio::SeqIO->new( -fh => \*ARGV, -format => 'Fasta' );
my ( @last, $firstHeader );

while ( my $seq = $in->next_seq() ) {
    if ( !@last ) {
        @last = ( $seq->id, $seq->seq );
        next;
    }

    my $currHeader = $seq->id;
    $currHeader =~ s/sctg/e/;
    $firstHeader //= $currHeader;

    print formatRec( $currHeader, @last );
    @last = ( $seq->id, $seq->seq );
}

print formatRec( $firstHeader, @last );

sub formatRec {
    my ( $currHeader, $secID, $secSec ) = @_;
    return "SEQUENCE_ID=$secID$currHeader"
      . "b\nSEQUENCE_TEMPLATE=$secSec\n=\n";
}

Usage: perl script.pl fastaFile [>outFile]

Sample output on your data (no newlines in the sequences):

SEQUENCE_ID=sctg_0002_0001e_0003_0001b
SEQUENCE_TEMPLATE=TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
=
SEQUENCE_ID=sctg_0003_0001e_0004_0001b
SEQUENCE_TEMPLATE=CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
=
SEQUENCE_ID=sctg_0004_0001e_0003_0001b
SEQUENCE_TEMPLATE=TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
=

The script processes the file passed to it on the command line, passing its handle for the creation of a Bio::SeqIO object. Output is directed to STDOUT, so you can view the results on your screen or have it sent to a file by using the optional last parameter.

It saves the last record for printing. The first conditional statement checks if there is a last record. If not, it initializes the array that hold it and loops again. If a last record exists, then a substitution is done on the current header. The first header is stored for use later with the very last record.

A record is printed, the current record becomes the last one, and the next record is retrieved. The very last record is printed outside the while loop.

Hope this helps!

Edit:

Was inspired by Whetting's Python script and wanted to give such a try. The following Python script produces the same results as the Perl script:

import sys
import re
from Bio import SeqIO

handle = open(sys.argv[1], 'rU')
recs = [[rec.id, rec.seq] for rec in SeqIO.parse(handle, 'fasta')]
handle.close()

firstHeader = recs[1][0]

for i in range(0, len(recs)):
    addID = recs[i + 1][0] if i < len(recs) - 1 else firstHeader
    recs[i][0] += re.sub(r'sctg', r'e', addID)
    print 'SEQUENCE_ID=' + recs[i][0] + 'b\nSEQUENCE_TEMPLATE=' \
        + recs[i][1] + '\n='

Usage: python script.py fastaFile [>outFile]

Instead of a single pass, the above first builds a list comprehension of IDs and sequences. The header used in the last record is first saved in firstHeader. The records are iterated over and a ternary expression is used to determine which ID to add to the current record, as the last record gets firstHeader. Next a substitution on the ID is done and the results are concatenated to the current record's ID. Finally, the record is printed.

ADD COMMENT
0
Entering edit mode

Passing a filehandle to seqio is very useful :). I recommend adding use v5.10; to the top of your script because the defined-or operator that you use (i.e., //=) wasn't available until Perl 5.10. Alternatively, you could use the old conditional operator, but I would use the same syntax you have and just check that it's available.

ADD REPLY
0
Entering edit mode

Good suggestion about adding that pragma. Have done so...

ADD REPLY
0
Entering edit mode

Thanks a lot for the script. I have written something on my own, which does at least the converting steps of the analysis. But this one is way better. I didn't know how to save the objects for later use, but now I do. :-) I will try it and let you know how it went.

ADD REPLY
0
Entering edit mode

Hi,

this line gives me trouble: my $currHeader = $seq->id =~ s/sctg/e/r; "syntax error at convertFastA2primer3format.pl line 15, near "s/sctg/e/r" Execution of convertFastA2primer3format.pl aborted due to compilation errors." => waht does r do?

ADD REPLY
0
Entering edit mode

You're most welcome!

The r modifier returns a copy of the string. This was implemented in v5.14, so you're likely experiencing a backwards-compatibility issue. Have updated that substitution segment in the above script, removing that modifier.

ADD REPLY
0
Entering edit mode

thanks. It works splendid.

ADD REPLY
0
Entering edit mode

Glad it worked for you!

ADD REPLY
1
Entering edit mode
11.1 years ago
Whetting ★ 1.6k

Hi,
I do not know perl well enough to do this. however, I have some basic knowledge of python.
This script does what you want...

from Bio import SeqIO
out=open("out.txt","a")
ID=[]
for record in SeqIO.parse("input.fas","fasta"):
    ID.append("_".joinstrrecord.id).rsplit("_")[1:]))

n=1
for record in SeqIO.parse("input.fas","fasta"):
    if n==len(ID):
        #print >>out, n
        print >>out, "SEQUENCE_ID="+record.id+"e_"+ID[0]+"b"
        print >>out, "SEQUENCE_TEMPLATE ="+record.seq
        print >>out, "="
        n=n+1
        break
    else:
        #print >>out, n
        print >>out, "SEQUENCE_ID="+record.id+"e_"+ID[n]+"b"
        print >>out, "SEQUENCE_TEMPLATE ="+record.seq
        print >>out, "="
        n=n+1
out.close()

It will take your input in the file "input.fas"

and it returns:

SEQUENCE_ID=sctg_0002_0001e_0003_0001b
SEQUENCE_TEMPLATE =TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
=
SEQUENCE_ID=sctg_0003_0001e_0004_0001b
SEQUENCE_TEMPLATE =CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
=
SEQUENCE_ID=sctg_0004_0001e_0005_0001b
SEQUENCE_TEMPLATE =CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
=
SEQUENCE_ID=sctg_0005_0001e_0002_0001b
SEQUENCE_TEMPLATE =CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
=

ps: The new lines are gone. sequence is being split because I copy-pasted from terminal.

ADD COMMENT
0
Entering edit mode

thanks, I will try it. I don't have any preferences. I choose Perl by chance :-) It is not a big file and it could have bin done manually, but I wanted to know for the future, if it is possible to automate.

ADD REPLY

Login before adding your answer.

Traffic: 2583 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6