Question: Converting Files Newline With Specific Pattern
0
gravatar for Assa Yeroslaviz
6.1 years ago by
Assa Yeroslaviz1.2k
Munich
Assa Yeroslaviz1.2k wrote:

Hi,

I would like to know how to remove the newline from a certain part of my file, but not all of it.

I am piping the result of my program into sed in order to convert the file to a specific format. The input file looks like that:

>sctg_0002_0001  length=2745
TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCT
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
TTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
>sctg_0003_0001  length=2175
CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
...

This is what I need to do:

  • convert the ">" symbol into the string "SEQUENCE_ID="

  • remove everything after the double spaces in the header, e.g. " length=2745"

  • add part of the next header into the actual one - this should looks like that sctg00020001e00030001b

    • 0002_0001 is part of the first header
    • 0003_0001 is part of the second header.
  • delete the newline symbols from the sequence itself as if to make the fastA in one single line.

  • add the string "SEQUENCE_TEMPLATE=" to the sequence line.

  • add the symbol "=" after each sequence line

This is what I have done so far

perl convert_FastA.pl ScaffoldContigs.fasta | sed -e '/^>/  s/>/SEQUENCE_ID=/'  | sed -e ':a;N;$!ba;/^SEQUENCE_ID=/ !  s/\n//'

the results of the first part looks like the sample above. rst sed command replace the ">" with the pattern needed.

At the end it should look like that:

SEQUENCE_ID=sctg_0002_0001e_0003_0001b
SEQUENCE_TEMPLATE=CCCCCTCCCGTACCGGTTTGCGCTA...
=
SEQUENCE_ID=sctg_0003_0001e_0001_0001b
SEQUENCE_TEMPLATE=CAACAACCACTCTTAGCGCTGCTTG... 
=

I tried to delete the newline with sed, but it didn't work how I imgined it. It delete either all of them or none. Besides I couldn't find any way to "save" the next line in order to put it in the header of the sequence before that.

I would appreciate any help I can get.

Thanks, Assa

ADD COMMENTlink modified 6.1 years ago by Kenosis1.2k • written 6.1 years ago by Assa Yeroslaviz1.2k
1

I don't completely understand the part where you are adding the next header into the current one. So you just want to add the next header in the current header separated by the letter 'e' and append a letter 'b' to the end?

It might be better to just write a script for something complicated like this.

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Damian Kao15k
1

The title and tags associated with your post do not cover the load at all. For future reference, a better use of tags and titles helps to get the right people looking at your question :)

ADD REPLYlink written 6.1 years ago by Whetting1.5k

Your specs say, add part of the next header into the actual one. What do you do when there is no next header, i.e., you're processing header n?

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Kenosis1.2k

the last one takes the second part of the first header, but I can probably do it manually, if not available by script.

ADD REPLYlink written 6.1 years ago by Assa Yeroslaviz1.2k
2
gravatar for Kenosis
6.1 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

Here's a Perl option that processes your fasta records in one pass and uses Bio::SeqIO:

use strict;
use warnings;
use v5.10;
use Bio::SeqIO;

my $in = Bio::SeqIO->new( -fh => \*ARGV, -format => 'Fasta' );
my ( @last, $firstHeader );

while ( my $seq = $in->next_seq() ) {
    if ( !@last ) {
        @last = ( $seq->id, $seq->seq );
        next;
    }

    my $currHeader = $seq->id;
    $currHeader =~ s/sctg/e/;
    $firstHeader //= $currHeader;

    print formatRec( $currHeader, @last );
    @last = ( $seq->id, $seq->seq );
}

print formatRec( $firstHeader, @last );

sub formatRec {
    my ( $currHeader, $secID, $secSec ) = @_;
    return "SEQUENCE_ID=$secID$currHeader"
      . "b\nSEQUENCE_TEMPLATE=$secSec\n=\n";
}

Usage: perl script.pl fastaFile [>outFile]

Sample output on your data (no newlines in the sequences):

SEQUENCE_ID=sctg_0002_0001e_0003_0001b
SEQUENCE_TEMPLATE=TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
=
SEQUENCE_ID=sctg_0003_0001e_0004_0001b
SEQUENCE_TEMPLATE=CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
=
SEQUENCE_ID=sctg_0004_0001e_0003_0001b
SEQUENCE_TEMPLATE=TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
=

The script processes the file passed to it on the command line, passing its handle for the creation of a Bio::SeqIO object. Output is directed to STDOUT, so you can view the results on your screen or have it sent to a file by using the optional last parameter.

It saves the last record for printing. The first conditional statement checks if there is a last record. If not, it initializes the array that hold it and loops again. If a last record exists, then a substitution is done on the current header. The first header is stored for use later with the very last record.

A record is printed, the current record becomes the last one, and the next record is retrieved. The very last record is printed outside the while loop.

Hope this helps!

Edit:

Was inspired by Whetting's Python script and wanted to give such a try. The following Python script produces the same results as the Perl script:

import sys
import re
from Bio import SeqIO

handle = open(sys.argv[1], 'rU')
recs = [[rec.id, rec.seq] for rec in SeqIO.parse(handle, 'fasta')]
handle.close()

firstHeader = recs[1][0]

for i in range(0, len(recs)):
    addID = recs[i + 1][0] if i < len(recs) - 1 else firstHeader
    recs[i][0] += re.sub(r'sctg', r'e', addID)
    print 'SEQUENCE_ID=' + recs[i][0] + 'b\nSEQUENCE_TEMPLATE=' \
        + recs[i][1] + '\n='

Usage: python script.py fastaFile [>outFile]

Instead of a single pass, the above first builds a list comprehension of IDs and sequences. The header used in the last record is first saved in firstHeader. The records are iterated over and a ternary expression is used to determine which ID to add to the current record, as the last record gets firstHeader. Next a substitution on the ID is done and the results are concatenated to the current record's ID. Finally, the record is printed.

ADD COMMENTlink modified 6.1 years ago • written 6.1 years ago by Kenosis1.2k

Passing a filehandle to seqio is very useful :). I recommend adding use v5.10; to the top of your script because the defined-or operator that you use (i.e., //=) wasn't available until Perl 5.10. Alternatively, you could use the old conditional operator, but I would use the same syntax you have and just check that it's available.

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by SES8.2k

Good suggestion about adding that pragma. Have done so...

ADD REPLYlink written 6.1 years ago by Kenosis1.2k

Thanks a lot for the script. I have written something on my own, which does at least the converting steps of the analysis. But this one is way better. I didn't know how to save the objects for later use, but now I do. :-) I will try it and let you know how it went.

ADD REPLYlink written 6.1 years ago by Assa Yeroslaviz1.2k

Hi,

this line gives me trouble: my $currHeader = $seq->id =~ s/sctg/e/r; "syntax error at convertFastA2primer3format.pl line 15, near "s/sctg/e/r" Execution of convertFastA2primer3format.pl aborted due to compilation errors." => waht does r do?

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Assa Yeroslaviz1.2k

You're most welcome!

The r modifier returns a copy of the string. This was implemented in v5.14, so you're likely experiencing a backwards-compatibility issue. Have updated that substitution segment in the above script, removing that modifier.

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Kenosis1.2k

thanks. It works splendid.

ADD REPLYlink written 6.1 years ago by Assa Yeroslaviz1.2k

Glad it worked for you!

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Kenosis1.2k
1
gravatar for Whetting
6.1 years ago by
Whetting1.5k
Bethesda, MD
Whetting1.5k wrote:

Hi,
I do not know perl well enough to do this. however, I have some basic knowledge of python.
This script does what you want...

from Bio import SeqIO
out=open("out.txt","a")
ID=[]
for record in SeqIO.parse("input.fas","fasta"):
    ID.append("_".joinstrrecord.id).rsplit("_")[1:]))

n=1
for record in SeqIO.parse("input.fas","fasta"):
    if n==len(ID):
        #print >>out, n
        print >>out, "SEQUENCE_ID="+record.id+"e_"+ID[0]+"b"
        print >>out, "SEQUENCE_TEMPLATE ="+record.seq
        print >>out, "="
        n=n+1
        break
    else:
        #print >>out, n
        print >>out, "SEQUENCE_ID="+record.id+"e_"+ID[n]+"b"
        print >>out, "SEQUENCE_TEMPLATE ="+record.seq
        print >>out, "="
        n=n+1
out.close()

It will take your input in the file "input.fas"

and it returns:

SEQUENCE_ID=sctg_0002_0001e_0003_0001b
SEQUENCE_TEMPLATE =TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
=
SEQUENCE_ID=sctg_0003_0001e_0004_0001b
SEQUENCE_TEMPLATE =CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
=
SEQUENCE_ID=sctg_0004_0001e_0005_0001b
SEQUENCE_TEMPLATE =CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
=
SEQUENCE_ID=sctg_0005_0001e_0002_0001b
SEQUENCE_TEMPLATE =CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
=

ps: The new lines are gone. sequence is being split because I copy-pasted from terminal.

ADD COMMENTlink modified 6.1 years ago • written 6.1 years ago by Whetting1.5k

thanks, I will try it. I don't have any preferences. I choose Perl by chance :-) It is not a big file and it could have bin done manually, but I wanted to know for the future, if it is possible to automate.

ADD REPLYlink written 6.1 years ago by Assa Yeroslaviz1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1383 users visited in the last hour