Question: how to split a big embl files based on IDs
1
gravatar for Mehmet
18 months ago by
Mehmet460
Japan
Mehmet460 wrote:

Dear All,

I have a big embl file that has over 1000 IDs, and I want to split it based on IDs. Each ID should be a separate file containing all information of related ID, and should have the ID name.

How to do that?

sequence assembly gene • 711 views
ADD COMMENTlink modified 18 months ago by cpad011211k • written 18 months ago by Mehmet460

Split one embl file into several

ADD REPLYlink written 18 months ago by lessismore610

I know this. What I want is to split big embl file and write output of each ID in a output file based on ID.

ADD REPLYlink written 18 months ago by Mehmet460
2
gravatar for Alex Reynolds
18 months ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

One option:

$ awk -v RS="ID " -v FS="\n" {'n = split($1, a, ";"); sub(/ /, "_", a[1]); a[1] = a[1]".embl"; outFn = a[1]; record = "ID "$0; printf("%s", record) > outFn;'} ./in.embl

Given test input:

$ more test.embl
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
   aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
   tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120 
   ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
   tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
ID AA03519 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
   aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
   tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120 
   ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
   tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
ID AA03520 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
   aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
   tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120 
   ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
   tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
ID AA03521 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
   aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
   tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120 
   ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
   tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//

Run the script:

$ awk -v RS="ID " -v FS="\n" {'n = split($1, a, ";"); sub(/ /, "_", a[1]); a[1] = a[1]".embl"; outFn = a[1]; record = "ID "$0; printf("%s", record) > outFn;'} ./test.embl

This creates four files, one for each record, named by the record's ID:

$ ls -al AA*.embl
-rw-r--r--  1 areynolds  wheel   519 Oct 26 15:01 AA03518_standard.embl
-rw-r--r--  1 areynolds  wheel   519 Oct 26 15:01 AA03519_standard.embl
-rw-r--r--  1 areynolds  wheel   519 Oct 26 15:01 AA03520_standard.embl
-rw-r--r--  1 areynolds  wheel   519 Oct 26 15:01 AA03521_standard.embl

For example:

$ more AA03520_standard.embl 
ID AA03520 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
   aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
   tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120 
   ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
   tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//

If IDs are not uniquely namable, then this could cause previous results to get overwritten. It may be safer to use >> instead of > within the awk statement.

ADD COMMENTlink modified 18 months ago • written 18 months ago by Alex Reynolds28k

the script "$ awk -v RS="ID... is exactly what I need. I don't know how many ID is found in my embl file. So, how can I split each ID using the script?

ADD REPLYlink written 18 months ago by Mehmet460
1

This awk script will create how ever many files as there are ID lines, at most. See the use of redirection operator > inside the awk script. Each record is written to its own file (so long as the record ID is unique).

If you want to count ahead of time how many ID files you should expect to come out of this script, you could run grep /$ID/ in.embl | wc -l.

The grep statement returns lines that start with ID, and wc -l counts them.

ADD REPLYlink written 18 months ago by Alex Reynolds28k

Thank you very much. I generated, but each embl file stars with _ symbol. how can I generate embl file without starting _ symbol ?

ADD REPLYlink written 18 months ago by Mehmet460

It would help to look at your EMBL-formatted file and see what changes are needed to the awk script.

Can you post the first 10 ID lines to your question? You could run something like:

$ grep /$ID/ in.embl | head -10 > firstTenIDs.txt

Then edit your question and add the contents of firstTenIDs.txt to it, so that we can see what you're working with.

ADD REPLYlink written 18 months ago by Alex Reynolds28k

This is one of embl files produced after the script.

ID scaffold00001 scaffold00001; ; ; DNA; ; UNC; 6279 BP. XX AC scaffold00001; XX DE scaffold00001 XX OS . OC . XX FH Key Location/Qualifiers FT CDS join(371..871,936..953) FT /source="EVM" FT /locus_tag="species1_s00001.1" FT CDS join(2234..2326,2407..2652,2880..3080) FT /source="EVM" FT /locus_tag="species1_s00001.2" FT CDS complement(join(5766..5891,5280..5603,4956..5138)) FT /source="EVM" FT /locus_tag="species1_s00001.3" SQ Sequence 6279 BP; 1875 A; 1232 C; 1136 G; 1736 T; 300 other; TATTCGTATt cAGAAGAAGA GCAGCTGAGT AGTCATCAAA GCGAAGAAAT AAAGTTGCCG 60 GCCCAAGTAG TATAGTGGTT AGTATTCCAC GTTGTGGCCG TGGCGACACA GGTTCGATTC 120 CTGTCTTGGG CAAGTGTtCT TTTTtGAGTT ACAgTtCGGG GGgAACTAAC CcAGAaTTGT 180 TTATTGAGGA TGGTCGTATT TtACTAtACT TTTTTtAAAA GACCGCGAAA ATTTATTTTC 240 AAATCTGTTG CAAAATTTTA TGAGATAACT TTTTCCAATT TCTTTtGGGG TTTTCAGCCC 300

  • List item

ADD REPLYlink modified 18 months ago • written 18 months ago by Mehmet460

But I would like to have as this one:

ID scaffold00001; SV 1; linear; unassigned DNA; STD; UNC; 6279 BP. XX FH Key Location/Qualifiers FH FT CDS 371..871 FT CDS 936..953 FT CDS 371..953 FT CDS 2234..2326 FT CDS 2407..2652 FT CDS 2880..3080 FT CDS complement(4956..5138) FT CDS complement(5280..5603) FT CDS complement(5766..5891) FT CDS 2234..3080 FT CDS complement(4956..5891) XX SQ Sequence 6279 BP; 1875 A; 1232 C; 1136 G; 1736 T; 300 other; TATTCGTATt cAGAAGAAGA GCAGCTGAGT AGTCATCAAA GCGAAGAAAT AAAGTTGCCG 60 GCCCAAGTAG TATAGTGGTT AGTATTCCAC GTTGTGGCCG TGGCGACACA GGTTCGATTC 120 CTGTCTTGGG CAAGTGTtCT TTTTtGAGTT ACAgTtCGGG GGgAACTAAC CcAGAaTTGT 180 TTATTGAGGA TGGTCGTATT TtACTAtACT TTTTTtAAAA GACCGCGAAA ATTTATTTTC 240 AAATCTGTTG CAAAATTTTA TGAGATAACT TTTTCCAATT TCTTTtGGGG TTTTCAGCCC 300

ADD REPLYlink written 18 months ago by Mehmet460

also before each output file, there are two character blanks. For instance, blank blank contig09311.embl

ADD REPLYlink written 18 months ago by Mehmet460

Sorry, I write this as a mistake. Please do not consider this file.

ADD REPLYlink written 18 months ago by Mehmet460

how to remove blank before name of each embl file? blank blank contig09311.embl

ADD REPLYlink written 18 months ago by Mehmet460

there are two letter space in from of outputted embl file. How can I delete the space?

ADD REPLYlink written 18 months ago by Mehmet460

Can you post the first 10 ID lines to your question? You could run something like:

$ grep /$ID/ in.embl | head -10 > firstTenIDs.txt

Then edit your question and add the contents of firstTenIDs.txt to it, so that we can see what you're working with.

You can add four spaces in front of each line to put the text into “code” form.

ADD REPLYlink modified 18 months ago • written 18 months ago by Alex Reynolds28k

I typed this command; grep /$ID/ in.embl | head -10 > firstTenIDs.txt but the output is empty

  contig03585.embl
ADD REPLYlink written 18 months ago by Mehmet460

Please post your file somewhere it can downloaded and I’ll take a look.

ADD REPLYlink written 18 months ago by Alex Reynolds28k

in the awk script, there is something that causes blank in from of each outputted embl file.

for instance, assume that xx in front of the .embl file are blank line. I just need to remove blanks in from of the outputted embl file. xxcontig01701.embl

ADD REPLYlink written 18 months ago by Mehmet460

sorry. When I run awk script, each outputted embl file looks like below:

_ contig00436.embl

how to remove _ in front of output files?

ADD REPLYlink written 18 months ago by Mehmet460
2
gravatar for cpad0112
18 months ago by
cpad011211k
India
cpad011211k wrote:

@OP: please post first embl record for better parsing.

Note: Before proceeding, create a test directory, copy the original embl (test.embl below) and run the script and check the output for random files.

OP embl ID line:

ID scaffold00001; SV 1; linear; unassigned DNA; STD; UNC; 6279 BP.
  

based on OP ID line,

$ grep -i "ID" test.embl |  awk '{gsub(";",""); print $2}' |  parallel "awk '/{}/,/\/\//' test.embl > {}.embl"

Above script should create multiple embl files with ID as file name.

Example embl file from post Split one embl file into several (close to OP embl IDs):

$ cat test.embl 
ID   comp0_c0_seq1; SV 1; linear; unassigned DNA; STD; UNC; 205 BP.
XX
DE   len=205 path=[1:0-135 1445:136-204]
XX
SQ   Sequence 205 BP; 64 A; 54 C; 31 G; 56 T; 0 other;
     GTATTGAACT GCAGAGCATT AAATGCTGCA ACTCAGTGCT TAGAATTCAT TAGATTCAGA        60
     GCAACGAACC CTAAATACTG AGCTGTCCCA TTAAATACTC TGCAGTTCAA TACTTAGCAT       120
     TCACCATTAA ACATAACACT TCCCGAGTTT CCACCATCCA TAAACAGCAG GCATTGTAAC       180
     CTGTAGGCTC TCTCCACGGT TACCT                                             205
//
ID   comp0_c0_seq2; SV 1; linear; unassigned DNA; STD; UNC; 205 BP.
XX
DE   len=205 path=[4094:0-135 1445:136-204]
XX
SQ   Sequence 205 BP; 59 A; 50 C; 35 G; 61 T; 0 other;
     AGAGTATTAA ATGTTGCAGT TCAGTGCTTA AAATTTATTG GATTCAGAGA ATCTTCAAAT        60
     TCAACGGACC CTAAACACTG AGCTGTCGCA TTAAATGCTC TGCAGTTCAA TGCTTAGCTT       120
     TCACCATTAA GCATAGCACT TCCCGAGTTT CCACCATCCA TAAACAGCAG GCATTGTAAC       180
     CTGTAGGCTC TCTCCACGGT TACCT                                             205
//
ID   comp1_c0_seq1; SV 1; linear; unassigned DNA; STD; UNC; 244 BP.
XX
DE   len=244 path=[3:0-88 875:89-243]
XX
SQ   Sequence 244 BP; 71 A; 51 C; 63 G; 59 T; 0 other;
     GCAGAATTTA AGGCTATGAA TCAGGAGGTT CATAATTCCT TAAGGAGGGG AGTATGATGC        60
     GGAGCATCCA CGCTCACCTC CACTCCACCG CATTGTCTTC GAGCTGTGAC AGCCAGCGCA       120
     TAATATTCAA GAGCTATTGA CAGGTGTTGA AACGCGGCAG CCTTGCATAC TATTGAAGGA       180
     CCACGTTTCA TTATTGTGAT CTATAAGAAG ACAGCTGATG CGATCATGAG GAAGGAAGAA       240
     GGCT                                                                    244
//

output:

$ ls *.embl
comp0_c0_seq1.embl  comp0_c0_seq2.embl  comp1_c0_seq1.embl  test.embl

.

$ more comp0_c0_seq1.embl 
ID   comp0_c0_seq1; SV 1; linear; unassigned DNA; STD; UNC; 205 BP.
XX
DE   len=205 path=[1:0-135 1445:136-204]
XX
SQ   Sequence 205 BP; 64 A; 54 C; 31 G; 56 T; 0 other;
     GTATTGAACT GCAGAGCATT AAATGCTGCA ACTCAGTGCT TAGAATTCAT TAGATTCAGA        60
     GCAACGAACC CTAAATACTG AGCTGTCCCA TTAAATACTC TGCAGTTCAA TACTTAGCAT       120
     TCACCATTAA ACATAACACT TCCCGAGTTT CCACCATCCA TAAACAGCAG GCATTGTAAC       180
     CTGTAGGCTC TCTCCACGGT TACCT                                             205
//
$ more comp1_c0_seq1.embl 
ID   comp1_c0_seq1; SV 1; linear; unassigned DNA; STD; UNC; 244 BP.
XX
DE   len=244 path=[3:0-88 875:89-243]
XX
SQ   Sequence 244 BP; 71 A; 51 C; 63 G; 59 T; 0 other;
     GCAGAATTTA AGGCTATGAA TCAGGAGGTT CATAATTCCT TAAGGAGGGG AGTATGATGC        60
     GGAGCATCCA CGCTCACCTC CACTCCACCG CATTGTCTTC GAGCTGTGAC AGCCAGCGCA       120
     TAATATTCAA GAGCTATTGA CAGGTGTTGA AACGCGGCAG CCTTGCATAC TATTGAAGGA       180
     CCACGTTTCA TTATTGTGAT CTATAAGAAG ACAGCTGATG CGATCATGAG GAAGGAAGAA       240
     GGCT                                                                    244
//
ADD COMMENTlink modified 18 months ago • written 18 months ago by cpad011211k

Thank you very much for your help.

ADD REPLYlink written 18 months ago by Mehmet460

If this answer has helped you can "accept" it (green check mark) to provide closure to this thread.

ADD REPLYlink written 18 months ago by genomax65k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2371 users visited in the last hour