Question: How to reformat (i.e. clean) NCBI .fasta archives into a single-line .fasta with only the unique identifier before each sequence?
johnnytam10090 wrote:

Hi, I have just downloaded the NCBI nr protein sequences from here. Opening the unzipped file, it looks like this:

>S18 [Lactococcus lactis subsp. lactis]^AATZ02303.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^APLW60021.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^AAUS70574.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^APPA66113.1 30S ribosomal protein S18 [Lactococcus lactis]^ABBC75095.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris]^AAWN66876.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^ASPS10927.1 30S ribosomal protein S18 [Lactococcus lactis]^ARDG21709.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris]^AAXN66482.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. cremoris]^ARHJ25897.1 30S ribosomal protein S18 [Lactococcus lactis]^ARJK90210.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]
MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQ
N
>XP_642131.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^AP54670.1 RecName: Full=Calfumirin-1; Short=CAF-1^ABAA06266.1 calfumirin-1 [Dictyostelium discoideum AX2]^AEAL68086.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK
VQKLLNPDQ
>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]^AEAL68957.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYE
DFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKR
IEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM
>WP_000184067.1 MULTISPECIES: MbtH family protein [Bacillus]^ANP_844755.1 hypothetical protein BA_2373 [Bacillus anthracis str. Ames]^AYP_028470.1 hypothetical protein BAS2209 [Bacillus anthracis str. Sterne]^AYP_036475.1 balhimycin biosynthetic protein MbtH [[Bacillus thuringiensis] serovar konkukian str. 97-27]^AAAP26241.1 mbtH-like protein [Bacillus anthracis str. Ames]^AAAT31492.1 mbtH-like protein [Bacillus anthracis str. 'Ames Ancestor']^AAAT54521.1 mbtH-like protein [Bacillus anthracis str. Sterne]^AAAT62162.1 MbtH protein [[Bacillus thuringiensis] serovar konkukian str. 97-27]^AABK85418.1 mbtH-like protein [Bacillus thuringiensis str. Al Hakam]^AEDR19165.1 mbtH-like protein [Bacillus anthracis str. A0488]^AEDR87721.1 mbtH-like protein [Bacillus anthracis str. A0193]^AEDR94244.1 mbtH-like protein [Bacillus anthracis str. A0442]^AEDS97287.1 mbtH-like protein [Bacillus anthracis str. A0389]^AEDT19705.1 mbtH-like protein [Bacillus anthracis str. A0465]^AEDT69654.1 mbtH-like protein [Bacillus anthracis str. A0174]^AEDV17672.1

How could I reformat the file into a single-line .fasta (removing the ^A etc.) with only the unique identifier (i.e. without any additional information such as the species name) before each sequence?

>identifier_1
seq1
>identifier_2
seq2
>identifier_3
seq3

Thanks in advance!!!

bash linux fasta ncbi • 167 views
modified 28 days ago by Chirag Parsania1.2k • written 4 weeks ago by johnnytam10090
finswimmer6.9k wrote:

An awk solution:

$ awk -v RS=">" -v FS="\n" -v OFS="\n" '$0 != "" {seq = ""; split($1, name, " "); for(i=2;i<=NF;i++) {seq = seq$i}; print ">"name[1], seq}' input.fa > output.fa
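
For context on what this discards: in the nr file each header concatenates all entries that share the same sequence, separated by the Ctrl-A character (displayed as ^A, byte 0x01), and the command above keeps only the first identifier. If the other accessions are ever needed, a minimal Python sketch along these lines could list them. The function name split_defline is purely illustrative and not part of any library, and the sketch assumes every entry starts with its identifier followed by a space:

```python
def split_defline(header):
    """Split an NCBI nr header into (identifier, description) pairs.

    Assumption: entries are separated by Ctrl-A (byte 0x01) and each
    entry begins with its identifier, followed by a space.
    """
    entries = header.lstrip(">").rstrip("\n").split("\x01")
    pairs = []
    for entry in entries:
        # Everything before the first space is the identifier
        identifier, _, description = entry.partition(" ")
        pairs.append((identifier, description))
    return pairs


if __name__ == "__main__":
    example = (">XP_642837.1 hypothetical protein DDB_G0276911 "
               "[Dictyostelium discoideum AX4]\x01EAL68957.1 hypothetical "
               "protein DDB_G0276911 [Dictyostelium discoideum AX4]")
    for identifier, description in split_defline(example):
        print(identifier, description, sep="\t")
```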

fin swimmer

modified 4 weeks ago • written 4 weeks ago by finswimmer6.9k

Thank you so much!!!

written 4 weeks ago by johnnytam10090
Anima Mundi2.4k wrote:

A Python 2.7 solution:

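from __future__ import print_function  # keeps the script valid on Python 2.7 and 3
import sys

seq = []             # sequence lines collected for the current record
seen_header = False  # avoids printing an empty line before the first record

with open(sys.argv[1]) as handle:
    for line in handle:
        line = line.rstrip('\r\n')
        if line.startswith('>'):
            # Flush the previous record's sequence as a single line
            if seen_header:
                print(''.join(seq))
            seq = []
            seen_header = True
            # Keep only the identifier: everything up to the first space
            print(line.split(' ', 1)[0])
        else:
            seq.append(line)

# Flush the last record
if seen_header:
    print(''.join(seq))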
written 4 weeks ago by Anima Mundi2.4k

Thank you so much!!!

written 4 weeks ago by johnnytam10090
Jung Soh10 wrote:

A solution using the seqtk toolkit:

seqtk seq -Cl0 in.fasta > out.fasta

The -C option drops the comment (everything after the ID on the header line), and the -l option sets the number of residues per output line; 0 stands for a maximum of 2^32-1, which in practice puts each sequence on a single line.
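
To make the two options concrete, here is a rough Python sketch of the same transformation. It only imitates the behaviour described above and says nothing about how seqtk is implemented; the names clean_fasta, drop_comment and line_width are made up for illustration.

```python
import io


def _record_lines(header, seq, line_width):
    """Yield the header, then the sequence wrapped at line_width (0 = no wrap)."""
    yield header
    if line_width > 0:
        for start in range(0, len(seq), line_width):
            yield seq[start:start + line_width]
    else:
        yield seq


def clean_fasta(handle, drop_comment=True, line_width=0):
    """Yield FASTA lines, imitating the effect of `seqtk seq -C -l`.

    drop_comment: keep only the ID (text before the first whitespace).
    line_width:   residues per output line; 0 means one line per sequence.
    """
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: emit the one collected so far
            if header is not None:
                for out in _record_lines(header, "".join(chunks), line_width):
                    yield out
            header = line.split(None, 1)[0] if drop_comment else line
            chunks = []
        elif header is not None:
            chunks.append(line.strip())
    # Emit the final record
    if header is not None:
        for out in _record_lines(header, "".join(chunks), line_width):
            yield out


if __name__ == "__main__":
    sample = io.StringIO(u">id1 some description\nACGT\nAC\n>id2\nGG\n")
    print("\n".join(clean_fasta(sample)))
```

Writing it as a generator means the file is processed record by record, which matters for an input as large as nr.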

written 28 days ago by Jung Soh10
Chirag Parsania1.2k wrote:

An R solution:

library(Biostrings)
aa_fasta_file <- Biostrings::readAAStringSet(filepath = "~/Downloads/ff.fasta")

## remove everything after first space in header 
names(aa_fasta_file) <- gsub("\\s.*" , "" , names(aa_fasta_file)) 

aa_fasta_file
##   A AAStringSet instance of length 3
##     width seq                                                                                                                                names
## [1]    81 MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQN                                                  S18
## [2]   169 MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITI...KDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ XP_642131.1
## [3]   217 MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGW...YFKFKRIEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM XP_642837.1

Biostrings::writeXStringSet(aa_fasta_file , filepath = "path/to/save/filename.fasta")
written 28 days ago by Chirag Parsania1.2k