Question: How to reformat (i.e. clean) NCBI .fasta archives into a single-line .fasta with only the unique identifier before each sequence?
johnnytam100 wrote (9 months ago):

Hi, I have just downloaded the NCBI nr protein sequences from here. Opening the unzipped file, it looks like this:

>S18 [Lactococcus lactis subsp. lactis]^AATZ02303.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^APLW60021.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^AAUS70574.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^APPA66113.1 30S ribosomal protein S18 [Lactococcus lactis]^ABBC75095.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris]^AAWN66876.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]^ASPS10927.1 30S ribosomal protein S18 [Lactococcus lactis]^ARDG21709.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris]^AAXN66482.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. cremoris]^ARHJ25897.1 30S ribosomal protein S18 [Lactococcus lactis]^ARJK90210.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]
MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQ
N
>XP_642131.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^AP54670.1 RecName: Full=Calfumirin-1; Short=CAF-1^ABAA06266.1 calfumirin-1 [Dictyostelium discoideum AX2]^AEAL68086.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK
VQKLLNPDQ
>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]^AEAL68957.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYE
DFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKR
IEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM
>WP_000184067.1 MULTISPECIES: MbtH family protein [Bacillus]^ANP_844755.1 hypothetical protein BA_2373 [Bacillus anthracis str. Ames]^AYP_028470.1 hypothetical protein BAS2209 [Bacillus anthracis str. Sterne]^AYP_036475.1 balhimycin biosynthetic protein MbtH [[Bacillus thuringiensis] serovar konkukian str. 97-27]^AAAP26241.1 mbtH-like protein [Bacillus anthracis str. Ames]^AAAT31492.1 mbtH-like protein [Bacillus anthracis str. 'Ames Ancestor']^AAAT54521.1 mbtH-like protein [Bacillus anthracis str. Sterne]^AAAT62162.1 MbtH protein [[Bacillus thuringiensis] serovar konkukian str. 97-27]^AABK85418.1 mbtH-like protein [Bacillus thuringiensis str. Al Hakam]^AEDR19165.1 mbtH-like protein [Bacillus anthracis str. A0488]^AEDR87721.1 mbtH-like protein [Bacillus anthracis str. A0193]^AEDR94244.1 mbtH-like protein [Bacillus anthracis str. A0442]^AEDS97287.1 mbtH-like protein [Bacillus anthracis str. A0389]^AEDT19705.1 mbtH-like protein [Bacillus anthracis str. A0465]^AEDT69654.1 mbtH-like protein [Bacillus anthracis str. A0174]^AEDV17672.1

How could I reformat the file into a single-line .fasta (removing the ^A separators etc.) with only the unique identifier (i.e. without any additional information such as the species name) before each sequence?

>identifier_1
seq1
>identifier_2
seq2
>identifier_3
seq3

Thanks in advance!!!

Tags: bash, linux, fasta, ncbi

finswimmer wrote (9 months ago):

An awk solution:

$ awk -v RS=">" -v FS="\n" -v OFS="\n" '$0 != "" {seq = ""; split($1, name, " "); for(i=2;i<=NF;i++) {seq = seq$i}; print ">"name[1], seq}' input.fa > output.fa

fin swimmer
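
The command above keeps only the first space-delimited token of each header. This works because the entries merged into one nr header are separated by a Ctrl-A character (displayed as ^A in the question), and the first accession comes before the first space. A minimal Python sketch of that header structure, assuming the separator is the byte 0x01; the function name split_defline is hypothetical and not part of any library:

```python
def split_defline(header):
    """Split one nr header line into (accession, description) pairs.

    Assumption: merged entries are separated by Ctrl-A (byte 0x01),
    which many viewers display as ^A.
    """
    entries = header.lstrip(">").rstrip("\n").split("\x01")
    pairs = []
    for entry in entries:
        # The accession is everything before the first space in each entry.
        accession, _, description = entry.partition(" ")
        pairs.append((accession, description))
    return pairs


# Header taken from the question, with ^A written as "\x01".
header = (
    ">XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]"
    "\x01EAL68957.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]"
)
pairs = split_defline(header)
first_id = pairs[0][0]                # the identifier the awk command keeps
all_ids = [acc for acc, _ in pairs]   # every accession sharing this sequence
```

Keeping only first_id gives one identifier per sequence, as requested; all_ids shows what is discarded along with the descriptions.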

johnnytam100 replied (9 months ago): Thank you so much!!!

Anima Mundi wrote (9 months ago):

A Python solution (runs under both 2.7 and 3):

import sys

# Read the file once. For each record print ">ID", then the whole
# sequence joined onto a single line.
seq = []
with open(sys.argv[1]) as handle:
    for line in handle:
        line = line.rstrip('\r\n')
        if line.startswith('>'):
            if seq:
                print(''.join(seq))  # flush the previous record
                seq = []
            # keep only the first identifier: the text before the first space
            print(line.split(' ', 1)[0])
        elif line:
            seq.append(line)
if seq:
    print(''.join(seq))  # flush the last record

johnnytam100 replied (9 months ago): Thank you so much!!!

Jung Soh wrote (9 months ago):

A solution using the seqtk toolkit:

seqtk seq -Cl0 in.fasta > out.fasta

The -C option drops the comment (what follows the ID on the header line) and the -l option indicates the sequence line length with 0 representing a maximum of 2^32-1.
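
To make the two options concrete, here is a rough Python sketch of the same transformation. It is not seqtk's implementation, only an illustration under the assumption that the "comment" is everything after the first whitespace on the header line; the helper name reformat_fasta and its width parameter are hypothetical, and the example record is a shortened version of one from the question.

```python
def reformat_fasta(lines, width=0):
    """Yield cleaned FASTA lines: ID-only headers, sequence wrapped at `width`.

    width=0 means no wrapping, i.e. one line per sequence.
    Hypothetical helper for illustration; not seqtk's code.
    """
    def flush(chunks):
        seq = "".join(chunks)
        if not seq:
            return
        if width <= 0:
            yield seq
        else:
            for i in range(0, len(seq), width):
                yield seq[i:i + width]

    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header ends the previous record: emit its sequence first.
            for out in flush(chunks):
                yield out
            chunks = []
            # Drop the comment: keep only the text before the first whitespace.
            yield line.split(None, 1)[0]
        elif line:
            chunks.append(line)
    # Emit the sequence of the final record.
    for out in flush(chunks):
        yield out


# Shortened toy record based on a header from the question.
example = [
    ">XP_642131.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\n",
    "MASTQNIVEEVQKM\n",
    "VQKLLNPDQ\n",
]
single_line = list(reformat_fasta(example))            # like -l0
wrapped = list(reformat_fasta(example, width=10))      # like -l10
```

With width=0 the two sequence lines are joined into one; with width=10 the same residues are re-wrapped at 10 per line.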

Chirag Parsania wrote (9 months ago):

R solution

library(Biostrings)
aa_fasta_file <- Biostrings::readAAStringSet(filepath = "~/Downloads/ff.fasta")

## remove everything after first space in header 
names(aa_fasta_file) <- gsub("\\s.*" , "" , names(aa_fasta_file)) 

## inspect the result (console output shown as comments)
aa_fasta_file
##   A AAStringSet instance of length 3
##     width seq                                                                                                                                names
## [1]    81 MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQN                                                  S18
## [2]   169 MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITI...KDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ XP_642131.1
## [3]   217 MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGW...YFKFKRIEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM XP_642837.1

Biostrings::writeXStringSet(aa_fasta_file , filepath = "path/to/save/filename.fasta")