Question

Extracting accession number from header using sed

0

6.6 years ago

ToastedGoat ▴ 10

Hello! I'm trying to figure out how to extract the accession numbers from the headers. (about 120 headers) I have to use sed and can't seem to figure it out. Here is a sample of what my file looks like:

>Ref.49_cpx.GM.03.N26677.HQ385479 

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC

I need the part after the last period in the header. So the "HQ385479" part. Thanks in advance for the help!

sed accession number • 3.6k views

ADD COMMENT • link updated 6.6 years ago by GenoMax 141k • written 6.6 years ago by ToastedGoat ▴ 10

0

cut -d '.' -f6 input.txt

ADD REPLY • link 6.6 years ago by Pierre Lindenbaum 161k
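Two caveats with cut, sketched below on a made-up two-record file (the name demo.fa and the temporary directory are assumptions for illustration). Without -s, cut passes through lines that contain no delimiter, so the sequence lines would be printed too; and -f6 assumes every header has exactly five periods before the accession. The rev-based variant assumes the rev utility (from util-linux on Linux) is installed.

```shell
# Build a small sample file with the same layout as the question.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>Ref.49_cpx.GM.03.N26677.HQ385479' \
  'ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG' \
  '>Ref.49_cpx.GM.03.N26677.HQ385478' \
  'G------TGG---------------------------AGAGGGGGTCTC' > "$tmpdir/demo.fa"

# -s suppresses lines that contain no "." (here, the sequence lines).
cut_fixed=$(cut -s -d '.' -f6 "$tmpdir/demo.fa")

# Reversing each line turns "last field" into "first field", so the
# number of periods in the header no longer matters.
cut_last=$(grep '^>' "$tmpdir/demo.fa" | rev | cut -d '.' -f1 | rev)

printf '%s\n' "$cut_fixed"
printf '%s\n' "$cut_last"
```

Both variables should hold one accession per line for this sample.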

0

I need the part after the last period in the header.

Do you want to keep the rest of the alignments intact? I assume so but please clarify.

Edit: Looks like you want to keep just the accessions based on a response below.

If all header lines start with >Ref, you could do:

grep "^>Ref" input.txt | sed 's/^.*\.//g' > accession

ADD REPLY • link 6.6 years ago by GenoMax 141k
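For anyone wondering why this pattern lands on the last period rather than the first: sed's .* is greedy, so it consumes as much of the line as it can before \. has to match. A minimal sketch using the header from the question (the variable names are made up for illustration):

```shell
header='>Ref.49_cpx.GM.03.N26677.HQ385479'

# Greedy: ".*" runs as far right as it can, so "\." matches the LAST period.
last_field=$(printf '%s\n' "$header" | sed 's/^.*\.//')

# For contrast, "[^.]*" cannot cross a period, so only the text up to
# the FIRST period is removed.
after_first=$(printf '%s\n' "$header" | sed 's/^[^.]*\.//')

printf '%s\n' "$last_field"
printf '%s\n' "$after_first"
```

Because the pattern is anchored with ^, it can match at most once per line, so the trailing g flag is harmless but not needed.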

0

Ah yes completely forgot I could use grep first. Thanks.

ADD REPLY • link 6.6 years ago by ToastedGoat ▴ 10

1

It's not necessary to use grep. Input (the first sequence copied and pasted as a second sequence, with the ID changed at the end):

>Ref.49_cpx.GM.03.N26677.HQ385479 
ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC
>Ref.49_cpx.GM.03.N26677.HQ385478
ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC

output:

$ sed -e  '/>/!d; s/.*\.//g' test.fa 
HQ385479 
HQ385478

ADD REPLY • link 6.6 years ago by cpad0112 21k
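Note that the first header in the sample ends with a space, and that space carries through to the output. A hedged variant that also trims trailing whitespace, using -n with an explicit p instead of deleting non-header lines (the file name demo.fa is made up for illustration):

```shell
# Sample file; the first header deliberately ends with a space.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>Ref.49_cpx.GM.03.N26677.HQ385479 ' \
  'ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG' \
  '>Ref.49_cpx.GM.03.N26677.HQ385478' \
  'G------TGG---------------------------AGAGGGGGTCTC' > "$tmpdir/demo.fa"

# -n: print nothing by default.
# /^>/{...}: act only on header lines.
#   s/.*\.//          drop everything up to the last period
#   s/[[:space:]]*$// drop trailing whitespace
#   p                 print the result
accessions=$(sed -n '/^>/{s/.*\.//;s/[[:space:]]*$//;p;}' "$tmpdir/demo.fa")

printf '%s\n' "$accessions"
```

Trimming matters if the list is later used for lookups or joins, where "HQ385479 " and "HQ385479" would not compare equal.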

0

Thank you so much for this!

ADD REPLY • link 6.6 years ago by ToastedGoat ▴ 10

GenoMax · Answer 1 · 2017-09-05

1

6.6 years ago

Joe 21k

Since the OP has specifically requested sed:

$ sed -i 's/^.*\.//g' input.txt

Note that -i edits input.txt in place, so nothing is printed; the file then contains:

HQ385479

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC

ADD COMMENT • link 6.6 years ago by Joe 21k
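Since -i overwrites the original file, it may be worth keeping a backup. A small sketch, assuming a throwaway copy of the data in a temporary directory; the attached-suffix form -i.bak is accepted by both GNU and BSD sed:

```shell
# Work on a throwaway copy so nothing real is overwritten.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>Ref.49_cpx.GM.03.N26677.HQ385479' \
  'ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG' > "$tmpdir/input.txt"

# Edit in place, saving the untouched original as input.txt.bak.
sed -i.bak 's/^.*\.//' "$tmpdir/input.txt"

edited=$(cat "$tmpdir/input.txt")
backup_first=$(head -n 1 "$tmpdir/input.txt.bak")

printf '%s\n' "$edited"
printf '%s\n' "$backup_first"
```

The sequence line is left alone here only because it contains no period; the header in the backup file is unchanged.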

0

Thanks! Is there a way within the sed command to remove all the nucleotide sequences as well so I'm just left with all the accession numbers? This is my first time doing any bioinformatics and I am still learning the whole programming/coding side of it all.

ADD REPLY • link 6.6 years ago by ToastedGoat ▴ 10

0

You can use this:

awk -F. 'NF>1{print $NF}' input.txt > output.txt

ADD REPLY • link updated 6.6 years ago by GenoMax 141k • written 6.6 years ago by bk11 ★ 2.3k
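A note on how this works: -F. splits each line on periods, NF>1 selects lines that contain at least one period, and $NF is the last field. That selects headers only as long as no sequence line contains a period. The sketch below uses a hypothetical sequence line with "." characters (some alignment formats use them) to show the difference from matching on the leading ">" instead:

```shell
# Sample file; the sequence line with dots is invented for illustration.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>Ref.49_cpx.GM.03.N26677.HQ385479' \
  'ATGAGAGTG....ATTTG' > "$tmpdir/demo.fa"

# Selects any line with a period, including the dotted sequence line.
by_nf=$(awk -F. 'NF>1{print $NF}' "$tmpdir/demo.fa")

# Selects only lines that start with ">".
by_header=$(awk -F. '/^>/{print $NF}' "$tmpdir/demo.fa")

printf '%s\n' "$by_nf"
printf '%s\n' "$by_header"
```

For the file in the question, which uses "-" for gaps, both forms should give the same list.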

1

For future reference: highlight the text you want to format as code and then click on the "101" button in the edit window to apply the formatting.

ADD REPLY • link 6.6 years ago by GenoMax 141k

0

Is your file FASTA-formatted (header lines beginning with >)? Or is it exactly as you posted above?

ADD REPLY • link 6.6 years ago by Joe 21k

0

It begins with >. It didn't copy in correctly.

ADD REPLY • link 6.6 years ago by ToastedGoat ▴ 10

1

I would just chain it to grep personally, but now the solution is getting a bit less elegant.

cat input.txt | grep ">" | sed 's/^.*\.//g'

ADD REPLY • link 6.6 years ago by Joe 21k

0

A very minor change to the code: please add > in the replacement, so that the sequence is still in FASTA format.

$ sed -e 's/^.*\./>/g' test1.fa 
>HQ385479 

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

The OP said he doesn't want the sequence, just the accession itself (I'm assuming they're making a list for a table or similar), so there's no need to sub in the ">".

ADD REPLY • link 6.6 years ago by Joe 21k

0

Okay, I didn't read the OP in full :)

ADD REPLY • link 6.6 years ago by cpad0112 21k

GenoMax · Answer 2 · 2017-09-05

0

6.6 years ago

bk11 ★ 2.3k

 awk -F. 'NF>1{print $NF}' input.txt

ADD COMMENT • link updated 6.6 years ago by GenoMax 141k • written 6.6 years ago by bk11 ★ 2.3k