Extracting accession number from header using sed
4.2 years ago
ToastedGoat ▴ 10

Hello! I'm trying to figure out how to extract the accession numbers from the headers in my file (about 120 headers). I have to use sed and can't seem to figure it out. Here is a sample of what my file looks like:

>Ref.49_cpx.GM.03.N26677.HQ385479

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC


I need the part after the last period in the header. So the "HQ385479" part. Thanks in advance for the help!

sed accession number • 2.3k views
cut -s -d '.' -f6 input.txt


I need the part after the last period in the header.

Do you want to keep the rest of the alignments intact? I assume so but please clarify.

Edit: Looks like you want to keep just the accessions based on a response below.

You could do this (if all accession lines start with Ref):

grep "^>Ref" input.txt | sed 's/^.*\.//g' > accession
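To see why the substitution leaves only the accession: `.*` is greedy, so `^.*\.` matches everything up to and including the last period on the line. A minimal sketch, using the header from the question as input (the variable names are only for illustration):

```shell
# The header line from the question, used here as test input.
header='>Ref.49_cpx.GM.03.N26677.HQ385479'

# .* is greedy, so ^.*\. consumes everything up to and including
# the LAST period; replacing it with nothing leaves the accession.
accession=$(printf '%s\n' "$header" | sed 's/^.*\.//')

printf '%s\n' "$accession"   # prints HQ385479
```

The `g` flag in the original command is harmless but not needed, since an anchored pattern can match at most once per line.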


Ah yes completely forgot I could use grep first. Thanks.


It's not necessary to use grep. Input (the first sequence copied and pasted as a second sequence, with the ID changed at the end):

>Ref.49_cpx.GM.03.N26677.HQ385479
ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC
>Ref.49_cpx.GM.03.N26677.HQ385478
ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC


output:

$ sed -e '/>/!d; s/.*\.//g' test.fa
HQ385479
HQ385478

Thank you so much for this!

4.2 years ago
Joe 19k

Since the OP has specifically requested sed:

$ sed -i 's/^.*\.//g' input.txt


Gives:

HQ385479

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC
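Note that `-i` rewrites the file in place, so the original headers are gone afterwards. A cautious sketch that keeps a backup copy, assuming a sed that accepts a suffix attached directly to `-i` (GNU and BSD sed both do); `demo.txt` and the scratch directory are hypothetical names used only for this demonstration:

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
printf '%s\n' '>Ref.49_cpx.GM.03.N26677.HQ385479' > "$workdir/demo.txt"

# -i.bak edits demo.txt in place and saves the original as demo.txt.bak.
sed -i.bak 's/^.*\.//' "$workdir/demo.txt"

edited=$(cat "$workdir/demo.txt")       # the rewritten file
backup=$(cat "$workdir/demo.txt.bak")   # the untouched original
printf 'edited: %s\nbackup: %s\n' "$edited" "$backup"

rm -rf "$workdir"
```

Alternatively, dropping `-i` and redirecting the output to a new file leaves the input untouched.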


Thanks! Is there a way within the sed command to remove all the nucleotide sequences as well so I'm just left with all the accession numbers? This is my first time doing any bioinformatics and I am still learning the whole programming/coding side of it all.
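One way this could be done within a single sed command, sketched under the assumption that every header line starts with `>`. The temporary file stands in for the real input, and its contents mirror the sample from the question:

```shell
# Build a small test file mirroring the sample in the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>Ref.49_cpx.GM.03.N26677.HQ385479

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC
EOF

# -n turns off automatic printing. The /^>/ address restricts the
# substitution to header lines, and the trailing p prints a line
# only when the substitution succeeded.
accessions=$(sed -n '/^>/ s/^.*\.//p' "$tmp")
printf '%s\n' "$accessions"   # prints HQ385479

rm -f "$tmp"
```

Sequence lines and blank lines never match the address, so they are silently dropped.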


You can use this:

awk -F. 'NF>1{print $NF}' input.txt > output.txt

For future reference: highlight the text you want to format as code and then click on the "101" button in the edit window to apply the formatting.

Is your file a FASTA-formatted file (header lines beginning with >)? Or is it exactly as you posted above?

It begins with >. It didn't copy in correctly.

I would just chain it to grep personally, but now the solution is getting a bit less elegant.

cat input.txt | grep ">" | sed 's/^.*\.//g'

A very minor change to the code: please add > as the replacement, so that the sequence is still in FASTA format.

$ sed -e 's/^.*\./>/g' test1.fa
>HQ385479

ATGAGAGTGATGGAGACATGGATG-------------ATTTGCAAAATTG
G------TGG---------------------------AGAGGGGGTCTC


The OP said he doesn't want the sequence, just the accession itself (I'm assuming they're making a list for a table or similar), so there's no need to sub in the ">".


Okay, I didn't read the OP in full :)

4.2 years ago
bk11 ▴ 50
awk -F. 'NF>1{print $NF}' input.txt
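For readers new to awk: `-F.` splits each line on periods, `NF` is the number of resulting fields, and `$NF` is the last field. The `NF>1` test therefore selects any line containing at least one period. A slightly stricter sketch, assuming headers always start with `>`, selects on that instead, so that a stray period in a sequence line would not be picked up. The shortened sequences below are placeholders, not real data:

```shell
# Two headers from the thread, with shortened placeholder sequences.
tmp=$(mktemp)
printf '%s\n' \
  '>Ref.49_cpx.GM.03.N26677.HQ385479' \
  'ATGAGAGTGATGG' \
  '>Ref.49_cpx.GM.03.N26677.HQ385478' \
  'ATGAGAGTGATGG' > "$tmp"

# Split on "." and print the last field, but only for header lines.
ids=$(awk -F. '/^>/ {print $NF}' "$tmp")
printf '%s\n' "$ids"

rm -f "$tmp"
```

With this variant, a header that contains no period at all would be printed whole, including the leading `>`.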