How to extract two genomic location numbers within the following fasta header?

2.6 years ago

mrj ▴ 170

I would like to extract the two numbers (the start and end coordinates) from the location field of the following FASTA header.

>lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] [protein=copper oxidase] [protein_id=AYW77996.1] [location=1885267..1887939] [gbkey=CDS]

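# awk: split on "location=", "]" or ".."; with this header layout the coordinates are fields 5 and 6
$ awk -F 'location=|]|[.]{2}' '/^>/ {print $5,$6}' test.fa
# sed: capture the two numbers on either side of ".." and print them tab-separated
$ sed -rn '/^>/ s/(.*location=)([0-9]+)\.\.([0-9]+)].*$/\2\t\3/p' test.fa
# grep (PCRE): print the text between "location=" and the closing "] ", then turn ".." into a tab
$ grep -Po "(?<=location=).*(?=]\s.*)" test.fa | tr -s '.' '\t'
# seqkit: reduce each header to its location value, print the headers only, then turn ".." into a tab
$ seqkit replace -p '.*location=(.*)]\s.*' -r '${1}' test.fa | seqkit seq -n  | sed -r 's/\.{2}/\t/'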

Reply by cpad0112, 2.6 years ago

Thank you so much for this solution. It works for me. I am learning a lot from your solution.

Reply by mrj, 2.6 years ago

In Python

Suppose your header is stored in a variable named header:

header.partition("location=")[2].partition("]")[0].split('..')

This returns the list ['1885267', '1887939'], which you can easily manipulate.

It only works if the header contains the location= keyword; otherwise it returns [''] (a list containing a single empty string) rather than the two coordinates.
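
For headers that may lack a location field, a regular expression can make the missing case explicit. The following is a minimal sketch and not part of the original answer: the function name parse_location and the pattern are illustrative assumptions, and it handles only the plain [location=START..END] form shown in the question (a location wrapped in something like complement(...) would give None).

```python
import re

# Assumption: only the plain "[location=START..END]" form is matched.
LOCATION_RE = re.compile(r"\[location=(\d+)\.\.(\d+)\]")


def parse_location(header):
    """Return (start, end) as integers, or None if no plain location is found."""
    match = LOCATION_RE.search(header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


header = (
    ">lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] "
    "[protein=copper oxidase] [protein_id=AYW77996.1] "
    "[location=1885267..1887939] [gbkey=CDS]"
)

print(parse_location(header))                    # (1885267, 1887939)
print(parse_location(">seq1 no location here"))  # None
```

Returning None instead of a list lets the caller tell "no location found" apart from a real pair of coordinates, and converting to int makes the values ready for arithmetic such as end - start + 1.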

Reply by Renesh, 2.6 years ago

Hello Renesh, thanks. This is much simpler and does the task perfectly.

Thank you so much.

Reply by mrj, 2.6 years ago
