I am demonstrating the extraction for two sequences below. This can be automated for multiple using for loops etc. Should be feasible with protein sequences as well.
Here are your genomes (only fasta headers shown)
$ grep "^>" sequences.fasta
>LC542809 |Severe acute respiratory syndrome coronavirus 2 TKYE6947_2020 RNA| complete genome
>MT114412 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/HKG/HKU-902a/2020| complete genome
>MT114413 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/HKG/HKU-902b/2020| complete genome
Nucleocapsid phosphoprotein (or any other ORF you like, truncated sequence)
$ more nuc.fa
>NC_045512.2:28274-29533
ATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAA
CTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAA
TAATACTGCGTCTTGGTTCACCGCTCTCACTCAACATGGCAAGGAAGACCTTAAATTCCCTCGAGGACAA
GGCGTTCCAATTAACACCAATAGCAGTCCAGATGACCAAATTGGCTACTACCGAAGAGCTACCAGACGAA
Make the blast database
$ makeblastdb -in sequences.fasta -dbtype nucl -parse_seqids
Search using your gene
$ blastn -query nuc.fa -db sequences.fasta -out results.out -outfmt 6
Here is what results look like
$ more results.out
NC_045512.2:28274-29533 LC542809 100.000 1260 0 0 1 1260 28274 29533 0.0 2327
NC_045512.2:28274-29533 MT114413 99.921 1260 1 0 1 1260 28274 29533 0.0 2322
NC_045512.2:28274-29533 MT114412 99.921 1260 1 0 1 1260 28272 29531 0.0 2322
You need to get hit accession and start-stop
$ awk -F "\t" '{OFS="\t"}{print $2,$9,$10}' results.out
LC542809 28274 29533
MT114413 28274 29533
MT114412 28272 29531
Extract sequence in range identified from result above (sequences truncated)
$ blastdbcmd -db sequences.fasta -entry MT114413 -range 28274-29533
>MT114413:28274-29533 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/HKG/HKU-902b/2020| complete genome
ATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAA
CCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCA
CCGCTCTCACTCAACATGGCAAGGAAGACCTTAAATTCCCTCGAGGACAAGGCGTTCCAATTAACACCAATAGCAGTCCA
GATGACCAAATTGGCTACTACCGAAGAGCTACCAGACGAATTCGTGGTGGTGACGGTAAAATGAAAGATCTCAGTCCAAG
ATGGTATTTCTACTACCTAGGAACTGGGCCAGAAGCTGGACTTCCCTATGGTGCTAACAAAGACGGCATCATATGGGTTG
$ blastdbcmd -db sequences.fasta -entry MT114412 -range 28272-29531
>MT114412:28272-29531 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/HKG/HKU-902a/2020| complete genome
ATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAA
CCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCA
CCGCTCTCACTCAACATGGCAAGGAAGACCTTAAATTCCCTCGAGGACAAGGCGTTCCAATTAACACCAATAGCAGTCCA
This should be easy to do using
blat/blast+
(of your genomes) and choosing a tabular output format. Then extracting the top hit (which should be full length gene, since these are homologous genomes) usingsamtools faidx
,blastdbcmd -range
etc.Not sure if this will work but you could use
seqkit grep
with your input gene.Why not just get the protein from a genbank record which is already annotated for you?
OP wants to get the region/protein from 10K+ GSAID genomes.
Right, but they can just download a 'reference' for the protein coding region from one of the annotated genomes and use that in GISAID?