Question

Extracting A Region in Nucleotide FASTA Based on Reference FASTA File

0

Entering edit mode

4.1 years ago

Aizan Fahri • 0

Hey Biostar!

I hope everyone is staying healthy and safe.

I am now interested to look at the Nucleocapsid (N) protein of the SARS-CoV-2 and to compare it with other human coronaviruses. I registered my access at GISAID and I was so surprised to know that there are so many isolates already (more than 10,000 records uploaded!).

I studied flu before and flu is relatively easier to study because it has segmented genome. However, coronaviruses don't have segmented genome and trying to extract the protein-coding region has been a challenge. I am curious if there is any tool that would allow me to do this: I feed it a FASTA file containing records of SARS-CoV-2 genome (> 29,000 nts), extract a specific region based on a reference FASTA file containing gene of interest (i.e. the N gene), and then output the extracted region. For example:

awesome_tool --reference N_ref.fasta --input cov-2019.fasta --out extracted-N.fasta

The N_ref.fasta is a normal nucleotide fasta file, like so:

>SARS-CoV-2|N|412AA
AAAAAAAAAAAAAAAAAAAAAAAAAAAA

I did try NCBI ORF Finder. Great tool, but I cannot automate it to iterate over a large number of records.

Thank you!

sequence gene • 1.5k views

ADD COMMENT • link updated 4.1 years ago by GenoMax 142k • written 4.1 years ago by Aizan Fahri • 0

1

Entering edit mode

This should be easy to do using blat/blast+ (of your genomes) and choosing a tabular output format. Then extracting the top hit (which should be full length gene, since these are homologous genomes) using samtools faidx, blastdbcmd -range etc.

Not sure if this will work but you could use seqkit grep with your input gene.

ADD REPLY • link 4.1 years ago by GenoMax 142k

0

Entering edit mode

Why not just get the protein from a genbank record which is already annotated for you?

ADD REPLY • link 4.1 years ago by Joe 21k

1

Entering edit mode

OP wants to get the region/protein from 10K+ GSAID genomes.

ADD REPLY • link 4.1 years ago by GenoMax 142k

0

Entering edit mode

Right, but they can just download a 'reference' for the protein coding region from one of the annotated genomes and use that in GISAID?

ADD REPLY • link 4.1 years ago by Joe 21k

score 8 · Accepted Answer · 2020-04-22

I am demonstrating the extraction for two sequences below. This can be automated for multiple using for loops etc. Should be feasible with protein sequences as well.

Here are your genomes (only fasta headers shown)

$ grep "^>" sequences.fasta
>LC542809 |Severe acute respiratory syndrome coronavirus 2 TKYE6947_2020 RNA| complete genome
>MT114412 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/HKG/HKU-902a/2020| complete genome
>MT114413 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/HKG/HKU-902b/2020| complete genome

Nucleocapsid phosphoprotein (or any other ORF you like, truncated sequence)

$ more nuc.fa
>NC_045512.2:28274-29533
ATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAA
CTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAA
TAATACTGCGTCTTGGTTCACCGCTCTCACTCAACATGGCAAGGAAGACCTTAAATTCCCTCGAGGACAA
GGCGTTCCAATTAACACCAATAGCAGTCCAGATGACCAAATTGGCTACTACCGAAGAGCTACCAGACGAA

Make the blast database

$ makeblastdb -in sequences.fasta -dbtype nucl -parse_seqids

Search using your gene

$ blastn -query nuc.fa -db sequences.fasta -out results.out -outfmt 6

Here is what results look like

$ more results.out
NC_045512.2:28274-29533 LC542809        100.000 1260    0       0       1       1260    28274   29533   0.0   2327
NC_045512.2:28274-29533 MT114413        99.921  1260    1       0       1       1260    28274   29533   0.0   2322
NC_045512.2:28274-29533 MT114412        99.921  1260    1       0       1       1260    28272   29531   0.0   2322

You need to get hit accession and start-stop

$ awk -F "\t" '{OFS="\t"}{print $2,$9,$10}' results.out
LC542809    28274   29533
MT114413    28274   29533
MT114412    28272   29531

Extract sequence in range identified from result above (sequences truncated)

$ blastdbcmd -db sequences.fasta -entry MT114413 -range 28274-29533
>MT114413:28274-29533 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/HKG/HKU-902b/2020| complete genome
ATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAA
CCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCA
CCGCTCTCACTCAACATGGCAAGGAAGACCTTAAATTCCCTCGAGGACAAGGCGTTCCAATTAACACCAATAGCAGTCCA
GATGACCAAATTGGCTACTACCGAAGAGCTACCAGACGAATTCGTGGTGGTGACGGTAAAATGAAAGATCTCAGTCCAAG
ATGGTATTTCTACTACCTAGGAACTGGGCCAGAAGCTGGACTTCCCTATGGTGCTAACAAAGACGGCATCATATGGGTTG

$ blastdbcmd -db sequences.fasta -entry MT114412 -range 28272-29531
>MT114412:28272-29531 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/HKG/HKU-902a/2020| complete genome
ATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAA
CCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCA
CCGCTCTCACTCAACATGGCAAGGAAGACCTTAAATTCCCTCGAGGACAAGGCGTTCCAATTAACACCAATAGCAGTCCA