extracting fasta sequence from fasta file with multiple genomes
Entering edit mode
7 weeks ago
ss3943 • 0

I have a fasta file with 100+ viral genomes. I am only interested in looking at the sequence for one particular gene, for which I have the coordinates. I have tried bedtools getfasta tools as follows:

head UL48.bed
JQ673480.1      103538  105080

head ref.fa

bedtools getfasta -fi ref.fa -bed UL48.bed

I realize that the issue is the bedfile chromosome does not match that found in the ref.fa file. Each genome only has a unique identifier, however, so they will never match, and it would be extremely time-intensive to manually make a bedfile with each of the unique identifiers. I don't want to edit my ref.fa file to have uniform identifiers, however, because I need this information for downstream processes. Is there any way to use grep or a similar command line tool to do this? So, I need to grep each line beginning with > and the characters ~103538 to ~105080 from each entry in the ref.fa file.

bedtools fasta grep • 359 views
Entering edit mode

You can easily grep out the fasta headers and then create a bed file with those.

$ grep "^>" test.fa

$ grep "^>" test.fa | sed 's/>//g' | awk '{OFS="\t"}{print $1,103538,105080}'
test1   103538  105080
test2   103538  105080
test3   103538  105080
test4   103538  105080

After you get a multiple fasta file with sequences you can split the files into individual (if you need to) using: How to split fasta by '>' into a file each containing one sequence, and have the name of that file be the ID?

Entering edit mode
7 weeks ago
JC 12k

Are you sure your genomes are exactly the same length without any insertion/deletion messing up the coordinates?

If yes some Perl code like this will work:


use strict;
use warnings;

my $start = 103538;
my $end = 105080;

$/="\n>"; # read Fasta sequences in blocks
while (<>) {
    my ($seq_id, @seq) = split(/\n/, $_);
    my $seq = join "", @seq;
    my $subseq = substr($seq, $start - 1, $end - $start);
    print ">$seq_id\n$subseq\n";


perl getRegion.pl < INPUT_FASTA > OUTPUT_FASTA
Entering edit mode
7 weeks ago
Divon ▴ 40

This is trivial to do with Genozip:

genozip test.fa 
genocat test.fa.genozip -r JQ673480.1 --no-header --sequential | cut -c 103538-105080

Documentation: https://genozip.com

Paper: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor

Entering edit mode
7 weeks ago

see if this works with awk (assuming that fasta sequence records are in single line and each fasta record has the the required sequence within it):

$ awk -v OFS="\n" '/>/ {getline seq} {print $0, substr(seq,103538,1543)}' *.fa > new.fasta


$ awk '/>/; !/>/ {print substr(seq,103538,1543)}' *.fa > new.fasta

If fasta sequences are not in single line, try seqkit:

$ seqkit --quiet -w 0 subseq -r 103538:105080 *.fa > new.fasta

if you want to append gene name at the end of all fasta headers, do this:

$ seqkit --quiet -w 0 subseq -r 103538:105080 *.fa   | seqkit replace  -p "$" -r "_JQ673480.1" > new.fasta

Login before adding your answer.

Traffic: 1194 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6