extracting fasta sequence from fasta file with multiple genomes
Entering edit mode
2.2 years ago
ss3943 • 0

I have a fasta file with 100+ viral genomes. I am only interested in looking at the sequence for one particular gene, for which I have the coordinates. I have tried bedtools getfasta tools as follows:

head UL48.bed
JQ673480.1      103538  105080

head ref.fa

bedtools getfasta -fi ref.fa -bed UL48.bed

I realize that the issue is the bedfile chromosome does not match that found in the ref.fa file. Each genome only has a unique identifier, however, so they will never match, and it would be extremely time-intensive to manually make a bedfile with each of the unique identifiers. I don't want to edit my ref.fa file to have uniform identifiers, however, because I need this information for downstream processes. Is there any way to use grep or a similar command line tool to do this? So, I need to grep each line beginning with > and the characters ~103538 to ~105080 from each entry in the ref.fa file.

bedtools fasta grep • 1.3k views
Entering edit mode

You can easily grep out the fasta headers and then create a bed file with those.

$ grep "^>" test.fa

$ grep "^>" test.fa | sed 's/>//g' | awk '{OFS="\t"}{print $1,103538,105080}'
test1   103538  105080
test2   103538  105080
test3   103538  105080
test4   103538  105080

After you get a multiple fasta file with sequences you can split the files into individual (if you need to) using: How to split fasta by '>' into a file each containing one sequence, and have the name of that file be the ID?

Entering edit mode
2.2 years ago
JC 13k

Are you sure your genomes are exactly the same length without any insertion/deletion messing up the coordinates?

If yes some Perl code like this will work:


use strict;
use warnings;

my $start = 103538;
my $end = 105080;

$/="\n>"; # read Fasta sequences in blocks
while (<>) {
    my ($seq_id, @seq) = split(/\n/, $_);
    my $seq = join "", @seq;
    my $subseq = substr($seq, $start - 1, $end - $start);
    print ">$seq_id\n$subseq\n";


perl getRegion.pl < INPUT_FASTA > OUTPUT_FASTA
Entering edit mode
2.2 years ago
Divon ▴ 230

This is trivial to do with Genozip:

genozip test.fa 
genocat test.fa.genozip -r JQ673480.1 --no-header --sequential | cut -c 103538-105080

Documentation: https://genozip.com

Paper: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor

Entering edit mode
2.2 years ago

see if this works with awk (assuming that fasta sequence records are in single line and each fasta record has the the required sequence within it):

$ awk -v OFS="\n" '/>/ {getline seq} {print $0, substr(seq,103538,1543)}' *.fa > new.fasta


$ awk '/>/; !/>/ {print substr(seq,103538,1543)}' *.fa > new.fasta

If fasta sequences are not in single line, try seqkit:

$ seqkit --quiet -w 0 subseq -r 103538:105080 *.fa > new.fasta

if you want to append gene name at the end of all fasta headers, do this:

$ seqkit --quiet -w 0 subseq -r 103538:105080 *.fa   | seqkit replace  -p "$" -r "_JQ673480.1" > new.fasta

Login before adding your answer.

Traffic: 1201 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6