Question

extracting fasta sequence from fasta file with multiple genomes

0

Entering edit mode

2.7 years ago

ss3943 • 0

I have a fasta file with 100+ viral genomes. I am only interested in looking at the sequence for one particular gene, for which I have the coordinates. I have tried bedtools getfasta tools as follows:

head UL48.bed
JQ673480.1      103538  105080

head ref.fa
>KT425109.1 
ATAAACCAACGAAAAGCGCGGGAACGGGG....

bedtools getfasta -fi ref.fa -bed UL48.bed

I realize that the issue is the bedfile chromosome does not match that found in the ref.fa file. Each genome only has a unique identifier, however, so they will never match, and it would be extremely time-intensive to manually make a bedfile with each of the unique identifiers. I don't want to edit my ref.fa file to have uniform identifiers, however, because I need this information for downstream processes. Is there any way to use grep or a similar command line tool to do this? So, I need to grep each line beginning with > and the characters ~103538 to ~105080 from each entry in the ref.fa file.

bedtools fasta grep • 1.6k views

ADD COMMENT • link updated 2.7 years ago by cpad0112 21k • written 2.7 years ago by ss3943 • 0

0

Entering edit mode

You can easily grep out the fasta headers and then create a bed file with those.

$ grep "^>" test.fa
>test1
>test2
>test3
>test4

$ grep "^>" test.fa | sed 's/>//g' | awk '{OFS="\t"}{print $1,103538,105080}'
test1   103538  105080
test2   103538  105080
test3   103538  105080
test4   103538  105080

After you get a multiple fasta file with sequences you can split the files into individual (if you need to) using: How to split fasta by '>' into a file each containing one sequence, and have the name of that file be the ID?

ADD REPLY • link 2.7 years ago by GenoMax 141k

score 1 · Answer 1 · 2021-07-26

Are you sure your genomes are exactly the same length without any insertion/deletion messing up the coordinates?

If yes some Perl code like this will work:

#!/usr/bin/perl

use strict;
use warnings;

my $start = 103538;
my $end = 105080;

$/="\n>"; # read Fasta sequences in blocks
while (<>) {
    s/>//g;
    my ($seq_id, @seq) = split(/\n/, $_);
    my $seq = join "", @seq;
    my $subseq = substr($seq, $start - 1, $end - $start);
    print ">$seq_id\n$subseq\n";
}

Usage:

perl getRegion.pl < INPUT_FASTA > OUTPUT_FASTA

score 0 · Answer 2 · 2021-07-26

0

Entering edit mode

2.7 years ago

Divon ▴ 230

This is trivial to do with Genozip:

genozip test.fa 
genocat test.fa.genozip -r JQ673480.1 --no-header --sequential | cut -c 103538-105080

Documentation: https://genozip.com

Paper: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor

ADD COMMENT • link 2.7 years ago by Divon ▴ 230

score 0 · Answer 3 · 2021-07-27

see if this works with awk (assuming that fasta sequence records are in single line and each fasta record has the the required sequence within it):

$ awk -v OFS="\n" '/>/ {getline seq} {print $0, substr(seq,103538,1543)}' *.fa > new.fasta

or

$ awk '/>/; !/>/ {print substr(seq,103538,1543)}' *.fa > new.fasta

If fasta sequences are not in single line, try seqkit:

$ seqkit --quiet -w 0 subseq -r 103538:105080 *.fa > new.fasta

if you want to append gene name at the end of all fasta headers, do this:

$ seqkit --quiet -w 0 subseq -r 103538:105080 *.fa   | seqkit replace  -p "$" -r "_JQ673480.1" > new.fasta