Question: merge two multifasta files and eliminate sequences with duplicated headers from the first
Pawel Osipowski (Warsaw, Poland) wrote 2.7 years ago:

Hi, I have two multifasta files. I want to merge them, deleting from the first file every sequence whose header also appears in the second file. The comparison has to be done by header, because the sequences under the same header differ between the two files.

Alternatively, could somebody give me a hint on how to make bcftools consensus output all the contigs (even the unchanged ones)?

Thanks, Pawel

modified 2.7 years ago by Jorge Amigo • written 2.7 years ago by Pawel Osipowski
Brian Bushnell (Walnut Creek, USA) wrote 2.7 years ago:

You can use the BBMap package like this:

filterbyname.sh in=file1.fasta names=file2.fasta exclude out=file1_filtered.fasta
cat file1_filtered.fasta file2.fasta > combined.fasta
written 2.7 years ago by Brian Bushnell

Hi Brian, first of all, you have my deep admiration for the tools you have produced. I have been using them for a year! When is a paper coming? I would cite it with pleasure! Your method worked best: RepeatMasker accepted the converted first multifasta without any additional checks, unlike the output of the two other methods I tried.

written 2.7 years ago by Pawel Osipowski

Hi Pawel,

A paper on one of the tools should be submitted by the end of next week... I'll probably do some kind of short write-up of the suite overall soon, too, just to make it easier to cite.

-Brian

written 2.7 years ago by Brian Bushnell
Jorge Amigo (Santiago de Compostela, Spain) wrote 2.7 years ago:

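If the command line is an option, here's a Perl alternative:

cat file1.fasta file2.fasta | perl -ne '
# a header line starts a new record and drops any earlier record with the same header
if (/^>/) {
  $header = $_;
  delete $seqs{$header};
} else { $seqs{$header} .= $_ }  # append sequence lines to the current record
END {
  # hash keys come back in arbitrary order, so records are not printed in input order
  foreach $header (keys %seqs) {
    print $header.$seqs{$header};
  }
}'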

Any file placed later in the initial cat will overwrite earlier sequences that have the same header, as requested.
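For readers who prefer Python, the same header-based merge can be sketched as follows. This is only an illustration, not part of the original answer: the function names `parse_fasta`, `merge_fasta` and `format_fasta` are made up for this example, and the sketch assumes that headers must match exactly (whole line, ignoring the line ending). Unlike the Perl hash, it keeps the records in input order.

```python
def parse_fasta(lines):
    """Yield (header, sequence_lines) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, seq


def merge_fasta(first_lines, second_lines):
    """Keep records of the first file whose header is absent from the second,
    then append every record of the second file."""
    second = list(parse_fasta(second_lines))
    second_headers = {header for header, _ in second}
    kept = [(h, s) for h, s in parse_fasta(first_lines) if h not in second_headers]
    return kept + second


def format_fasta(records):
    """Turn (header, sequence_lines) pairs back into FASTA text."""
    return "".join(h + "\n" + "".join(l + "\n" for l in s) for h, s in records)


# Hypothetical usage with the file names from this thread:
# with open("file1.fasta") as f1, open("file2.fasta") as f2:
#     merged = merge_fasta(f1, f2)
# with open("combined.fasta", "w") as out:
#     out.write(format_fasta(merged))
```

Because each file is parsed separately, a missing newline at the end of the first file cannot fuse its last sequence line with the first header of the second file.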

modified 2.7 years ago • written 2.7 years ago by Jorge Amigo

Many thanks Jorge, very neat solution. Do you think there might be some problems with the ends of the last lines in the files? RepeatMasker says some headers are too long, but they shouldn't be. There are over 6k sequences and I'm slow at scripting.

written 2.7 years ago by Pawel Osipowski

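It shouldn't have any problem with line endings inside a file, since it doesn't remove them when reading the input. One caveat: if file1.fasta does not end with a newline, cat will glue its last line to the first header of file2.fasta, so check that first. You can force a clean newline after each line, and trim each line to its first whitespace-delimited token, with this alternative:

cat file1.fasta file2.fasta | perl -ne '
# keep only the first run of non-whitespace characters; skip blank lines
next unless /^(\S+)/;
$line = $1;
if ($line =~ /^>/) {
  $header = "$line\n";
  delete $seqs{$header};
} else { $seqs{$header} .= "$line\n" }
END {
  foreach $header (keys %seqs) {
    print $header.$seqs{$header};
  }
}'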
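
The clean-up step in that snippet (drop everything after the first whitespace in a header, and end every line with a single newline) can also be sketched in Python. This is a hedged illustration only: `normalize_lines` is a hypothetical helper name, and whether shorter headers are enough to satisfy RepeatMasker is an assumption that has to be checked against your own data.

```python
def normalize_lines(lines):
    """Yield FASTA lines with header descriptions dropped and "\n" line endings."""
    for line in lines:
        stripped = line.strip()  # removes "\r", "\n" and surrounding blanks
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith(">"):
            # keep only the identifier, i.e. the first whitespace-delimited token
            yield stripped.split()[0] + "\n"
        else:
            yield stripped + "\n"


# Hypothetical usage with a file name from this thread:
# with open("file1.fasta") as src, open("file1_clean.fasta", "w") as out:
#     out.writelines(normalize_lines(src))
```

Running each input file through such a step before merging also guarantees that every file ends with a newline, so a plain cat of the cleaned files is safe.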
modified 2.7 years ago • written 2.7 years ago by Jorge Amigo