Question: common sequence between multi fasta file
0
gravatar for Sam
13 months ago by
Sam100
Sam100 wrote:

Dear All

how I can retrieve common seqs between multiple fasta (12 ) files?

Thanks

bash • 650 views
ADD COMMENTlink modified 13 months ago • written 13 months ago by Sam100
1

I already obtain common seqs with this command , is it a correct way ?

comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(sort A1_novel_mirnas.fa) <(sort A2_novel_mirnas.fa)) <(sort B1_novel_mirnas.fa)) <(sort B2_novel_mirnas.fa)) <( sort C1_novel_mirnas.fa)) <(sort C2_novel_mirnas.fa)) <(sort D1_novel_mirnas.fa)) <(sort D2_novel_mirnas.fa)) <(sort E1_novel_mirnas.fa)) <(sort E2_novel_mirnas.fa)) <(sort F1_novel_mirnas.fa)) <(sort F2_novel_mirnas.fa)
ADD REPLYlink modified 10 months ago • written 13 months ago by Sam100

So you have RNA sequence. Can you try replacing U with T and running @5heikki's script again?

ADD REPLYlink written 13 months ago by genomax65k

Hello Sam,

what do you mean by "common seqs". Please describe more clearly what you have and what your goal is.

fin swimmer

ADD REPLYlink written 13 months ago by finswimmer11k

Hi

Common mean sequence which is available in all fasta file. I want to find common sequences between all 12 fasta files.

Thanks in advance

ADD REPLYlink modified 13 months ago • written 13 months ago by Sam100

Are there many sequences in each file or just one per file?

ADD REPLYlink written 13 months ago by genomax65k

about 50 in each file

ADD REPLYlink written 13 months ago by Sam100
1

Doing quick blat searches is one option that should be pretty fast. Other option may be CD-HIT to do pair-wise comparisons.

ADD REPLYlink written 13 months ago by genomax65k

The unix commandline utility comm will be useful here, so long as you can sort your files first. (Assuming ‘common’ in this question means the sequences or headers are identical)

ADD REPLYlink written 13 months ago by jrj.healey12k
  1. make a list of the unique fasta sequences

    cat *.mfa | grep -v ">" | sort -u > unique_sequences

  2. loop each line in all .mfa file

    cat unique_sequences | while read line; do grep -w $line *.mfa; done

  3. you could easily go further and check which sequence is present 12 times

ADD REPLYlink modified 13 months ago • written 13 months ago by lessismore610
1
gravatar for 5heikki
13 months ago by
5heikki8.4k
Finland
5heikki8.4k wrote:

This assumes that 1) blast is installed and that 2) the script is run in the dir where .fasta named files reside

#!/bin/bash
FILES=($(find . -maxdepth 1 -type f -name "*.fasta"))
for((i=0;i<${#FILES[@]}-1;i++)); do
    blastn -query ${FILES[$i]} -subject ${FILES[$i+1]} -outfmt '6 qseqid qlen length slen qseq' \
        | awk 'BEGIN{OFS=FS="\t"}{if($2==$3 && $2==$4){print ">"$1"\n"$5}}' > tmp.fasta
    cp tmp.fasta ${FILES[$i+1]}
done

So we blast first file against second file and put into tmp.fasta only the identical sequences between the files. Then we replace the second file with tmp.fasta. Then we blast that against the third file. Etc. Files will be deleted. If there are common sequences, in the end they will be in the tmp.fasta file. Here we assume that if query length = alignment length = subject length the sequences are the same. Of course, this might not be the case. I leave it for you how to make it more precise blastn -help

ADD COMMENTlink written 13 months ago by 5heikki8.4k

Did you mean to use blast2seq? I don't see blast indexes being made anywhere.

Edit: That is the correct syntax for blast2seq (not having used it on the command line made me check again).

@sam: If you are getting an error then post the exact error or explain what "is not working".

ADD REPLYlink modified 13 months ago • written 13 months ago by genomax65k

unfortunately it's not working and as Genomax mentioned for blastn we need index , which is missing

ADD REPLYlink modified 13 months ago • written 13 months ago by Sam100

If your file names don't end in .fasta then you need to change the code above to account for that.

ADD REPLYlink written 13 months ago by genomax65k

Warning: [blastn] Query is Empty! and it convert all fasta file to empty file!

ADD REPLYlink written 13 months ago by Sam100

I just tested @5heikki's script and it worked flawlessly.

Do all your files have consistent names/format?

ADD REPLYlink written 13 months ago by genomax65k

If the sequences are really short add -task blastn-short to the command. Also, is it possible that you don't have any "universal" seqs?

ADD REPLYlink modified 13 months ago • written 13 months ago by 5heikki8.4k

the length of seqs are about 20 bp , how ever I tested with blastn-short and it produced a huge temp file over than 600 Mb from just 600 seqs (total amount of seqs from all files is about 600 ) . when I lunched it with just 4 lib it was fine but for all 12 libs it's not work correctly

ADD REPLYlink modified 13 months ago • written 13 months ago by Sam100

Check from blastn -help output how to add a field to outfmt for sequence similarity and filter also based on that in the awk command. In case the "temp" file is still huge, you must have a lot of redundancy in your input.

ADD REPLYlink written 13 months ago by 5heikki8.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1361 users visited in the last hour