Question: find difference sequences with different header
0
gravatar for Jason
5 months ago by
Jason0
Jason0 wrote:

I have two fasta file with 200 sequences. I want to use shell commands to find difference sequences with different headers between theses two fasta files and save that in new file with heder from file as output.

this command find match sequences even if header is not same. it look at match sequences and skip header.

seqkit common --by-seq --ignore-case file1.fasta file2.fasta file3.fasta > out.fasta

I want to find difference sequences from file 1 not find in file 2 and save that in file 3. we want to compare by sequences not header becasue both file have different header but some sequences same and other are different.

file1:
>NP_000009.1 very long-chain specific acyl-CoA dehydrogenase, mitochondrial isoform 1 precursor [Homo sapiens]
MQAARMAASLGRQLLRLGGGSSRLTALLGQPRPGPARRPYAGGAAQLALDKSDSHPSDALTRKKPAKAES
KSFAVGMFKGQLTTDQVFPYPSVLNEEQTQFLKELVEPVSRFFEEVNDPAKNDALEMVEETTWQGLKELG
AFGLQVPSELGGVGLCNTQYARLVEIVGMHDLGVGITLGAHQSIGFKGILLFGTKAQKEKYLPKLASGET
VAAFCLTEPSSGSDAASIRTSAVPSPCGKYYTLNGSKLWISNGGLADIFTVFAKTPVTDPATGAVKEKIT
AFVVERGFGGITHGPPEKKMGIKASNTAEVFFDGVRVPSENVLGEVGSGFKVAMHILNNGRFGMAAALAG
TMRGIIAKAVDHATNRTQFGEKIHNFGLIQEKLARMVMLQYVTESMAYMVSANMDQGATDFQIEAAISKI
FGSEAAWKVTDECIQIMGGMGFMKEPGVERVLRDLRIFRIFEGTNDILRLFVALQGCMDKGKELSGLGSA
LKNPFGNAGLLLGEAGKQLRRRAGLGSGLSLSGLVHPELSRSGELAVRALEQFATVVEAKLIKHKKGIVN
EQFLLQRLADGAIDLYAMVVVLSRASRSLSEGHPTAQHEKMLCDTWCIEAAARIREGMAALQSDPWQQEL
YRNFKSISKALVERGGVVTSNPLGF


>NP_000010.1 acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]
MAVLAALLRSGARSRSPLLRRLVQEIRYVERSYVSKPTLKEVVIVSATRTPIGSFLGSLSLLPATKLGSI
AIQGAIEKAGIPKEEVKEAYMGNVLQGGEGQAPTRQAVLGAGLPISTPCTTINKVCASGMKAIMMASQSL
MCGHQDVMVAGGMESMSNVPYVMNRGSTPYGGVKLEDLIVKDGLTDVYNKIHMGSCAENTAKKLNIARNE
QDAYAINSYTRSKAAWEAGKFGNEVIPVTVTVKGQPDVVVKEDEEYKRVDFSKVPKLKTVFQKENGTVTA
ANASTLNDGAAALVLMTADAAKRLNVTPLARIVAFADAAVEPIDFPIAPVYAASMVLKDVGLKKEDIAMW
EVNEAFSLVVLANIKMLEIDPQKVNINGGAVSLGHPIGMSGARIVGHLTHALKQGEYGLASICNGGGGAS
AMLIQKL


file2:
>sp|Q8R519|ACMSD_MOUSE 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Mus musculus GN=Acmsd PE=1 SV=2
MKIDIHTHILPKEWPDLEKRFGYGGWVQLQQQGKGEAKMIKDGKLFRVIQQNCWDPEVRI
REMNQKGVTVQALSTVPVMFSYWAKPKDTLELCQFLNNDLAATVARYPRRFVGLGTLPMQ
APELAVEEMERCVKALGFPGIQIGSHINTWDLNDPELFPIYAAAERLNCSLFVHPWDMQM
DGRMAKYWLPWLVGMPSETTMAICSMIMGGVFEKFPKLKVCFAHGGGAFPFTIGRIAHGF
NMRPDLCAQDNPSDPRKYLGSFYTDSLVHDPLSLKLLTDVIGKDKVMLGTDYPFPLGEQE
PGKLIESMAEFDEETKDKLTAGNALAFLGLERKLFE

>sp|P35738|ODBB_RAT 2-oxoisovalerate dehydrogenase subunit beta, mitochondrial OS=Rattus norvegicus GN=Bckdhb PE=1 SV=3
MQAARMAASLGRQLLRLGGGSSRLTALLGQPRPGPARRPYAGGAAQLALDKSDSHPSDALTRKKPAKAES
KSFAVGMFKGQLTTDQVFPYPSVLNEEQTQFLKELVEPVSRFFEEVNDPAKNDALEMVEETTWQGLKELG
AFGLQVPSELGGVGLCNTQYARLVEIVGMHDLGVGITLGAHQSIGFKGILLFGTKAQKEKYLPKLASGET
VAAFCLTEPSSGSDAASIRTSAVPSPCGKYYTLNGSKLWISNGGLADIFTVFAKTPVTDPATGAVKEKIT
AFVVERGFGGITHGPPEKKMGIKASNTAEVFFDGVRVPSENVLGEVGSGFKVAMHILNNGRFGMAAALAG
TMRGIIAKAVDHATNRTQFGEKIHNFGLIQEKLARMVMLQYVTESMAYMVSANMDQGATDFQIEAAISKI
FGSEAAWKVTDECIQIMGGMGFMKEPGVERVLRDLRIFRIFEGTNDILRLFVALQGCMDKGKELSGLGSA
LKNPFGNAGLLLGEAGKQLRRRAGLGSGLSLSGLVHPELSRSGELAVRALEQFATVVEAKLIKHKKGIVN
EQFLLQRLADGAIDLYAMVVVLSRASRSLSEGHPTAQHEKMLCDTWCIEAAARIREGMAALQSDPWQQEL
YRNFKSISKALVERGGVVTSNPLGF

>sp|P26149|3BHS2_MOUSE 3 beta-hydroxysteroid dehydrogenase/Delta 5-->4-isomerase type 2 OS=Mus musculus GN=Hsd3b2 PE=1 SV=4
MPGWSCLVTGAGGFLGQRIIQLLVQEEDLEEIRVLDKVFRPETRKEFFNLETSIKVTVLE
GDILDTQYLRRACQGISVVIHTAAIIDVTGVIPRQTILDVNLKGTQNLLEACIQASVPAF
IFSSSVDVAGPNSYKEIVLNGHEEECHESTWSDPYPYSKKMAEKAVLAANGSMLKNGGTL
QTCALRPMCIYGERSPLISNIIIMALKHKGILRSFGKFNTANPVYVGNVAWAHILAARGL
RDPKKSPNIQGEFYYISDDTPHQSFDDISYTLSKEWGFCLDSSWSLPVPLLYWLAFLLET
VSFLLSPIYRYIPPFNRHLVTLSGSTFTFSYKKAQRDLGYEPLVSWEEAKQKTSEWIGTL
VEQHRETLDTKSQ




result file 3

>NP_000010.1 acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]
MAVLAALLRSGARSRSPLLRRLVQEIRYVERSYVSKPTLKEVVIVSATRTPIGSFLGSLSLLPATKLGSI
AIQGAIEKAGIPKEEVKEAYMGNVLQGGEGQAPTRQAVLGAGLPISTPCTTINKVCASGMKAIMMASQSL
MCGHQDVMVAGGMESMSNVPYVMNRGSTPYGGVKLEDLIVKDGLTDVYNKIHMGSCAENTAKKLNIARNE
QDAYAINSYTRSKAAWEAGKFGNEVIPVTVTVKGQPDVVVKEDEEYKRVDFSKVPKLKTVFQKENGTVTA
ANASTLNDGAAALVLMTADAAKRLNVTPLARIVAFADAAVEPIDFPIAPVYAASMVLKDVGLKKEDIAMW
EVNEAFSLVVLANIKMLEIDPQKVNINGGAVSLGHPIGMSGARIVGHLTHALKQGEYGLASICNGGGGAS
AMLIQKL
sequence • 361 views
ADD COMMENTlink modified 5 months ago by cpad01129.0k • written 5 months ago by Jason0
2
gravatar for finswimmer
5 months ago by
finswimmer5.4k
Germany
finswimmer5.4k wrote:

Hello,

what about:

cat file1.fasta file2.fasta|seqkit rmdup -s -o file3.fa

fin swimmer

EDIT:

Another way is to first find all sequences they have in common, extract there IDs and than look in your first file for all sequences that do not match these IDs:

Find common sequences between file1 and file2 and extract ther IDs:

seqkit common -s file1.fa file2.fa|grep '>'|cut -c2- > common_ids

Get all sequences from file1 that do not match the IDs in common_ids and store the result in file3.fa:

seqkit grep file1.fa -v -n -f common_ids -o file3.fa
ADD COMMENTlink modified 5 months ago • written 5 months ago by finswimmer5.4k

Hey,

When I used this command it gave me this error and i'm sure about the name of files: cat file1.fasta file2.fasta|seqkit rmdup seqkit rmdup -s -o file3.fa

[ERRO] fastx: open seqkit: no such file or directory

How can I resolve this problem?

If I change the command to this command cat file1.fasta file2.fasta|seqkit rmdup -s -o file3.fasta [INFO] 3 duplicated records removed

It will remove duplicate files and will not save the difference in file 3

thanks

ADD REPLYlink written 5 months ago by Jason0

Oops, I copied the command twice. See me updated one.

ADD REPLYlink written 5 months ago by finswimmer5.4k

Hey, I applied this code

cat file1.fasta file2.fasta|seqkit rmdup -s -o file3.fa

the result saved in file3 as :

>NP_000009.1 very long-chain specific acyl-CoA dehydrogenase, mitochondrial isoform 1 precursor [Homo sapiens] MQAARMAASLGRQLLRLGGGSSRLTALLGQPRPGPARRPYAGGAAQLALDKSDSHPSDALTRKKPAKAESKSFAVGMFKGQLTTDQVFPYPSVLNEEQTQFLKELVEPVSRFFEEVNDPAKNDALEMVEETTWQGLKELG AFGLQVPSELGGVGLCNTQYARLVEIVGMHDLGVGITLGAHQSIGFKGILLFGTKAQKEKYLPKLASGETVAAFCLTEPSSGSDAASIRTSAVPSPCGKYYTLNGSKLWISNGGLADIFTVFAKTPVTDPATGAVKEKIT AFVVERGFGGITHGPPEKKMGIKASNTAEVFFDGVRVPSENVLGEVGSGFKVAMHILNNGRFGMAAALAGTMRGIIAKAVDHATNRTQFGEKIHNFGLIQEKLARMVMLQYVTESMAYMVSANMDQGATDFQIEAAISKI FGSEAAWKVTDECIQIMGGMGFMKEPGVERVLRDLRIFRIFEGTNDILRLFVALQGCMDKGKELSGLGSALKNPFGNAGLLLGEAGKQLRRRAGLGSGLSLSGLVHPELSRSGELAVRALEQFATVVEAKLIKHKKGIVN EQFLLQRLADGAIDLYAMVVVLSRASRSLSEGHPTAQHEKMLCDTWCIEAAARIREGMAALQSDPWQQELYRNFKSISKALVERGGVVTSNPLGF

This is not what I expected because this sequence found in both file 1 and file 2. I want to save sequence find in file1 but not find in file 2

the expect result should be like this in file 3

>NP_000010.1 acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens] MAVLAALLRSGARSRSPLLRRLVQEIRYVERSYVSKPTLKEVVIVSATRTPIGSFLGSLSLLPATKLGSI AIQGAIEKAGIPKEEVKEAYMGNVLQGGEGQAPTRQAVLGAGLPISTPCTTINKVCASGMKAIMMASQSLMCGHQDVMVAGGMESMSNVPYVMNRGSTPYGGVKLEDLIVKDGLTDVYNKIHMGSCAENTAKKLNIARNE QDAYAINSYTRSKAAWEAGKFGNEVIPVTVTVKGQPDVVVKEDEEYKRVDFSKVPKLKTVFQKENGTVTAANASTLNDGAAALVLMTADAAKRLNVTPLARIVAFADAAVEPIDFPIAPVYAASMVLKDVGLKKEDIAMW EVNEAFSLVVLANIKMLEIDPQKVNINGGAVSLGHPIGMSGARIVGHLTHALKQGEYGLASICNGGGGASAMLIQKL

thanks

ADD REPLYlink modified 5 months ago by genomax55k • written 5 months ago by Jason0

Hm, strange. You're right. I think this shouldn't happen.

[I moved my answer as an edit in the first post.]

fin swimmer

ADD REPLYlink modified 5 months ago • written 5 months ago by finswimmer5.4k

Thank you so much for your hep

ADD REPLYlink written 5 months ago by Jason0

Fine if I could help you.

I moved my answers as an edit to my first post. So if you think this is your solution you can mark this post as accepted.

fin swimmer

ADD REPLYlink written 5 months ago by finswimmer5.4k

Hello,

After I made some analysis for many sequences I found this code will hold some common sequences in both file 1 and file 2 and save that in file 3

Find common sequences between file1 and file2 and extract ther IDs: seqkit common -s file1.fa file2.fa|grep '>'|cut -c2- > common_ids

Get all sequences from file1 that do not match the IDs in common_ids and store the result in file3.fa: seqkit grep file1.fa -v -n -f common_ids -o file3.fa

We get all sequences from file1 that do not match the IDs in common_ids and store the result in file3.fa, but the problem here some sequences in file 2 have different index but same identical sequences. That mean some sequences from file 2 will be in result3. I want to save in file 3 all sequences from file 1 not find in file 2.

ADD REPLYlink written 5 months ago by Jason0

Hello Jason,

please use the code tag to mark code within your post. It is much more readable.

I'm confused with your answer. seqkit common -s file1.fa file2.fa is doing a comparison by sequence (ignoring differences in the ID) and give you all sequences back that are in both files. These sequences get the ID from the first file. With grep and cut we extract these IDs to remove those sequences from file1 in the next step leaving behind only sequences that are in file1 but not in file2.

fin swimmer

ADD REPLYlink written 5 months ago by finswimmer5.4k
0
gravatar for Pierre Lindenbaum
5 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum112k wrote:

linearize both files, sort and comm on sequence:

comm -3 \
      <(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' f1.fasta | cut -f 2 | sort)  \
      <(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' f2.fasta | cut -f 2 | sort)
ADD COMMENTlink modified 5 months ago • written 5 months ago by Pierre Lindenbaum112k

Hey, This code did not do what I expect it.

thanks

ADD REPLYlink written 5 months ago by Jason0
0
gravatar for cpad0112
5 months ago by
cpad01129.0k
India
cpad01129.0k wrote:

output:

   $ seqkit grep -svf <(seqkit seq -s -w 0 test2.fa) <(seqkit seq -w 0 test1.fa)

  >NP_000010.1 acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]
    MAVLAALLRSGARSRSPLLRRLVQEIRYVERSYVSKPTLKEVVIVSATRTPIGSFLGSLS
    LLPATKLGSIAIQGAIEKAGIPKEEVKEAYMGNVLQGGEGQAPTRQAVLGAGLPISTPCT
    TINKVCASGMKAIMMASQSLMCGHQDVMVAGGMESMSNVPYVMNRGSTPYGGVKLEDLIV
    KDGLTDVYNKIHMGSCAENTAKKLNIARNEQDAYAINSYTRSKAAWEAGKFGNEVIPVTV
    TVKGQPDVVVKEDEEYKRVDFSKVPKLKTVFQKENGTVTAANASTLNDGAAALVLMTADA
    AKRLNVTPLARIVAFADAAVEPIDFPIAPVYAASMVLKDVGLKKEDIAMWEVNEAFSLVV
    LANIKMLEIDPQKVNINGGAVSLGHPIGMSGARIVGHLTHALKQGEYGLASICNGGGGAS
    AMLIQKL

input (not pasting the sequences):

$ seqkit seq -n test2.fa 
sp|Q8R519|ACMSD_MOUSE 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Mus musculus GN=Acmsd PE=1 SV=2
sp|P35738|ODBB_RAT 2-oxoisovalerate dehydrogenase subunit beta, mitochondrial OS=Rattus norvegicus GN=Bckdhb PE=1 SV=3
sp|P26149|3BHS2_MOUSE 3 beta-hydroxysteroid dehydrogenase/Delta 5-->4-isomerase type 2 OS=Mus musculus GN=Hsd3b2 PE=1 SV=4

$ seqkit seq -n test1.fa 
NP_000009.1 very long-chain specific acyl-CoA dehydrogenase, mitochondrial isoform 1 precursor [Homo sapiens]
NP_000010.1 acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]
ADD COMMENTlink modified 5 months ago • written 5 months ago by cpad01129.0k

Hi, This code will do same mistake because some sequences in file 2 have different id but same identical sequences in file 1.

We get all sequences from file1 that do not match the IDs in common_ids and store the result in file3.fa, but the problem here some sequences in file 2 have different index but same identical sequences. That mean some sequences from file 2 will be in result3. I want to save in file 3 all sequences from file 1 not find in file 2.

ADD REPLYlink written 5 months ago by Jason0

Well, I think this reply meant for finswimmer post above.

ADD REPLYlink written 5 months ago by cpad01129.0k

Hi, I also tested your code. It gave me some sequences find in both file 1 and file 2. it will not save sequences in file1 that not find in file 2.

The problem here some sequences in file 2 have different index but same identical sequences. That mean some sequences from file 2 will be in result. I want to save in file 3 all sequences from file 1 not find in file 2.

ADD REPLYlink written 5 months ago by Jason0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1407 users visited in the last hour