Question: compare fasta files by sequence
0
gravatar for Chris
4.7 years ago by
Chris30
Chris30 wrote:

Hi, I have two fasta files with differetn headers. Sequences in file 2 partially match sequences in file1

I want to have a one-liner that will compare and find the partial match between the two files and report only sequences in file1.

Example of the two files and desired output.

thank you so much for your help in adavance

file1.fa

>chr5:111261537-111261740
AGAGTGCTAACATGCTAACATCAAATTCTATCCATCAGATCCAGCATCTACGGCTGAGTGCAATTCAACCTGAGGATGGCGCCCTATTCCGTGAAACCGTAGAGATTCTTTAGAAAATAGTCTAATAGAGGAGGTGGGATTGTACTTAGCCGTAGATGGTGGATCCGATGGTCAGGATTTGATGTTAGCA$
>chr5:121061210-121061338
GCTAGTTTCGGAGCCACAAAACCGGAGGGGATTAGAGGGGCTAAAATCCTCTCCTTATTTAAAATATTGAATAAGGAGGGGATTTTAGCCCCTCCAATCCCCTCTGGTATTTGGACGCCCAAACTAGC
>chr5:132306562-132306879
TGCACTTTCGGCCCTTAAACTTGTCTGGCTGTGTCAACCCGGTCCATAAACTTTTAAAGTATGTCTTTTAAGAACCTAATCTTTGTTCGGGTCTCTATTACGTCCTAACGGTTTGCTGTGGCGCACATGTCAATTGTCATGGTCTATTGGACCGCGTCCACCCTTCTCCCTCCTTCCTCGACGACTAACA$
>chr1:13506448-13506642
GAAACAATATCTTAGGGAGTGTTTGGTTTCTAGAGACTAATTTATAGTCCTTCCATTTTATTCCATTTTAGTTCTTAAATTACTAATTTCTGTATTTGACAGTTTAATTACTAAAATAGAATAAAATTGATGGATTAAAAATTAGTCCCTAGAAACAAAACACCGAGAACTGACAATAATCTAAGATACA$
>chr1:130644204-130644503
CAGTCTACAGGACTTGTCCGGTGCACCACCGGACAGCCAGGCGGGCCCACACGTCAGAGCTTCAACGGTCGGAACCCTACGACCTGGTGACGTGGCTGGCGCACCGGACTGTCCGGTGCACCATGCGACAGACAGCCTCCACCAAACAGCTAGTTTGGTGGAGGCTGTCTGTCGCATGGTGCACCGGACA$
>chr1:13560286-13560440
GCAGTTGAGGGGAATGTTGTCTGGTTGGAGACCTAACACCGCGAATTAATCATCCATGCCATGGAAGCAGCATATGCCCGCCTGCATCTATCCATGCATGATGGTGGAAGGTTTCGGACCAGGCTTCATTCCCCCCAACTCAACACTCAGCTGC
>chr5:160574908-160575134
GAGAGAGGGAGAGGGAGAAACTATGATATTTCATCAAGGCTAAAGGGAGAAGAGAGAGAGTACAGCCTTGTTGCCATGTACACACTCAGATATCAACAGTCTCTTATATTTGCTGTTACTACCTGTTATTAGTGTCCAGTGATGATGAAGGCTGTACCCTCTCTCTTCTTCTCTTAGCTTCTTGATGCAA$
>chr5:164651674-164651746
CTCCATTTCATTTTAGTTATCGCTGGATAGGGTAAAATTAAACTATCGAACGATAACTAAAAAGAAACGGAG
>chr5:182040454-182040600
AGTGAGAGCTGGATGCAGAGGTTTATCGACCGATTCCCGCGTTGTTTGCGAGCTGCTGCAACAATAAAACGCTTGAACCACCACTCCCGGCCGGACCAGCACAAGGCAAGGAAGAGATCGATAAACCACTGCGCCCAGACACCACC

file2.fa

>seq1
CTAACATCAAATTCTATCCATCAGATCCAGCATCTACGGCTGAGTGCAATTCAACCTGAGGATGGCGCCCTATTC
>seq2
CTCCAATCCCCTCTGGTATTTGGACGCCCAAAC
>seq3
TGCACTTTCGGCCCTTAAACTTGTCTGGC
>seq4
GAAACAATATCTTAGGGAGTGTTTGGTTTCTAGAGACTAATTTATAGTCCTTCCATTTTATTC
>seq5
CACGTCAGAGCTTCAACGGTCGGAACCCTACGACCTGGTGACGTGGCTGGCGCACC
>seq6
AGACCTAACACCGCGAATTAATCATCCATGCCATGGAAGCAGCATATGC
>seq7
ACAGCCTTGTTGCCATGTACACACTCAGATATCAACAGTCTCTTATATTTGCTGTTACTACCTGTTA
>seq9
AGCTGATAGCTA

output file

>chr5:111261537-111261740
AGAGTGCTAACATGCTAACATCAAATTCTATCCATCAGATCCAGCATCTACGGCTGAGTGCAATTCAACCTGAGGATGGCGCCCTATTCCGTGAAACCGTAGAGATTCTTTAGAAAATAGTCTAATAGAGGAGGTGGGATTGTACTTAGCCGTAGATGGTGGATCCGATGGTCAGGATTTGATGTTAGCA$
>chr5:121061210-121061338
GCTAGTTTCGGAGCCACAAAACCGGAGGGGATTAGAGGGGCTAAAATCCTCTCCTTATTTAAAATATTGAATAAGGAGGGGATTTTAGCCCCTCCAATCCCCTCTGGTATTTGGACGCCCAAACTAGC
>chr5:132306562-132306879
TGCACTTTCGGCCCTTAAACTTGTCTGGCTGTGTCAACCCGGTCCATAAACTTTTAAAGTATGTCTTTTAAGAACCTAATCTTTGTTCGGGTCTCTATTACGTCCTAACGGTTTGCTGTGGCGCACATGTCAATTGTCATGGTCTATTGGACCGCGTCCACCCTTCTCCCTCCTTCCTCGACGACTAACA$
>chr1:13506448-13506642
GAAACAATATCTTAGGGAGTGTTTGGTTTCTAGAGACTAATTTATAGTCCTTCCATTTTATTCCATTTTAGTTCTTAAATTACTAATTTCTGTATTTGACAGTTTAATTACTAAAATAGAATAAAATTGATGGATTAAAAATTAGTCCCTAGAAACAAAACACCGAGAACTGACAATAATCTAAGATACA$
>chr1:130644204-130644503
CAGTCTACAGGACTTGTCCGGTGCACCACCGGACAGCCAGGCGGGCCCACACGTCAGAGCTTCAACGGTCGGAACCCTACGACCTGGTGACGTGGCTGGCGCACCGGACTGTCCGGTGCACCATGCGACAGACAGCCTCCACCAAACAGCTAGTTTGGTGGAGGCTGTCTGTCGCATGGTGCACCGGACA$
>chr1:13560286-13560440
GCAGTTGAGGGGAATGTTGTCTGGTTGGAGACCTAACACCGCGAATTAATCATCCATGCCATGGAAGCAGCATATGCCCGCCTGCATCTATCCATGCATGATGGTGGAAGGTTTCGGACCAGGCTTCATTCCCCCCAACTCAACACTCAGCTGC
>chr5:160574908-160575134
GAGAGAGGGAGAGGGAGAAACTATGATATTTCATCAAGGCTAAAGGGAGAAGAGAGAGAGTACAGCCTTGTTGCCATGTACACACTCAGATATCAACAGTCTCTTATATTTGCTGTTACTACCTGTTATTAGTGTCCAGTGATGATGAAGGCTGTACCCTCTCTCTTCTTCTCTTAGCTTCTTGATGCAA$
sequence • 3.8k views
ADD COMMENTlink modified 4.7 years ago by gangireddy160 • written 4.7 years ago by Chris30

what do you mean by partial match? If the sequence from 2 files match, print the one from file 1 along with header?

ADD REPLYlink written 4.7 years ago by venu6.8k

yes exactly. So in the example I give e.g. seq1 is part of the first fasta sequence of file one. So I would like to have printed that fasta sequence from file1 in output

ADD REPLYlink written 4.7 years ago by Chris30

Can you elaborate on your requirement of partial matching; from your example, it seems like you want some kind of partial substring matching (a substring that partially matches the longer superstring will report true on a match). Likely, you won't find a one-liner for this.

In order to compare sequences of differing lengths with tolerance for mismatches, you'll have to perform some sort of alignment or string comparison. We can suggest good algorithms for this, but it would be helpful to know what level of tolerance in partial matching that you desire (fixed # of mismatches, some other distance metric, etc.).

ADD REPLYlink written 4.7 years ago by Steven Lakin1.5k

I would say 1 or 2 mismatches in the string

ADD REPLYlink written 4.7 years ago by Chris30
2
gravatar for gangireddy
4.7 years ago by
gangireddy160
gangireddy160 wrote:

I would suggest blasting file2 against file1 and extract sequences sequences with hits from file1. But, it won't be simple one liner.

 makeblastdb -in file1.fasta -parse_seqids -dbtype nucl && blastn -query file2.fasta -db file1.fasta -outfmt 6 |awk '{print $2}' | sort | uniq | xargs samtools faidx file1.fasta > output.fasta
ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by gangireddy160

I get the following error with this: xargs: samtools: No such file or directory. How can I overcome this? Thanks.

ADD REPLYlink written 4.3 years ago by roblogan630

samtools is not installed. try downloading it from here and install.

ADD REPLYlink written 4.2 years ago by gangireddy160
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1009 users visited in the last hour
_