massive pairwise sequence alignment
6.2 years ago
mkh ▴ 60

Hi all, I have two sets of data (let's say A and B), each containing 1,000,000 sequences. I would like to see how many sequences from one file (e.g. A) have at least a 90% match to the sequences in the other file (e.g. B). I have already written a script that takes one sequence from A, does a pairwise alignment against every sequence in B, and stores the score for each match (using the Biopython module). My problem is that the script is very slow: it still had not finished after running for 4 days on 21 cores. Do you have any ideas on how I can improve the script? Or is there a better way to do this massive pairwise alignment?
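
For context, here is a minimal sketch of the kind of brute-force approach described above, assuming Biopython's PairwiseAligner and placeholder file names A.fa and B.fa (this is not the original script):

```python
# Sketch of the brute-force approach (not the original script).
# A.fa and B.fa are placeholder file names.
from Bio import SeqIO
from Bio.Align import PairwiseAligner

aligner = PairwiseAligner()
aligner.mode = "local"           # local (Smith-Waterman-style) alignment
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

b_records = list(SeqIO.parse("B.fa", "fasta"))

best_hits = {}
for a in SeqIO.parse("A.fa", "fasta"):
    # One full pass over B per query sequence: ~10^12 alignments for
    # 10^6 x 10^6 sequences, which is why this never finishes in days.
    best_hits[a.id] = max((aligner.score(a.seq, b.seq), b.id) for b in b_records)
```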

Thanks

Tags: alignment, sequence

Instead of re-inventing the wheel, use one of the well-known MSA programs (Clustal, MUSCLE, T-Coffee, MAFFT). How long are these sequences? An MSA of a million sequences may be tough to view/decipher. Can you de-duplicate your data before doing the MSA?


Those programs are designed for multiple sequence alignment, right? But I don't want to do an MSA. The sequence sizes range between 100 and 300 bp.


My bad. Perhaps using an NGS aligner would still be much faster. I suggest looking at bbmap.sh from the BBMap suite. Let me try a small test.


Actually, these sequences are not from RNA-seq. I am trying to find the unique sequences in a dataset of a million sequences and then create a distance matrix based on them.


That is OK. NGS aligners are inherently fast compared to other alignment programs, as long as they can handle FASTA-format sequences (which bbmap can).
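
For illustration, a rough sketch of the kind of bbmap.sh run being suggested, mapping A against B with a 90% identity cutoff (file names are placeholders; shown here as a Python wrapper):

```python
# Sketch only: map the A sequences against B with bbmap.sh at >=90% identity.
# File names are placeholders.
import subprocess

subprocess.run(
    [
        "bbmap.sh",
        "ref=B.fa",        # build the reference index from B
        "in=A.fa",         # query sequences
        "out=A_vs_B.sam",  # alignments in SAM format
        "minid=0.90",      # only report alignments with >=90% identity
        "nodisk",          # keep the index in memory rather than writing it to disk
    ],
    check=True,
)
```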

"I am trying to find the unique sequences in a dataset of a million sequences"

This part is very easily done with clumpify.sh, and it will be insanely fast. You can allow for errors (the subs= option). See this thread for usage information: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. You can use Clumpify with FASTA sequences; wherever the commands say fq.gz, you can replace those with .fa for your files.
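
As a rough illustration (adapted from general usage, not copied from the linked thread), deduplication while allowing a couple of mismatches could look like this, with placeholder file names:

```python
# Sketch only: remove duplicate and near-duplicate sequences with clumpify.sh.
# File names are placeholders.
import subprocess

subprocess.run(
    [
        "clumpify.sh",
        "in=A.fa",
        "out=A_dedupe.fa",
        "dedupe",    # remove duplicate sequences
        "subs=2",    # allow up to 2 substitutions when calling duplicates
    ],
    check=True,
)
```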

"then create a distance matrix based on that"

That part I am not sure how to do right away.


I'm also puzzled by the 'distance matrix' part. Is it correct to assume you mean a distance matrix in the phylogenetic sense?

You want to create it based on the unique sequences in your dataset? And what do you consider 'unique': not otherwise present in the dataset, or less than 90% identical/similar?

6.2 years ago
mkh ▴ 60

I am answering my own question, since it may help people with a similar problem. Fast, massive alignment is possible using Bowtie2. When you do the alignment with Bowtie2, it will tell you how many sequences are repeated, and you can remove the repetitive sequences by extracting them from the SAM file (the Bowtie2 output).
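
A rough sketch of that workflow (index B, align A against it, then pull the aligned sequence names out of the SAM file); all file and index names are placeholders:

```python
# Sketch only: align A against B with Bowtie2 and collect the A sequences that aligned.
# File and index names are placeholders.
import subprocess

subprocess.run(["bowtie2-build", "B.fa", "B_index"], check=True)
subprocess.run(
    [
        "bowtie2",
        "-x", "B_index",     # index built from B
        "-f",                # queries are FASTA, not FASTQ
        "-U", "A.fa",        # unpaired query sequences
        "-S", "A_vs_B.sam",  # SAM output
        "--no-unal",         # do not write unaligned reads to the SAM file
        "-p", "8",           # threads
    ],
    check=True,
)

# Names of A sequences that aligned somewhere in B.
aligned = set()
with open("A_vs_B.sam") as sam:
    for line in sam:
        if not line.startswith("@"):              # skip SAM header lines
            aligned.add(line.split("\t", 1)[0])   # QNAME is the first column
```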

6.2 years ago

"massive pairwise alignment" ? makes me think of blast ! :-)

I thus suggest doing a BLAST search and parsing the resulting output file.
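
For example, a sketch along those lines with blastn: keep only hits at >=90% identity, write tabular output, and count how many A sequences have at least one hit in B (file and database names are placeholders):

```python
# Sketch only: BLAST A against B at >=90% identity and parse the tabular output.
# File and database names are placeholders.
import subprocess

subprocess.run(["makeblastdb", "-in", "B.fa", "-dbtype", "nucl", "-out", "B_db"], check=True)
subprocess.run(
    [
        "blastn",
        "-query", "A.fa",
        "-db", "B_db",
        "-perc_identity", "90",  # report only hits with >=90% identity
        "-outfmt", "6",          # tabular output: qseqid sseqid pident ...
        "-num_threads", "8",
        "-out", "A_vs_B.tsv",
    ],
    check=True,
)

# Count the distinct A sequences that have at least one qualifying hit in B.
with open("A_vs_B.tsv") as hits:
    queries_with_hit = {line.split("\t", 1)[0] for line in hits}
print(len(queries_with_hit), "sequences in A have a >=90% identity hit in B")
```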


Thanks. I have already tried BLAST, and the output is different from a pairwise alignment!


Yes, it might be; you could benefit from tweaking the BLAST parameters a little (e.g. do an ungapped BLAST).

I do wonder, however, how different the results are (or can be) between the two approaches, even though I'm well aware that BLAST is a heuristic local-alignment search rather than a true pairwise aligner. It will nonetheless be able to answer your question if the goal is to look for >90% identity.

Are we talking DNA or protein here btw?

6.2 years ago
Joe 21k

Something else you could do is use the CD-HIT suite to cluster the sequences at a particular identity threshold.

I don't know how it will handle that many sequences specifically, but it's usually quite fast, I think.
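
For what it's worth, a rough sketch of a cd-hit-est run clustering at 90% identity (nucleotide sequences; file names are placeholders):

```python
# Sketch only: cluster nucleotide sequences at 90% identity with cd-hit-est.
# File names are placeholders.
import subprocess

subprocess.run(
    [
        "cd-hit-est",
        "-i", "A.fa",         # input sequences
        "-o", "A_clustered",  # representatives; clusters listed in A_clustered.clstr
        "-c", "0.9",          # 90% sequence identity threshold
        "-n", "8",            # word size; 8 is appropriate for -c 0.90
        "-T", "8",            # threads
        "-M", "16000",        # memory limit in MB
    ],
    check=True,
)
```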

