I have a fasta file with 44,396,570 sequences and I want to find them in my genome. I am using bowtie2-build to build index files for my fasta file, but the job dies with the message "Killed". I guess I need more memory; I have 32 GB on my machine. I tried splitting my file into 2 and 4 parts, but that didn't solve the problem. I also made sure that bowtie2-build works on a single sequence from my fasta file and everything was fine, but bowtie2 alignment of my whole genome against this single sequence took a huge amount of time, and I am worried that splitting my fasta file into more parts will take years to align.
Maybe someone has already solved this kind of problem, where you want to find a lot of sequences in your genome?
I would be grateful if you could help me.
The normal procedure is to use your sequences as queries and search against the genome. Is there a reason you want to do it the other way round? Are your sequences short reads or long(er)? How large is the "genome" in this case?
Many NGS aligners are not designed to search with very long sequences (say > 1 kb), with a few exceptions, e.g. blasr from PacBio. BBMap contains mapPacBio.sh, which can map long reads (in 6 kb chunks), and bwa also works with PacBio/Nanopore reads.
I have experimental WGS data and I am interested in the presence or absence of my sequences in the genomes of these exact patients, not in the reference genome. My sequences are longer than reads. They are of different lengths, but roughly gene length, not shorter.
Have you thought about using blat/blast/lastz? Depending on how similar you expect your sequences to be to the "genome", those programs cover roughly "near identical" to "more diverse", in that order from left to right.
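If blat is the choice for the near-identical case, the basic invocation is just database, query, output. A minimal sketch (the filenames here are placeholders, and -minIdentity is optional):

```shell
# database.fa: the sequences being searched in (fasta)
# query.fa:    the sequences you want to find
# hits.psl:    tab-separated PSL alignments
blat database.fa query.fa -minIdentity=95 hits.psl
```

Raising -minIdentity cuts down on spurious partial hits when you expect exact or near-exact matches.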
Didn't think about blat, thank you! I expect the sequences to be identical, so I think it should work. May I also ask you one more question? My genomes are in fastq format; should I align them to the human genome, or somehow convert them to fasta to use blat? Thank you in advance!
I now understand what you are trying to do.
So your "genomes" are actually short reads, which you want to align against some other, longer sequences. You could convert them to fasta and then use one of the programs above, but that may still be a long endeavor since you have many (millions?) of those reads.
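For the conversion itself, no special tool is needed; a plain awk one-liner does it. A minimal sketch with a toy read (in practice you would stream your 200 GB files through it, e.g. via zcat):

```shell
# Toy fastq record (stand-in for the real reads file)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > reads.fq

# fastq records are 4 lines each: header, sequence, '+', qualities.
# Keep line 1 (with @ replaced by >) and line 2; drop the rest.
awk 'NR % 4 == 1 {sub(/^@/, ">"); print} NR % 4 == 2 {print}' reads.fq > reads.fa
```

That said, bbmap.sh reads fastq directly, so the conversion is only needed for blat/blast-style tools.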
What is the total size of your long reads (in terms of file size / mega- to gigabases of sequence)? You could use bbmap.sh from the BBMap suite to align your genome reads (in the original fastq format) to the other dataset. This may be the best option in terms of getting it done rapidly.
The size of the "genomes" is ~200 GB for each of the 2 paired-end files. The size of the fasta file with the sequences I want to find is 24 GB. Is that ok for BBMap?
Now things are coming into focus :)
A 24 GB "reference" is too much for most aligners with the amount of RAM you have available; with plenty of RAM this would be no problem.
How about "divide and conquer"? How many sequences are there in the fasta file? You could divide them into 8 x 3 GB chunks (each about the size of the human genome) and then do 8 independent searches.
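Splitting can be done with a short awk round-robin so that no record is ever cut in half. A sketch on a toy file (here n=2 and three tiny records; for the real 24 GB file use n=8 and the real filename):

```shell
# Toy multi-fasta with three records (stand-in for the big file)
printf '>s1\nAAAA\n>s2\nCCCC\n>s3\nGGGG\n' > sequences.fa

# Every '>' header switches the output file for that record's lines,
# so sequence i (0-based) lands in chunk (i mod n) + 1.
awk -v n=2 '/^>/ {f = "chunk" ((i++ % n) + 1) ".fa"} {print > f}' sequences.fa
```

This balances sequence counts rather than bytes, but with 44 million roughly gene-length sequences the chunks should come out close to equal in size anyway.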
Haha :) There are 44,396,570 sequences in it. I tried to split it into 4 parts, failed, got upset, and came here :) I can definitely try to split the fasta file into 8 parts. Do you think I should use BBMap or Bowtie?
Thank you very much for your help! You are my hero :)
Before we celebrate, try one 3 GB chunk first. I am partial to bbmap, but either aligner should work at that size :) Keep an eye on free disk space and add more if you can, just to be proactive. With bbmap you can directly create bam files as long as you have samtools available in $PATH.
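A per-chunk bbmap run could look roughly like this (a sketch only; the chunk and read filenames are placeholders, and -Xmx should be set somewhat below your physical RAM):

```shell
# Align the paired fastq reads against one 3 GB chunk of the fasta.
# out=*.bam writes BAM directly, provided samtools is in $PATH.
bbmap.sh -Xmx28g ref=chunk1.fa in=reads_R1.fq.gz in2=reads_R2.fq.gz \
    out=mapped_chunk1.bam
```

Repeating this for each of the 8 chunks gives 8 independent jobs that each fit in 32 GB of RAM.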
At 200 GB each (R1/R2) you must have a boatload of reads in there (billions?).
Ok, I'll do that now! Anyway, thank you very much!
Yes. Originally I had .bam files provided by the people who did the experiment. Each file was ~150 GB, and I converted each .bam file into two .fastq files with bedtools bam2fastq. So now I have two 200 GB fastq files for each genome.
Tada! I started with bowtie and bowtie-build part is completed without any issues, now I'm waiting for the alignment! Thank you very much!! You definitely are my hero! :)
Glad to hear that appears to be working.
BTW: You don't need to up vote each one of my posts. Only one (or some) that helped provide a solution/direction would be sufficient.
Wow, I didn't know about this feature, thank you for the information. How do people usually solve these issues then? I thought the standard procedure is to align WGS data to whole chromosomes. Maybe I could concatenate all my sequences into one long sequence, like genes on a chromosome, align everything against it, and then check whether reads mapped to each part of the sequence? Or is that crazy? :) Sorry for the stupid questions, I am doing this for the first time.
Please see the edited post above since I made some changes there.
What is the size distribution of your reads and are the "genomes" you are searching against in form of chromosomes or contigs or something else?
My genomes are two fastq files with paired-end reads for each genome. The reads are 100 bp. I am searching against sequences that are actually specific combinations of exons of human genes, which is why their length is similar to that of genes. Each combination is a different sequence, and there are thousands of these combinations.
I am surprised that you were not able to build the index with bowtie2-build if your sequences are a "transcriptome". Did the job get killed right away, or did it run for some time before it died? I wonder if you ran out of disk space rather than memory (32 GB should be enough, if all of it is available to the OS).
I have 200 GB on my disk and can add more. It died while doing something with the "buckets"; unfortunately I didn't save the output and don't remember exactly when it died. It ran for about half an hour before it died.