Hello!
I have a pool of samples from a RAD-Seq library and I would like to know how can I exctract the reads (rad-tags) that are present/shared in ALL my samples. I have also annotated the reads and will like to do the same knowing to which genes I have reads mapping to in all my samples. This is because I have very different number of reads mapping to my reference in ach sample, from 1000 reads to 500.000 reads and would like to be able to compare all my samples without being biased in terms of to what regions of the genome the reads map to.
Thanks in advance for any suggestion!
Alba
Hi Brian,
Thanks for your fast answer, I have tried BBmap and modify it a bit so I can imput all the invidual files I have. The first part works fine with:
I have problems with
I have tried the following but does not work:
it gives me this error:
How can I solved it? Moreover, the shared kmers are 100% identical or is there an option that I can be more flexible, so I can get high similar kmers but with enough SNPs to see differences and relationship between my samples?
Best,
Alba
By default if a file exists (if you had done a test run then it probably does) then BBMap tools will not overwrite it unless you explicitly add
overwrite=t
option to your command lines (kmercountexact.sh
, in this case).Please use
ADD REPLY/ADD COMMENT
when responding to existing posts to keep threads logically organized.Thanks for the advice of keeping the threads, I am new here and did not know.
About overwriting, when I execute the command, it works for the first sample analysed, then it creates the file shared.fa and when it goes to analyse the second sample sure it finds already a shared.fa just created before. The point is, that what the program (as i understood) is supposed to do is going over all the files specified ($i) and find common kmers and output them all together in the single file shared.fa, not try to create a new shared.fa file for every sample analysed. I think the way I am writing the loop is the problem.
Thanks for the repply and any new suggestion
As @Brian showed above you need to run
kcompress.sh
on each sequence file to get kmers in separate files and then runkmercountexact.sh
only once on pool of kmer files to generate theshared.fa
file.I think the error showed up because you must have a pre-existing
shared.fa
file. If you change the output file name to something else e.g.kmercountexact.sh in=kmers1.fa,kmers2.fa,kmers3.fa out=shared_new.fa mincount=3
, it should work.The loop still does not work, and it is not because I have a preexisting file called shared.fa. I have more than 3000 samples and the only way I managed it to work is either imputing the kmers.fa files one after another separated by coma as in the example above (trying with few samples, I don't know how to concatenate the 3000 names of my samples one after another separated by comas --> in=sample1,sample2.....sample3000). I have also concatenated all my kmer files in one and imput that, it works but it outputs the kmers in a different order that imputing the kmer files separated by commas.
The second step cannot be run in a loop; you need to run a single command with all files comma-delimited, and "mincount=X" needs to be set to the total number of files.
If you run "ls kmers*.fa > ls.txt", then using a text editor, you can replace newlines with commas to get a comma-delimited list of all the files.
Thanks for the clarification. The idea of using the text editor worked fine!