Question: How to get reads/k-mers that are shared in all samples from a sequenced pool of samples?
0
gravatar for Alba Gonzalez H
21 months ago by
MPI Tubingen
Alba Gonzalez H0 wrote:

Hello!

I have a pool of samples from a RAD-Seq library and I would like to know how can I exctract the reads (rad-tags) that are present/shared in ALL my samples. I have also annotated the reads and will like to do the same knowing to which genes I have reads mapping to in all my samples. This is because I have very different number of reads mapping to my reference in ach sample, from 1000 reads to 500.000 reads and would like to be able to compare all my samples without being biased in terms of to what regions of the genome the reads map to.

Thanks in advance for any suggestion!

Alba

sequence • 752 views
ADD COMMENTlink modified 21 months ago • written 21 months ago by Alba Gonzalez H0
1
gravatar for Brian Bushnell
21 months ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

You can capture the kmers shared in all samples like this, using the BBMap package:

kcompress.sh in=sample1.fq out=kmers1.fa
kcompress.sh in=sample2.fq out=kmers2.fa
kcompress.sh in=sample3.fq out=kmers3.fa

kmercountexact.sh in=kmers1.fa,kmers2.fa,kmers3.fa out=shared.fa mincount=3

You can then capture reads from a sample like this:

bbduk.sh in=sample1.fq outm=sample1_shared.fq ref=shared.fa k=31 mm=f
ADD COMMENTlink written 21 months ago by Brian Bushnell16k
0
gravatar for Alba Gonzalez H
21 months ago by
MPI Tubingen
Alba Gonzalez H0 wrote:

Hi Brian,

Thanks for your fast answer, I have tried BBmap and modify it a bit so I can imput all the invidual files I have. The first part works fine with:

for i in *.fq
do ./kcompress.sh in=$i out="${i%.fq}"_kmers.fa
done

I have problems with

kmercountexact.sh in=kmers1.fa,kmers2.fa,kmers3.fa out=shared.fa mincount=3

I have tried the following but does not work:

       for i in *.fa
       do ./kmercountexact.sh in=$i out=shared.fa mincount=3
       done

it gives me this error:

Exception in thread "main" java.lang.RuntimeException: Output file shared.fa already exists, and overwrite=false
    at jgi.KmerCountExact.<init>(KmerCountExact.java:223)
    at jgi.KmerCountExact.main(KmerCountExact.java:53)

How can I solved it? Moreover, the shared kmers are 100% identical or is there an option that I can be more flexible, so I can get high similar kmers but with enough SNPs to see differences and relationship between my samples?

Best,

Alba

ADD COMMENTlink modified 21 months ago • written 21 months ago by Alba Gonzalez H0
1

By default if a file exists (if you had done a test run then it probably does) then BBMap tools will not overwrite it unless you explicitly add overwrite=t option to your command lines (kmercountexact.sh, in this case).

Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized.

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax63k

Thanks for the advice of keeping the threads, I am new here and did not know.

About overwriting, when I execute the command, it works for the first sample analysed, then it creates the file shared.fa and when it goes to analyse the second sample sure it finds already a shared.fa just created before. The point is, that what the program (as i understood) is supposed to do is going over all the files specified ($i) and find common kmers and output them all together in the single file shared.fa, not try to create a new shared.fa file for every sample analysed. I think the way I am writing the loop is the problem.

Thanks for the repply and any new suggestion

ADD REPLYlink written 21 months ago by Alba Gonzalez H0

As @Brian showed above you need to run kcompress.sh on each sequence file to get kmers in separate files and then run kmercountexact.sh only once on pool of kmer files to generate the shared.fa file.

I think the error showed up because you must have a pre-existing shared.fa file. If you change the output file name to something else e.g. kmercountexact.sh in=kmers1.fa,kmers2.fa,kmers3.fa out=shared_new.fa mincount=3, it should work.

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax63k

The loop still does not work, and it is not because I have a preexisting file called shared.fa. I have more than 3000 samples and the only way I managed it to work is either imputing the kmers.fa files one after another separated by coma as in the example above (trying with few samples, I don't know how to concatenate the 3000 names of my samples one after another separated by comas --> in=sample1,sample2.....sample3000). I have also concatenated all my kmer files in one and imput that, it works but it outputs the kmers in a different order that imputing the kmer files separated by commas.

ADD REPLYlink modified 21 months ago • written 21 months ago by Alba Gonzalez H0
1

The second step cannot be run in a loop; you need to run a single command with all files comma-delimited, and "mincount=X" needs to be set to the total number of files.

If you run "ls kmers*.fa > ls.txt", then using a text editor, you can replace newlines with commas to get a comma-delimited list of all the files.

ADD REPLYlink written 21 months ago by Brian Bushnell16k

Thanks for the clarification. The idea of using the text editor worked fine!

ADD REPLYlink written 20 months ago by Alba Gonzalez H0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1888 users visited in the last hour