How to get reads/k-mers that are shared in all samples from a sequenced pool of samples?
1
0
Entering edit mode
6.9 years ago

Hello!

I have a pool of samples from a RAD-Seq library and I would like to know how can I exctract the reads (rad-tags) that are present/shared in ALL my samples. I have also annotated the reads and will like to do the same knowing to which genes I have reads mapping to in all my samples. This is because I have very different number of reads mapping to my reference in ach sample, from 1000 reads to 500.000 reads and would like to be able to compare all my samples without being biased in terms of to what regions of the genome the reads map to.

Thanks in advance for any suggestion!

Alba

sequence • 2.2k views
ADD COMMENT
1
Entering edit mode
6.9 years ago

You can capture the kmers shared in all samples like this, using the BBMap package:

kcompress.sh in=sample1.fq out=kmers1.fa
kcompress.sh in=sample2.fq out=kmers2.fa
kcompress.sh in=sample3.fq out=kmers3.fa

kmercountexact.sh in=kmers1.fa,kmers2.fa,kmers3.fa out=shared.fa mincount=3

You can then capture reads from a sample like this:

bbduk.sh in=sample1.fq outm=sample1_shared.fq ref=shared.fa k=31 mm=f
ADD COMMENT
0
Entering edit mode

Hi Brian,

Thanks for your fast answer, I have tried BBmap and modify it a bit so I can imput all the invidual files I have. The first part works fine with:

for i in *.fq
do ./kcompress.sh in=$i out="${i%.fq}"_kmers.fa
done

I have problems with

kmercountexact.sh in=kmers1.fa,kmers2.fa,kmers3.fa out=shared.fa mincount=3

I have tried the following but does not work:

       for i in *.fa
       do ./kmercountexact.sh in=$i out=shared.fa mincount=3
       done

it gives me this error:

Exception in thread "main" java.lang.RuntimeException: Output file shared.fa already exists, and overwrite=false
    at jgi.KmerCountExact.<init>(KmerCountExact.java:223)
    at jgi.KmerCountExact.main(KmerCountExact.java:53)

How can I solved it? Moreover, the shared kmers are 100% identical or is there an option that I can be more flexible, so I can get high similar kmers but with enough SNPs to see differences and relationship between my samples?

Best,

Alba

ADD REPLY
1
Entering edit mode

By default if a file exists (if you had done a test run then it probably does) then BBMap tools will not overwrite it unless you explicitly add overwrite=t option to your command lines (kmercountexact.sh, in this case).

Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized.

ADD REPLY
0
Entering edit mode

Thanks for the advice of keeping the threads, I am new here and did not know.

About overwriting, when I execute the command, it works for the first sample analysed, then it creates the file shared.fa and when it goes to analyse the second sample sure it finds already a shared.fa just created before. The point is, that what the program (as i understood) is supposed to do is going over all the files specified ($i) and find common kmers and output them all together in the single file shared.fa, not try to create a new shared.fa file for every sample analysed. I think the way I am writing the loop is the problem.

Thanks for the repply and any new suggestion

ADD REPLY
0
Entering edit mode

As @Brian showed above you need to run kcompress.sh on each sequence file to get kmers in separate files and then run kmercountexact.sh only once on pool of kmer files to generate the shared.fa file.

I think the error showed up because you must have a pre-existing shared.fa file. If you change the output file name to something else e.g. kmercountexact.sh in=kmers1.fa,kmers2.fa,kmers3.fa out=shared_new.fa mincount=3, it should work.

ADD REPLY
0
Entering edit mode

The loop still does not work, and it is not because I have a preexisting file called shared.fa. I have more than 3000 samples and the only way I managed it to work is either imputing the kmers.fa files one after another separated by coma as in the example above (trying with few samples, I don't know how to concatenate the 3000 names of my samples one after another separated by comas --> in=sample1,sample2.....sample3000). I have also concatenated all my kmer files in one and imput that, it works but it outputs the kmers in a different order that imputing the kmer files separated by commas.

ADD REPLY
1
Entering edit mode

The second step cannot be run in a loop; you need to run a single command with all files comma-delimited, and "mincount=X" needs to be set to the total number of files.

If you run "ls kmers*.fa > ls.txt", then using a text editor, you can replace newlines with commas to get a comma-delimited list of all the files.

ADD REPLY
0
Entering edit mode

Thanks for the clarification. The idea of using the text editor worked fine!

ADD REPLY

Login before adding your answer.

Traffic: 1716 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6