Entering edit mode
7.0 years ago
Ron ★ 1.1k
I have build the kraken human database using the following steps.I think there is some error in this,as I am not getting the classification of human reads ,but only bacterial and viral.Below are the steps that I undertook to build the customized database.Creating the standard database worked as I am getting the output for bacterial and viral reads.
The link to the manual is here : http://ccb.jhu.edu/software/kraken/MANUAL.html
**Step1**: kraken-build --download-library human --threads 4 --db/software/kraken-0.10.5-beta **Step 2**: export PATH=/software/jellyfish-1.1.11/bin:$PATH cd /software/kraken-0.10.5-beta/library/Human for file in *chr*.fa do /software/kraken-0.10.5-beta/kraken-build --add-to-library $file --db /software/kraken-0.10.5-beta done **Step3**: /software/kraken-0.10.5-beta/kraken-build --build --threads 4 --db /software/kraken-0.10.5-beta
After building the human data for kraken,I run kraken to get the composition ,but still get the human reads as unclassified:
/software/kraken-0.10.5-beta/kraken --preload --threads 16 --db software/kraken-0.10.5-beta --fastq-input --gzip- compressed --paired $inputdir/Sample_PDX75_81_no_EC_R1.fastq.gz $inputdir/Sample_PDX75_81_no_EC_R2.fastq.gz
The output log file: 189620 sequences classified (0.30%) 62326599 sequences unclassified (99.70%)
Kraken is taxonomic classification system. Are you looking to separate human reads in your dataset? Kraken is not the right tool for that.
Other tools may be more appropriate: Tool to separate human and mouse rna seq reads
No.I am looking to get the percentage of reads that belongs to each taxonomy. I know how to separate the human and mouse reads.
I must be missing something obvious.
You are looking to get that for bacterial/viral/metagenomic data, correct? I am not sure why you are using the human database with kraken.
Yes,thats correct.But also I want how much percentage of reads belongs to human and mouse. Do you think of any other way to get that composition ? for e.g i would like to see the result of classification as Bacteria .01%,Viruses .5%, Human 96%,Mouse 3%
You could do that using BBSplit from BBMap (A: Tool to separate human and mouse ran seq reads , click this link to go to the BBsplit part of the thread). This is the reason I had linked that thread above.
As for how many are human/mouse you can get that from BBSplit.Take the reads that don't map to human/mouse data from BBSplit and then run them through kraken with their bacterial/viral database to get the other % you need.
Which option in bbmap can be used to output the unmapped reads.I am using Seal instead of bbsplit since it is fast.
@Brian lists these options for
Can we get the paired end reads separated for human and mouse using seal ?Currently I am using seal to separate just human and mouse and then using reformat script to create paired end reads for each.
This creates one human fastq and one mouse Sample_PDX68_72_74_pool_EC__hg19.fq.gz, Sample_PDX68_72_74_pool_EC__mm10.fq.gz
I will tag Brian Bushnell
I think you are doing it the right way.
@Brian may be along later. If there is an undocumented way then that would be our chance to hear about it.
Sorry, that's the only way to do it right now. Seal does support the "#" symbol in input, so you can say "in=r#.fq" instead of "in1=r1.fq in2=r2.fq". But that is not currently supported for "pattern" output.
Many programs don't require paired reads to be in 2 files, though, so depending on what you are doing, you may not need to de-interleave them.
I didn't get the unmapped reads using the options "outu" and "outu2 .Below is my script :
output files :
You need to specify the two reference sequences as follows (separated by a comma)
Also /hg19.fa /mm10.fa paths are likely incorrect (unless you really have your files in the root directory). So provide the right paths.
My guess is based on the original post you have a very small # of sequences that are non-human, non-mouse (if I am reading the kraken result right): 189620
Thanks.I know that.It was a copy paste error.I have the output files as mentioned above.
Its just I dint get the unmapped files.
Were the files created but empty, or not created at all? And how many unmapped reads were there?
Not created at all.