Working with human database in Kraken
0
0
Entering edit mode
5.0 years ago
Ron ★ 1.1k

Hi all,

I have build the kraken human database using the following steps.I think there is some error in this,as I am not getting the classification of human reads ,but only bacterial and viral.Below are the steps that I undertook to build the customized database.Creating the standard database worked as I am getting the output for bacterial and viral reads.

The link to the manual is here : http://ccb.jhu.edu/software/kraken/MANUAL.html

 **Step1**: kraken-build --download-library human  --threads 4 --db/software/kraken-0.10.5-beta

**Step 2**:

export PATH=/software/jellyfish-1.1.11/bin:$PATH cd /software/kraken-0.10.5-beta/library/Human for file in *chr*.fa do /software/kraken-0.10.5-beta/kraken-build --add-to-library$file --db /software/kraken-0.10.5-beta

done

**Step3**:

/software/kraken-0.10.5-beta/kraken-build --build --threads 4 --db /software/kraken-0.10.5-beta


After building the human data for kraken,I run kraken to get the composition ,but still get the human reads as unclassified:

 /software/kraken-0.10.5-beta/kraken  --preload --threads 16 --db software/kraken-0.10.5-beta --fastq-input --gzip-     compressed  --paired $inputdir/Sample_PDX75_81_no_EC_R1.fastq.gz$inputdir/Sample_PDX75_81_no_EC_R2.fastq.gz


$inputdir/kraken_out The output log file: 189620 sequences classified (0.30%) 62326599 sequences unclassified (99.70%)  Thanks, Ron RNA-Seq next-gen genome • 3.2k views ADD COMMENT 0 Entering edit mode Kraken is taxonomic classification system. Are you looking to separate human reads in your dataset? Kraken is not the right tool for that. Other tools may be more appropriate: Tool to separate human and mouse rna seq reads ADD REPLY 0 Entering edit mode No.I am looking to get the percentage of reads that belongs to each taxonomy. I know how to separate the human and mouse reads. ADD REPLY 0 Entering edit mode I must be missing something obvious. I am looking to get the percentage of reads that belongs to each taxonomy. You are looking to get that for bacterial/viral/metagenomic data, correct? I am not sure why you are using the human database with kraken. ADD REPLY 0 Entering edit mode Yes,thats correct.But also I want how much percentage of reads belongs to human and mouse. Do you think of any other way to get that composition ? for e.g i would like to see the result of classification as Bacteria .01%,Viruses .5%, Human 96%,Mouse 3% ADD REPLY 1 Entering edit mode You could do that using BBSplit from BBMap (A: Tool to separate human and mouse ran seq reads , click this link to go to the BBsplit part of the thread). This is the reason I had linked that thread above. As for how many are human/mouse you can get that from BBSplit.Take the reads that don't map to human/mouse data from BBSplit and then run them through kraken with their bacterial/viral database to get the other % you need. ADD REPLY 0 Entering edit mode Which option in bbmap can be used to output the unmapped reads.I am using Seal instead of bbsplit since it is fast. ADD REPLY 0 Entering edit mode @Brian lists these options for seal.sh. outu=<file> (outunmatched) Write reads here that do not contain kmers matching the database. outu2=<file> (outunmatched2) Use this to write 2nd read of pairs to a different file.  ADD REPLY 0 Entering edit mode Can we get the paired end reads separated for human and mouse using seal ?Currently I am using seal to separate just human and mouse and then using reformat script to create paired end reads for each.  bbmap/seal.sh in1=$TMPDIR/input/${input_file} in2=$TMPDIR/input/${input_file%R1*}R2.fastq.gz ref=/hg19.fa,/mm10.fa pattern=$TMPDIR/dest/${input_file%R1*}_%.fq.gz ambig=all k=27 threads=32 prealloc speed=8  This creates one human fastq and one mouse Sample_PDX68_72_74_pool_EC__hg19.fq.gz, Sample_PDX68_72_74_pool_EC__mm10.fq.gz ADD REPLY 1 Entering edit mode I will tag Brian Bushnell ADD REPLY 0 Entering edit mode I think you are doing it the right way. @Brian may be along later. If there is an undocumented way then that would be our chance to hear about it. ADD REPLY 0 Entering edit mode Sorry, that's the only way to do it right now. Seal does support the "#" symbol in input, so you can say "in=r#.fq" instead of "in1=r1.fq in2=r2.fq". But that is not currently supported for "pattern" output. Many programs don't require paired reads to be in 2 files, though, so depending on what you are doing, you may not need to de-interleave them. ADD REPLY 0 Entering edit mode Hi Brian, I didn't get the unmapped reads using the options "outu" and "outu2 .Below is my script :  software/bbmap/seal.sh in1=$TMPDIR/input/${input_file} in2=$TMPDIR/input/${input_file%R1*}R2.fastq.gz ref=/hg19.fa /mm10.fa pattern=$TMPDIR/dest/${input_file%R1*}_%.fq.gz ambig=all k=27 threads=32 prealloc speed=8 refnames=t outu=$TMPDIR/dest/${input_file%R1*}R1.fastq.gz outu2=$TMPDIR/dest/${input_file%R1*}R2.fastq.gz stats=$TMPDIR/dest/\${input_file%R1*}.contaminants


output files :

Sample_PDX68_72_74_pool_EC_.contaminants
Sample_PDX68_72_74_pool_EC__hg19.fq.gz
Sample_PDX68_72_74_pool_EC__mm10.fq.gz

0
Entering edit mode

You need to specify the two reference sequences as follows (separated by a comma)

ref=hg19.fa,mm10.fa


Also /hg19.fa /mm10.fa paths are likely incorrect (unless you really have your files in the root directory). So provide the right paths.

My guess is based on the original post you have a very small # of sequences that are non-human, non-mouse (if I am reading the kraken result right): 189620

0
Entering edit mode

Thanks.I know that.It was a copy paste error.I have the output files as mentioned above.

Its just I dint get the unmapped files.

0
Entering edit mode

Were the files created but empty, or not created at all? And how many unmapped reads were there?

0
Entering edit mode

Not created at all.

 Input is being processed as paired
Started output streams:    0.090 seconds.
Processing time:           128.779 seconds.

Time:              362.447 seconds.
Bases Processed:       2591m  7.15m bases/sec
sending incremental file list
dest/
dest/Sample_PDX68_72_74_pool_EC_.contaminants
dest/Sample_PDX68_72_74_pool_EC__hg19.fq.gz
dest/Sample_PDX68_72_74_pool_EC__mm10.fq.gz