Question: Working with human database in Kraken
0
gravatar for Ron
19 months ago by
Ron690
United States
Ron690 wrote:

Hi all,

I have build the kraken human database using the following steps.I think there is some error in this,as I am not getting the classification of human reads ,but only bacterial and viral.Below are the steps that I undertook to build the customized database.Creating the standard database worked as I am getting the output for bacterial and viral reads.

The link to the manual is here : http://ccb.jhu.edu/software/kraken/MANUAL.html

 **Step1**: kraken-build --download-library human  --threads 4 --db/software/kraken-0.10.5-beta

**Step 2**: 

export PATH=/software/jellyfish-1.1.11/bin:$PATH

cd /software/kraken-0.10.5-beta/library/Human





for file in *chr*.fa

do

/software/kraken-0.10.5-beta/kraken-build --add-to-library $file --db /software/kraken-0.10.5-beta

done




   **Step3**: 


   /software/kraken-0.10.5-beta/kraken-build --build --threads 4 --db /software/kraken-0.10.5-beta

After building the human data for kraken,I run kraken to get the composition ,but still get the human reads as unclassified:

 /software/kraken-0.10.5-beta/kraken  --preload --threads 16 --db software/kraken-0.10.5-beta --fastq-input --gzip-     compressed  --paired $inputdir/Sample_PDX75_81_no_EC_R1.fastq.gz $inputdir/Sample_PDX75_81_no_EC_R2.fastq.gz

$inputdir/kraken_out

The output log file:
189620 sequences classified (0.30%)

62326599 sequences unclassified (99.70%)

Thanks, Ron

rna-seq next-gen genome • 1.2k views
ADD COMMENTlink modified 19 months ago • written 19 months ago by Ron690

Kraken is taxonomic classification system. Are you looking to separate human reads in your dataset? Kraken is not the right tool for that.

Other tools may be more appropriate: Tool to separate human and mouse rna seq reads

ADD REPLYlink modified 19 months ago • written 19 months ago by genomax40k

No.I am looking to get the percentage of reads that belongs to each taxonomy. I know how to separate the human and mouse reads.

ADD REPLYlink modified 19 months ago • written 19 months ago by Ron690

I must be missing something obvious.

I am looking to get the percentage of reads that belongs to each taxonomy.

You are looking to get that for bacterial/viral/metagenomic data, correct? I am not sure why you are using the human database with kraken.

ADD REPLYlink written 19 months ago by genomax40k

Yes,thats correct.But also I want how much percentage of reads belongs to human and mouse. Do you think of any other way to get that composition ? for e.g i would like to see the result of classification as Bacteria .01%,Viruses .5%, Human 96%,Mouse 3%

ADD REPLYlink written 19 months ago by Ron690
1

You could do that using BBSplit from BBMap (A: Tool to separate human and mouse ran seq reads , click this link to go to the BBsplit part of the thread). This is the reason I had linked that thread above.

As for how many are human/mouse you can get that from BBSplit.Take the reads that don't map to human/mouse data from BBSplit and then run them through kraken with their bacterial/viral database to get the other % you need.

ADD REPLYlink modified 19 months ago • written 19 months ago by genomax40k

Which option in bbmap can be used to output the unmapped reads.I am using Seal instead of bbsplit since it is fast.

ADD REPLYlink written 19 months ago by Ron690

@Brian lists these options for seal.sh.

outu=<file>         (outunmatched) Write reads here that do not contain kmers 
                    matching the database.
outu2=<file>        (outunmatched2) Use this to write 2nd read of pairs to a 
                    different file.
ADD REPLYlink written 19 months ago by genomax40k

Can we get the paired end reads separated for human and mouse using seal ?Currently I am using seal to separate just human and mouse and then using reformat script to create paired end reads for each.

   bbmap/seal.sh in1=$TMPDIR/input/${input_file} in2=$TMPDIR/input/${input_file%R1*}R2.fastq.gz ref=/hg19.fa,/mm10.fa  pattern=$TMPDIR/dest/${input_file%R1*}_%.fq.gz ambig=all k=27 threads=32 prealloc speed=8

This creates one human fastq and one mouse Sample_PDX68_72_74_pool_EC__hg19.fq.gz, Sample_PDX68_72_74_pool_EC__mm10.fq.gz

ADD REPLYlink written 19 months ago by Ron690
1

I will tag Brian Bushnell

ADD REPLYlink written 19 months ago by genomax40k

I think you are doing it the right way.

@Brian may be along later. If there is an undocumented way then that would be our chance to hear about it.

ADD REPLYlink written 19 months ago by genomax40k

Sorry, that's the only way to do it right now. Seal does support the "#" symbol in input, so you can say "in=r#.fq" instead of "in1=r1.fq in2=r2.fq". But that is not currently supported for "pattern" output.

Many programs don't require paired reads to be in 2 files, though, so depending on what you are doing, you may not need to de-interleave them.

ADD REPLYlink modified 19 months ago • written 19 months ago by Brian Bushnell15k

Hi Brian,

I didn't get the unmapped reads using the options "outu" and "outu2 .Below is my script :

    software/bbmap/seal.sh in1=$TMPDIR/input/${input_file} in2=$TMPDIR/input/${input_file%R1*}R2.fastq.gz ref=/hg19.fa /mm10.fa  pattern=$TMPDIR/dest/${input_file%R1*}_%.fq.gz ambig=all k=27 threads=32 prealloc speed=8 refnames=t outu=$TMPDIR/dest/${input_file%R1*}R1.fastq.gz outu2=$TMPDIR/dest/${input_file%R1*}R2.fastq.gz stats=$TMPDIR/dest/${input_file%R1*}.contaminants

output files :

Sample_PDX68_72_74_pool_EC_.contaminants
Sample_PDX68_72_74_pool_EC__hg19.fq.gz
Sample_PDX68_72_74_pool_EC__mm10.fq.gz
ADD REPLYlink modified 19 months ago • written 19 months ago by Ron690

You need to specify the two reference sequences as follows (separated by a comma)

ref=hg19.fa,mm10.fa

Also /hg19.fa /mm10.fa paths are likely incorrect (unless you really have your files in the root directory). So provide the right paths.

My guess is based on the original post you have a very small # of sequences that are non-human, non-mouse (if I am reading the kraken result right): 189620

ADD REPLYlink modified 19 months ago • written 19 months ago by genomax40k

Thanks.I know that.It was a copy paste error.I have the output files as mentioned above.

Its just I dint get the unmapped files.

ADD REPLYlink written 19 months ago by Ron690

Were the files created but empty, or not created at all? And how many unmapped reads were there?

ADD REPLYlink written 19 months ago by Brian Bushnell15k

Not created at all.

 Input is being processed as paired
 Started output streams:    0.090 seconds.
 Processing time:           128.779 seconds.

  Input:                    25663220 reads      2591985220 bases.
  Matched reads:            24325834 reads (94.79%)     2456909234 bases (94.79%)
  Unmatched reads:          1337386 reads (5.21%)   135075986 bases (5.21%)

 Time:              362.447 seconds.
  Reads Processed:      25663k  70.81k reads/sec
  Bases Processed:       2591m  7.15m bases/sec
 sending incremental file list
 dest/
 dest/Sample_PDX68_72_74_pool_EC_.contaminants
 dest/Sample_PDX68_72_74_pool_EC__hg19.fq.gz
 dest/Sample_PDX68_72_74_pool_EC__mm10.fq.gz
ADD REPLYlink written 19 months ago by Ron690
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1737 users visited in the last hour