Question: Tool to separate human and mouse rna seq reads
2
gravatar for Ron
3.9 years ago by
Ron910
United States
Ron910 wrote:

Hi,

I have PDX RNA-seq data.I want to separate the reads of mouse and human,before comparing with tumor rna-seq data.Is there any tool to do that?

After removing the mouse reads I want to do differential expression between primary tumors and PDX tumors.

Thanks,

Ron

 

sequencing rna-seq next-gen • 4.7k views
ADD COMMENTlink modified 3.5 years ago by Sukhdeep Singh9.6k • written 3.9 years ago by Ron910
5
gravatar for ethan.kaufman
3.9 years ago by
ethan.kaufman360
Canada
ethan.kaufman360 wrote:

Xenome works pretty well.

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by ethan.kaufman360

the page seems to have moved.

ADD REPLYlink written 3.9 years ago by Ron910

Hmm, that's unfortunate.  Maybe you can try contacting the corresponding author?  Another approach would simply be to align first to the mouse reference, and then use the unmapped reads for your analysis, but as I recall Xenome uses a more sophisticated filtering strategy.

ADD REPLYlink written 3.9 years ago by ethan.kaufman360

Found this link: http://bioinformatics.research.nicta.com.au/software/xenome/

ADD REPLYlink written 3.9 years ago by ethan.kaufman360

This link too is now dead. Anyone have a new source?

ADD REPLYlink written 2.8 years ago by jasper191810

appears to be bundled here https://github.com/data61/gossamer

if it does not work (recent build, so should be OK), I have the original binaries that I downloaded from the NICTA site.

ADD REPLYlink written 19 months ago by Klugman20

Klugman, I downloaded the current version from Github. I'm facing the following issue - https://github.com/data61/gossamer/issues/9. Could you please share the original binaries you downloaded from NICTA?

ADD REPLYlink written 19 months ago by kannabiran.n0
3
gravatar for Brian Bushnell
3.9 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

BBSplit will map the reads simultaneously to multiple references, and output in multiple files, one per reference. Since you're looking at RNA-seq of tumors, you would want to increase the sensitivity over the default, using settings like (if mapping to the genomes):

bbsplit.sh in=reads.fq ref=hg19.fa,mm10.fa minratio=0.5 maxindel=100000 minhits=1 basename=out_%.fq ambig2=all local

If mapping to transcriptomes, maxindel=500 would probably be better than 100000.

ADD COMMENTlink modified 5 weeks ago by RamRS20k • written 3.9 years ago by Brian Bushnell16k

I have paired end reads in format of fastq.gz so how can I input them.

Also In this case I don't have to index hg19.fa and mm10.fa?

ADD REPLYlink written 3.9 years ago by Ron910
1

BBSplit will automatically index them when you give it that command, if they were not already indexed.  Also, it will accept fastq.gz as input or output; you don't need to unzip them.

ADD REPLYlink written 3.9 years ago by Brian Bushnell16k

The command produced two folders namely genome and index.The genome folder has these files :

chr10.chrom.gz
chr11.chrom.gz
chr12.chrom.gz
chr1.chrom.gz
chr2.chrom.gz
chr3.chrom.gz
chr4.chrom.gz
chr5.chrom.gz
chr6.chrom.gz
chr7.chrom.gz
chr8.chrom.gz
chr9.chrom.gz
info.txt
merged_ref_6708417211327.fa.gz
namelist.txt
reflist.txt
scaffolds.txt.gz
summary.txt

the index folder has these files:

chr12_index_k13_c2_b1.block
chr12_index_k13_c2_b1.block2.gz
chr1-3_index_k13_c2_b1.block
chr1-3_index_k13_c2_b1.block2.gz
chr4-7_index_k13_c2_b1.block
chr4-7_index_k13_c2_b1.block2.gz
chr8-11_index_k13_c2_b1.block
chr8-11_index_k13_c2_b1.block2.gz

I was expecting paired end reads for mouse and paired end reads for human.

ADD REPLYlink modified 6 months ago by RamRS20k • written 3.8 years ago by Ron910
1

Hi Ron,

The contents of /ref/genome/ and /ref/index/ are just the index; don't worry about that. The input reads should have been specified as in=reads.fq, or for paired reads named read1.fq and read2.fq, in=read#.fq Then if you set basename=out_%.fq your output paired reads would be out_hg19.fq and out_mm10.fq.

Those will be paired and interleaved. You can deinterleave them with reformat.sh if you want them in two files. Are the files out_hg19.fq and out_mm10.fq present? They should be in the working directory you were in when you ran the comand (not in /ref/).

ADD REPLYlink modified 5 weeks ago by RamRS20k • written 3.8 years ago by Brian Bushnell16k

Yes.I got the two files out_hg19.fq and out_mm10.fq.

Now I have interleaved them using reformat.sh script and then I compressed the reads using gzip.Thanks Brian.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Ron910

I am getting this error while running a parallel job.It worked well for a single sample.Do I have to use -Xmx3g option ?if yes what would be the appropriate memory for java?

Could not create the Java virtual machine.
Error occurred during initialization of VM
Could not reserve enough space for object heap
java -Djava.library.path=/rob2056/software/bbmap/jni/ -ea -Xmx52753m -cp /rob2056/software/bbmap/current/ align2.BBSplitter build=1 ow=t fastareadlen=500 minhits=2 minratio=0.9 maxindel=20 trim=rl untrim=t in1=/tmp/419629.1.standard.q/inpu
t/Sample_ZA30_cells_R1.fastq.gz in2=/tmp/419629.1.standard.q/input2/Sample_ZA30_cells_R2.fastq.gz ref=/pbtech_mounts/fdlab_store003/fdlab/genomes/human/hg19/indexes/star/hg19.fa,/pbtech_mounts/fdlab_store003/fdlab/genomes/mouse/mm10/genome/mm10.fa minratio=0.5 maxindel=100000 minhits=1 basename=/tmp/419629.1.standard.q/dest/Sample_ZA30_cells_R1.fastq_%.fq ambig
2=all local
ADD REPLYlink modified 6 months ago by RamRS20k • written 3.8 years ago by Ron910

If you want to run multiple copies of BBSplit simultaneously on the same machine, for mouse and human reference together, you can include the flag -Xmx48g.  There is not much point in running it simultaneously, though, because BBSplit is multithreaded and automatically uses all available cores (unless you set the number of threads with e.g. "t=4").  So, 2 instances at once on the same machine will not go any faster than 2 instances one after the other.

ADD REPLYlink written 3.8 years ago by Brian Bushnell16k

I am running on SGE cluster with gnu parallel as I have 20 samples.The only thing is I want to run all samples in parallel.I have given -Xmx20g as of now,if not then will give 48g.

ADD REPLYlink written 3.8 years ago by Ron910

Hi Brian,

I am running BBmap on 2 samples which have fastq of sizes 50G each (paired end).These have been running over since 4 days.Is there any way to make the process faster?

bbmap/bbsplit.sh in1=$TMPDIR/input/${input_file} in2=$TMPDIR/input/${input_file%R1*}R2.fastq.gz ref=/hg19.fa,/mm10.fa minratio=0.5 maxindel=500 minhits=1 basename=$TMPDIR/dest/${input_file%R1*}_%.fq ambig2=all local
ADD REPLYlink modified 6 months ago by RamRS20k • written 3.5 years ago by Ron910

Ron,

How many threads are you using?  This should really not take very long unless you're restricting it to a single thread or something.  Also, are you sure it has not crashed?  Run "top" to verify that it is still consuming CPU resources, and check the output to stderr as well - if it crashes, it will print some kind of "Exception".

Adding the "fast" flag and increasing minratio will make it go faster, at the expense of reduced sensitivity.  Alternately, you can now split datasets between references with Seal, which is WAY faster but requires more memory (it uses kmer-matching rather than alignment).  The amount of memory it uses is tunable, though, again with a tradeoff of sensitivity.

ADD REPLYlink written 3.5 years ago by Brian Bushnell16k

Here is the log file :

java -Djava.library.path=/software/bbmap/jni/ -ea -Xmx154418m -cp /software/bbmap/current/ align2.BBSplitter build=1 ow=t fastareadlen=500 minhits=2 minratio=0.9 maxindel=20 trim=rl untrim=t in1=/tmp/487047.1.standard.q/inpu
t/Sample1_R1.fastq.gz in2=/tmp/487047.1.standard.q/input/Sample1_R2.fastq.gz ref=/pbtech_mounts/fdlab_store003/fdlab/genomes/human/hg19/indexes/star/hg19.fa,/pbtech_mounts/fdlab_store003/fdlab/genomes/mouse/mm10/genome/mm10.fa minratio=0.5 maxindel=500 minhits=1 basename=/tmp/487047.1.standard.q/dest/Sample1__%.fq ambig2=all local
Executing align2.BBSplitter [build=1, ow=t, fastareadlen=500, minhits=2, minratio=0.9, maxindel=20, trim=rl, untrim=t, in1=/tmp/487047.1.standard.q/input/Sample1_R1.fastq.gz, in2=/tmp/487047.1.standard.q/input/Sample1_R2.fastq.gz, ref=/pbtech_mounts/fdlab_store003/fdlab/genomes/human/hg19/indexes/star/hg19.fa,/pbtech_mounts/fdlab_store003/fdlab/genomes/mouse/mm1
0/genome/mm10.fa, minratio=0.5, maxindel=500, minhits=1, basename=/tmp/487047.1.standard.q/dest/Sample1__%.fq, ambig2=all, local]

Converted arguments to [build=1, ow=t, fastareadlen=500, minhits=2, minratio=0.9, maxindel=20, trim=rl, untrim=t, in1=/tmp/487047.1.standard.q/input/Sample1_R1.fastq.gz, in2=/tmp/487047.1.standard.q/input/Sample1_R2.fastq.gz, minratio=0.5, maxindel=500, minhits=1, basename=/tmp/487047.1.standard.q/dest/Sample1__%.fq, ambig2=all, local, ref_hg19=/pbtech_mounts/fd
lab_store003/fdlab/genomes/human/hg19/indexes/star/hg19.fa, ref_mm10=/pbtech_mounts/fdlab_store003/fdlab/genomes/mouse/mm10/genome/mm10.fa]
Creating merged reference file ref/genome/1/merged_ref_6708417211327.fa.gz
Ref merge time:    173.220 seconds.
Executing align2.BBMap [build=1, ow=t, fastareadlen=500, minhits=2, minratio=0.9, maxindel=20, trim=rl, untrim=t, in1=/tmp/487047.1.standard.q/input/Sample1_R1.fastq.gz, in2=/tmp/487047.1.standard.q/input/Sample1_R2.fastq.gz, minratio=0.5, maxindel=500, minhits=1, ambig2=all, local, ref=ref/genome/1/merged_ref_6708417211327.fa.gz, out_hg19=/tmp/487047.1.standard
.q/dest/Sample1__hg19.fq, out_mm10=/tmp/487047.1.standard.q/dest/Sample1__mm10.fq]

BBMap version 34.94
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.900
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.500
Retaining first best site only for ambiguous mappings.
NOTE: Deleting contents of ref/genome/1 because reference is specified and overwrite=true
Writing reference.
Executing dna.FastaToChromArrays2 [ref/genome/1/merged_ref_6708417211327.fa.gz, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

Set genScaffoldInfo=true
Writing chunk 1
Writing chunk 2
Writing chunk 3
Writing chunk 4
Writing chunk 5
Writing chunk 6
Writing chunk 7
Writing chunk 8
Writing chunk 9
Writing chunk 10
Writing chunk 11
Writing chunk 12
Set genome to 1

Loaded Reference: 0.004 seconds.
Loading index for chunk 1-12, build 1
No index available; generating from reference genome: /tmp/487047.1.standard.q/ref/index/1/chr1-3_index_k13_c2_b1.block
Indexing threads started for block 0-3
Indexing threads finished for block 0-3
No index available; generating from reference genome: /tmp/487047.1.standard.q/ref/index/1/chr4-7_index_k13_c2_b1.block
Indexing threads started for block 4-7
Indexing threads finished for block 4-7
No index available; generating from reference genome: /tmp/487047.1.standard.q/ref/index/1/chr8-11_index_k13_c2_b1.block
Indexing threads started for block 8-11
Indexing threads finished for block 8-11
No index available; generating from reference genome: /tmp/487047.1.standard.q/ref/index/1/chr12_index_k13_c2_b1.block
Indexing threads started for block 12-12
Indexing threads finished for block 12-12
Generated Index: 717.384 seconds.
Analyzed Index:   15.050 seconds.
Reads that map to multiple references will be written to all relevant output streams.
Cleared Memory:   9.979 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 4 mapping threads.

Now,I am running with this command after adding threads,fast parameters.I think earlier I was running with one thread only.

software/bbmap/bbsplit.sh in1=$TMPDIR/input/${input_file} in2=$TMPDIR/input/${input_file%R1*}R2.fastq.gz ref=/pbtech_mounts/fdlab_store003/fdlab/genomes/human/hg19/indexes/star/hg19.fa,/pbtech_mounts/fdlab_store003/fdlab/genomes/mouse/mm10/genome/mm10.fa minratio=0.56 maxindel=500 minhits=1 basename=$TMPDIR/dest/${input_file%R1*}_%.fq ambig2=all local fast=<f> threads=<4>
ADD REPLYlink modified 6 months ago by RamRS20k • written 3.5 years ago by Ron910

Also is there any way to directly generate the compressed fastq files?

ADD REPLYlink written 3.5 years ago by Ron910

Output files will be written compressed if you add the suffix ".gz" to them:

basename=$TMPDIR/dest/${input_file%R1*}_%.fq

Also, are you sure you only have 4 cores? That's not very many for a machine with ~192 gigs of RAM. There's probably a better way but you can normally find out how many logical processors are present with cat /proc/cpuinfo

ADD REPLYlink modified 6 months ago by RamRS20k • written 3.5 years ago by Brian Bushnell16k

Since I am working on cluster.I am using -smp as well.

#!/bin/bash -l
#$ -j y
#$ -cwd -S /bin/sh
#$ -l h_vmem=45G
#$ -pe smp 4

Initially this was the reason that I did not added threads to parallelize.

ADD REPLYlink modified 6 months ago by RamRS20k • written 3.5 years ago by Ron910

Hi Brian,

So I increased number of threads to 16.The job has been running since yesterday almost 24 hours.The current file size is ~4 gb (human reads fastq) only and the input files are 50G(paired end) each.The parameters of my shell script on cluster are in the previous comment.I think this will take a lot of time.Let me know if you could assist in making this run faster.

ADD REPLYlink modified 6 months ago by RamRS20k • written 3.5 years ago by Ron910

I'm not sure why it's so slow... but, you can greatly speed it up like this:

seal.sh in=reads.fq pattern=out_#.fq ref=human.fa,mouse.fa ambig=all k=27 prealloc speed=8 outu=nonmatching.fq threads=16

This will require a little over 60GB RAM (total, not per-node) and run very fast, particularly if you give it 16 threads (Seal is usually around 100x faster than BBSplit, though it takes more memory). It works by counting kmer matches rather than alignment. You can then run BBSplit on just the unmapped reads if you want, or include them with human.

ADD REPLYlink modified 6 months ago by RamRS20k • written 3.5 years ago by Brian Bushnell16k
bbmap/seal.sh in1=$TMPDIR/input/${input_file} in2=$TMPDIR/input/${input_file%R1*}R2.fastq.gz ref=/pbtech_mounts/fdlab_store003/fdlab/genomes/human/hg19/indexes/star/hg19.fa,/pbtech_mounts/fdlab_store003/fdlab/genomes/mouse/mm10/genome/mm10.fa  pattern=$TMPDIR/dest/${input_file%R1*}_%.fq.gz ambig=all k=27 threads=32 prealloc speed=8

Using this I am getting these output files

Sample1__chr10.fq.gz
Sample1__chr11.fq.gz
Sample1__chr12.fq.gz
Sample1__chr13.fq.gz
Sample1__chr14.fq.gz
Sample1__chr15.fq.gz
Sample1__chr16.fq.gz
Sample1__chr17.fq.gz
Sample1__chr18.fq.gz
Sample1__chr19.fq.gz
Sample1__chr1.fq.gz
Sample1__chr20.fq.gz
Sample1__chr21.fq.gz
Sample1__chr22.fq.gz
Sample1__chr2.fq.gz
Sample1__chr3.fq.gz
Sample1__chr4.fq.gz
Sample1__chr5.fq.gz
Sample1__chr6.fq.gz
Sample1__chr7.fq.gz
Sample1__chr8.fq.gz
Sample1__chr9.fq.gz
Sample1__chrM.fq.gz
Sample1__chrX.fq.gz
Sample1__chrY.fq.gz

Am not sure if mouse reads are getting generated here(although the job is still running).

it seems pretty fast though.

ADD REPLYlink modified 6 months ago by RamRS20k • written 3.5 years ago by Ron910

Ah, I forgot about that. By default Seal splits on a per-reference-sequence basis. You can use the flag refnames=t to make one output file per reference file instead of one file per reference sequence.

ADD REPLYlink modified 6 months ago by RamRS20k • written 3.5 years ago by Brian Bushnell16k

Would I have to merge the fastq of all chromosomes in the end, or will generate a merge fastq for human and merged fastq for mouse?

ADD REPLYlink written 3.5 years ago by Ron910

With "refnames=t" you would get 2 fastq files, one for human and one for mouse.  Without that flag it will produce one file per chromosome and you'd need to merge them.  But, Seal makes an assumption that each sequence has a unique name when splitting them by sequence, so if your mouse and human references have sequences with the same name (e.g. "chr1") then you need to to re-run it with the "refnames=t" flag.

ADD REPLYlink written 3.5 years ago by Brian Bushnell16k

Hi Brian, 

Thanks much,this worked pretty fast.is there any way you would know of analyzing the mouse reads and getting read counts per gene?This is WGS data.I have done read counts from RNA-seq data using htseq-count.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Ron910

Hi Brian, is there a publication or a description of this tool available? I will try it in few days. Thanks a lot

ADD REPLYlink written 2.8 years ago by sdum20

There's no publication yet, but there's a description of it here. Also, when you download the BBMap package, there's a guide for Seal (which can be used similarly to BBSplit, but uses kmer-matching instead of mapping) in /bbmap/docs/guides/SealGuide.txt which gives a description of Seal. I haven't written one for BBSplit yet; I should do that.

ADD REPLYlink written 2.8 years ago by Brian Bushnell16k

Hi Brian, How can we output the unmapped reads that don't map to either human or mouse, for further analysis?I am using seal script in bbmap

ADD REPLYlink written 2.8 years ago by Ron910

Use the outu flag, e.g.:

seal.sh in=reads.fq pattern=mapped_%.fq outu=unmapped.fq ref=[references]

ADD REPLYlink written 2.8 years ago by Brian Bushnell16k

Thanks a lot. I try to use it now and I've a new question. Is it possible to use splitted fastq? Or I must concatenate them before?

Here is my command line I have used to try:

../bbsplit.sh in=/raw/fastq/141_CACTTCGA_L003_R#_*.fastq.gz ref=human.fa,mouse.fa basename=out_%_R#.fastq.gz outu1=unmapped1.fastq.gz outu2=unmapped2.fastq.gz threads=20

Thanks a lot in advance

ADD REPLYlink modified 6 months ago by RamRS20k • written 2.8 years ago by sdum20

That should work fine. Note that there is a difference between "concatenated" and "interleaved" fastq - never concatenate read1 and read2 files; that would break pairing. If you wanted to turn them into a single file, you would do something like this:

reformat.sh in=r#.fq out=interleaved.fq
ADD REPLYlink modified 6 months ago by RamRS20k • written 2.8 years ago by Brian Bushnell16k

Of course, I don't want to concatenate R1 and R2. But, for one sample, I have several fastq splitted for R1, and the same number for R2 (splitted by illumina software casava). I wondering if it is possible to run only one bbplit, but I will launch it several times, in parrallel, and then concatenate results files. Thanks again

ADD REPLYlink written 2.8 years ago by sdum20

Hi Brian, Is there any way to split ribosomal rna reads using this? Let me know.Can we do by including the fast file for rRNA?If yes,How can we get this file?

Thanks, Ron

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Ron910

Hi Ron,

To use BBSplit you must have at least two references; in this case, a ribosomal reference an another for the main genome without ribosomal sequences. If you only have the ribosomal sequence, you can just map to it with BBMap using fairly strict parameters and collect the unmapped reads:

bbmap.sh in=reads.fq ref=ribo.fa maxindel=1 minid=0.95 outm=ribo_reads.fq outu=unmapped.fq

Depending on the situation, if you have sufficient data, and you don't have a ribosomal reference, you could try to assemble the ribosomes on the assumption that they have much higher coverage than the rest. What kind of experiment are you doing, on what kind of organism, and how much coverage do you have? How long are the reads?

ADD REPLYlink written 2.5 years ago by Brian Bushnell16k

Hi Brian,

Just want to look for contamination ,we think that there are lot of reads present as rRNA in our human samples RNA seq data. Although we have sufficient amount of human reads ~40 million mapped but there are still lot of unmapped reads.The reads are 100 bp paired end.I don't have a rRNA fasta file so would have to assemble it.

ADD REPLYlink written 2.5 years ago by Ron910

I found this rRNA fasta file here: http://genomespot.blogspot.com/2015/08/screen-for-rrna-contamination-in-rna.html

Do you think this should be enough?

ADD REPLYlink written 2.5 years ago by Ron910
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1127 users visited in the last hour