Question: randomreads.sh simulated mates are far away
grant.hovhannisyan wrote:

I am simulating test reads with randomreads.sh from BBMap package.

My command is:

/Software/bbmap/randomreads.sh ref=C_glabrata_CBS138_current_chromosomes.fasta out1=read1.fastq out2=read2.fastq build=1 length=10 reads=2 coverage=-1 replacenoref=t simplenames=t seed=-1 paired=t metagenome=t addpairnum=t

The output is:

for read1

@0_+106297_106306_ChrM_C_glabrata_CBS138 (1402899 nucleotides) 1:
AGGTTTTAAT
+
989945?446
@1-_438159_438168_ChrJ_C_glabrata_CBS138 (1195129 nucleotides) 1:
ATATCTTCCT
+
AA?CB?@bcb`

for read2 :

@0_-761178_761187_ChrM_C_glabrata_CBS138 (1402899 nucleotides) 2:
AGCAGAGAGA
+
?>64844:64
@1+_646646_646655_ChrJ_C_glabrata_CBS138 (1195129 nucleotides) 2:
TGCCAGTTTC
+
??B@A?><@>

As far as I understand, read @0 from read1 and read @0 from read2 form a read pair. However, based on their coordinates they are very far away from each other. Is this normal behaviour of the software?

Thanks

P.S. Hope Brian Bushnell will see this post :)


Is there a reason you have metagenome=t set? 10 bp reads don't make much sense either.

(reply by genomax)

In this test case, no, there is no reason. But for my actual pipeline I will need that option to simulate an RNA-Seq read distribution. By the way, with large transcriptomes (like human), that option slows the simulation down dramatically. I guess that is worth a separate question.

(reply by grant.hovhannisyan)

Trying to simulate a metagenome (while providing a single reference) may be causing the strange behavior. Can you remove that option and see if the results look more reasonable?

@Brian had these notes on metagenome simulations:

metagenome will assign each scaffold a random exponential variable, which decides the probability that a read will be generated from that scaffold. So, if you concatenate together 20 bacterial genomes, you can run randomreads and get a metagenomic-like distribution. It could also be used for RNA-seq when using a transcriptome reference.

The coverage is decided on a per-reference-sequence level, so if a bacterial assembly has more than one contig, you may want to glue them together first with fuse.sh before concatenating them with the other references.

(reply by genomax)
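For illustration, a rough sketch of the workflow described above, using two hypothetical assembly files genomeA.fa and genomeB.fa as placeholders; the fuse.sh and randomreads.sh parameters shown should be checked against your installed BBMap version:

# Glue multi-contig assemblies into one sequence each, so the per-scaffold
# coverage variable is drawn per genome rather than per contig.
fuse.sh in=genomeA.fa out=genomeA_fused.fa
fuse.sh in=genomeB.fa out=genomeB_fused.fa

# Concatenate the fused genomes into a single reference.
cat genomeA_fused.fa genomeB_fused.fa > combined_ref.fa

# Simulate paired reads with a metagenome-like abundance distribution
# across the concatenated sequences (read length and count are arbitrary here).
randomreads.sh ref=combined_ref.fa out1=r1.fq out2=r2.fq length=150 reads=1000000 paired=t metagenome=t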

Thanks @genomax. It seems that mininsert=0 solved the issue. From your comment it also makes sense to me now why it takes so long to simulate data for the human transcriptome with metagenome=t. Do you have experience with large transcriptomes and randomreads.sh? For me it takes ca. 15 min to simulate 100 reads.

(reply by grant.hovhannisyan)

Did you try simulating longer reads, like 100 bp or 250 bp?

(reply by h.mon)

For the question in my original post: yes, with longer reads there is no problem, because apparently you then always get positive values for the insert size. As for the speed of simulating large transcriptomes, I am now trying with longer reads and a larger number of reads.

(reply by grant.hovhannisyan)
h.mon wrote:

Indeed, the paired reads seem really far apart. The randomreads.sh help does not explain how read length controls the default insert size, but you can control insert size with these parameters (see the example after the list):

mininsert=          Controls minimum insert length.  Default depends on read length.
maxinsert=          Controls maximum insert length.  Default depends on read length.
triangle=t          Make a triangular insert size distribution.
flat=f              Make a roughly flat insert size distribution.
superflat=f         Make a perfectly flat insert size distribution.
gaussian=f          Make a bell-shaped insert size distribution, with 
                    standard deviation of (maxinsert-mininsert)/6.
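
For example, a command along these lines would pin the insert-size range explicitly; the values 20 and 120 are arbitrary illustrations, and the parameters are the ones listed above:

# Explicit insert-size bounds so the default (which can go negative for very
# short reads) is never used; gaussian=t gives a bell-shaped distribution.
randomreads.sh ref=C_glabrata_CBS138_current_chromosomes.fasta out1=read1.fastq out2=read2.fastq length=10 reads=2 paired=t mininsert=20 maxinsert=120 gaussian=t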

It seems I have found the reason. When you specify a very short read length (for example 10), the software prints this line to stdout: insert size=-80-120. I think this negative insert size messes everything up. And you are right, it is not mentioned how the insert size is calculated from the read length.

EDIT: OK, it seems that setting mininsert=0 fixes the issue. It looks like a bug.

(reply by grant.hovhannisyan)
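
For reference, the original command with that workaround applied would look roughly like this (an unverified sketch, simply appending mininsert=0 to the command from the question):

# Original command plus the mininsert=0 workaround reported above.
randomreads.sh ref=C_glabrata_CBS138_current_chromosomes.fasta out1=read1.fastq out2=read2.fastq build=1 length=10 reads=2 coverage=-1 replacenoref=t simplenames=t seed=-1 paired=t metagenome=t addpairnum=t mininsert=0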