Question

Mapping to multiple references using bbmap

0

Entering edit mode

2.2 years ago

mwr21 • 0

So my question comes in two parts:

First of all is what I'm trying to do within reason given the tools I am using?

I am investigating the shuffling effects of a recombinase on a known reporter sequence which subsequently generates libraries of unique sequences. By simulating all of the possible outcomes through a python script I am given 384 unique permutations that could potentially arise from the shuffling of the cassette. I am attempting therefore to map my next-gen sequencing reads to this library of simulated sequences as a reference using bbmap(bbsplit.sh) to bin the reads according to their alignment to the respective references. I have simulated references as fasta and fastq files but am running into the following error:

bbsplit.sh ref=Rci4_shuffled_library.fasta in=Rci4_test_reads.fasta basename=o%.fastq outu1=cleaned.fastq
java -ea -Xmx8353m -Xms8353m -cp /home/********/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 ref=Rci4_shuffled_library.fasta in=Rci4_test_reads.fasta basename=o%.fastq outu1=cleaned.fastq
Executing align2.BBSplitter [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, ref=Rci4_shuffled_library.fasta, in=Rci4_test_reads.fasta, basename=o%.fastq, outu1=cleaned.fastq]

Converted arguments to [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, in=Rci4_test_reads.fasta, basename=o%.fastq, outu1=cleaned.fastq, ref_Rci4_shuffled_library=Rci4_shuffled_library.fasta]
Merged reference file ref/genome/1/merged_ref_9223372035509296106.fa.gz already exists; skipping merge.
Ref merge time:         0.036 seconds.
Exception in thread "main" java.lang.AssertionError: Invalid input file: 'Rci4_test_reads.fasta'
        at align2.AbstractMapper.preparse0(AbstractMapper.java:822)
        at align2.AbstractMapper.<init>(AbstractMapper.java:54)
        at align2.BBMap.<init>(BBMap.java:42)
        at align2.BBMap.main(BBMap.java:30)
        at align2.BBSplitter.main(BBSplitter.java:48)

So...

1) Is what I'm trying to do achievable with bbmap?

2) If so, why is it failing to recognise my input files? I have tried both the fasta and fastq format for my input files but am hit with the same error message. I have also tried increasing my heap size using -Xmx% but no matter what I select I am told it is an invalid maximum heap size. I tried to check my settings using:

java -XshowSettings:vm

which returns an (estimated) max heap size of 3.10G which is above the inputs I have tried.

P.S. I'm new to linux so please excuse my coding illiteracy. I'm running Ubuntu on a Windows PC if that helps too.

bbmap sequence mapping reference alignment • 1.8k views

ADD COMMENT • link 2.2 years ago by mwr21 • 0

1

Entering edit mode

By simulating all of the possible outcomes through a python script I am given 384 unique permutations that could potentially arise from the shuffling of the cassette. I am attempting therefore to map my next-gen sequencing reads to this library of simulated sequences as a reference using bbmap(bbsplit.sh) to bin the reads according to their alignment to the respective references.

If you wish to bin the reads using bbsplit.sh then you will need to provide those references as individual fasta files. Idea of using bbsplit is you provide multiple reference genomes and then get reads binned according to how they map. e.g. bbsplit.sh in=reads.fq ref=seq1.fa,seq2.fa..seq384.fa

You could simply align the data to your 384 read reference using plain bbmap.sh. That would give you an aligned data file. You could then run samtools idxstats to get number of reads aligned to each of your reference.

Choose one of the two methods above depending on your ultimate goal.

ADD REPLY • link 2.2 years ago by GenoMax 142k

0

Entering edit mode

So as for the first method, I'm trying to avoid manually inputting individual files as this is laborious and inefficient, especially if I am to be working with multiple variations of similar but distinct shuffled libraries. If there's no way to do this with bbsplit that's fair enough. I was originally using minimap2 instead but bbsplit caught my attention for being able to handle multiple reference sequences and bin them. I was under the impression this was possible by inputting a single database.

For the second option I may use minimap2 instead for the alignment which I do know accepts reference databases as single files and will try using idxstats to get read numbers. I should have mentioned, my ultimate goal is to output the reads aligned to each reference as individual files (per reference) that could then be directly input into a visualisation script which calls from a database to determine the block order in the shuffled cassete reads. So something like:

READ1: BLOCK1, BLOCK4, BLOCK3, BLOCK2, BLOCK5

READ2: BLOCK3, BLOCK2, BLOCK5, BLOCK4, BLOCK1

Which could easily be turned into a heatmap or something similar to visually represent the block positions and relative frequencies.

ADD REPLY • link 2.2 years ago by mwr21 • 0

0

Entering edit mode

not sure but it might be it fails because you enter fasta format and request fastq as output ?

if you say both types of input fail, can you post an extract of both here? first dozen lines of each should do.

ADD REPLY • link 2.2 years ago by lieven.sterck 15k