Question: Find_primer_sequence_and_reverse_complement_read_within_amplicon.fastq_file script needed
0
gravatar for karsten.liere
19 months ago by
karsten.liere0 wrote:

Hi,

I have a fastq file with merged amplicon sequence reads post Illumina sequencing. However, the library was build by ligating the Illumina adapters to the amplicon. Therefore, the merged amplicon sequences have no common direction. Now I had the idea to use a script that finds the exact sequence of a certain primer and reverse complements those sequences within the fastq file. One _is_ able to do this in Geneious, however, I'd rather like to use a script/tool to incorporate into the pipeline I want to establish. I already looked into BBtools and others, however, couldn't find a tool doing this.

Thank you for any input.

Cheers,

Karsten

sequencing next-gen • 537 views
ADD COMMENTlink written 19 months ago by karsten.liere0

Hello karsten.liere!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=80003

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 19 months ago by Pierre Lindenbaum122k

Hello Pierre,

Oh - I'm sorry about that. I had no idea that the two communities are as closely knit as they are. I'll keep it in mind. Thank you for the heads-up.

Cheers,

Karsten

ADD REPLYlink written 19 months ago by karsten.liere0
3
gravatar for genomax
19 months ago by
genomax70k
United States
genomax70k wrote:

exact sequence of a certain primer and reverse complements those sequences within the fastq file

Just that part of the sequence or the whole sequence? If whole sequence is fine then you should be able to select matching reads that contain the sequence you are interested in using bbduk.sh and then you can reverse-complement them using reformat.sh.

ADD COMMENTlink modified 19 months ago • written 19 months ago by genomax70k

Hi genomax,

ah, thank you for your idea. Got me going ;-). Indeed I'd like to rc the entire read. So here's the "workflow":

bbduk.sh in=in.fastq literal=ACACGACTACNVGGGTATCTAAKCC out=to_rc.fastq outm=not_to_rc.fastq
reformat.sh in=to_rc.fastq rcomp=f out=rced.fastq
cat not_to_rc.fastq, rced.fastq > in_trimmed_aligned.fastq

That's what you thought of, right?

However, the literal argument doesn't find the primer sequence. Any idea about that?

Thank you for your input!

Cheers,

Karsten

ADD REPLYlink modified 19 months ago • written 19 months ago by karsten.liere0
1

I think literal option uses exact sequence. You may need to make a file with all possible options for the sequences and provide that with ref=. Or you could look for just the invariant part at the beginning literal=ACACGACTAC and adjust k= accordingly.

For reformat.sh step, you need to use rcomp=t, if you want to reverse complement the reads.

ADD REPLYlink modified 19 months ago • written 19 months ago by genomax70k

I now used

bbduk.sh in=in.fastq literal=ACACGACTAC k=10 out=not_to_rc.fastq outm=to_rc.fastq
reformat.sh in=to_rc.fastq rcomp=f out=rced.fastq
cat not_to_rc.fastq rced.fastq > in_trimmed_aligned.fastq

and it worked nicely, thank you so much.

PS: Why did you suggest using rcomp=t instead of rcomp=f ?

ADD REPLYlink modified 19 months ago • written 19 months ago by karsten.liere0

I thought you wanted to reverse-complement those reads which contained the adapter and that was why I suggested setting rcomp to true. If the results look as what you expect them to be then you can go with what you have. I will move my original comment to an answer. You can accept it to provide closure to this thread.

ADD REPLYlink modified 19 months ago • written 19 months ago by genomax70k

I see - yes indeed. So there is a way to rc the reads within the original fastq file without writing the out=not_to_rc.fastq outm=to_rc.fastq files to disk as input files for reformat.sh? Right now bbduk.sh writes those two files to disk and reformat.sh takes the to_rc.fastq file, rcs it, and saves the resulting file again. That's why I use the cat command to concatenate the not_to_rc.fastq file from the bbduk.sh step and the rced.fastq file from the reformat.sh step into a single file.

Thank you for your patience,

k.

PS: And yes, question's solved - thanks again!

ADD REPLYlink modified 19 months ago • written 19 months ago by karsten.liere0
1

Answer to your question is you should never mess with original data files.

See if following does what you need without having to write at least one intermediate file:

bbduk.sh in=in.fastq literal=ACACGACTAC k=10 out=not_to_rc.fastq outm=stdout.fastq | reformat.sh in=stdin.fastq rcomp=f out=stdout.fastq | cat not_to_rc.fastq - > in_trimmed_aligned.fastq

If this is working as expected then add a final pipe | rm -f not_to_rc.fastq to the command above to leave no intermediate files.

bbduk.sh in=in.fastq literal=ACACGACTAC k=10 out=not_to_rc.fastq outm=stdout.fastq | reformat.sh in=stdin.fastq rcomp=f out=stdout.fastq | cat not_to_rc.fastq - > in_trimmed_aligned.fastq | rm -f not_to_rc.fastq
ADD REPLYlink modified 19 months ago • written 19 months ago by genomax70k

Thank you!

And yes - I _am_ stupid - f means obviously false not file. Oh dear. Sometimes....

So far I'm not really pleased with the bbduk side of this. It yet misses to many sequences. I'll play a bit around with it.

Cheers,

k.

ADD REPLYlink written 19 months ago by karsten.liere0

As long as you get the settings right bbduk.sh will definitely work. You may want to reduce that value of k= since contaminants smaller than 10 will not be found in your current command. Value can be as small as 1 but you have to be careful. Failing that you can always brute-force all possible combinations of sequence you expect.

ADD REPLYlink written 19 months ago by genomax70k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 553 users visited in the last hour