Trimming out primer sequences in the middle of reads
4.9 years ago

Hi!

I have PacBio reads that need to be assembled. These reads have Illumina primers at both ends as well as in the middle. The problem is that the primer sequences vary, so standard trimming cannot remove all of the primers in the reads. My lab wants the best-quality assembly possible, so I might have to write a script to detect the primers in the middle. I am currently thinking of removing sequences that are 80-100% similar to the primer sequences, but I am worried that this would also get rid of some informative genome sequence.

How do you guys deal with such situations?

genome • 3.3k views
4.9 years ago

I wrote a tool for removing internal PacBio adapter sequences, in the BBMap package:

removesmartbell in=reads.fq out=clean.fq split=t adapter=ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT


By default it uses the standard PacBio SMRTbell adapters, but you can specify an Illumina adapter in this case. It uses indel-aware alignment designed to model PacBio's rates of indel and substitution errors, and has a very low false-positive rate. I don't remember the exact rate, but I think it was around 1 false positive per 5 megabases of PacBio sequence, so it should not cause any problems downstream.
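
For anyone who wants to see what indel-aware adapter detection looks like in principle, here is a minimal Python sketch. It is not the algorithm used inside BBMap; it is a plain semi-global edit-distance search (substitutions, insertions and deletions all cost 1), and the function name `find_adapter`, the `max_error_rate` parameter and the short example adapter are assumptions made up for illustration. A `max_error_rate` of 0.2 corresponds roughly to the "80-100% similar" threshold from the question.

```python
def find_adapter(read, adapter, max_error_rate=0.2):
    """Return (start, end, edits) for the best approximate adapter hit, or None.

    Hypothetical helper: semi-global edit distance, so the adapter may match
    anywhere inside the read, with substitutions and indels each costing 1.
    """
    m = len(adapter)
    max_edits = int(max_error_rate * m)
    # col[i] = (edits, start): cheapest alignment of adapter[:i] against a read
    # substring that begins at `start` and ends at the current read position.
    col = [(i, 0) for i in range(m + 1)]
    best = None
    for j, base in enumerate(read, start=1):
        new = [(0, j)]  # an empty adapter prefix can begin anywhere for free
        for i in range(1, m + 1):
            sub_cost, sub_start = col[i - 1]
            sub = (sub_cost + (adapter[i - 1] != base), sub_start)
            del_cost, del_start = new[i - 1]  # adapter base missing from the read
            dele = (del_cost + 1, del_start)
            ins_cost, ins_start = col[i]      # extra base present in the read
            ins = (ins_cost + 1, ins_start)
            new.append(min(sub, dele, ins))
        col = new
        edits, start = col[m]
        if edits <= max_edits and (best is None or edits < best[2]):
            best = (start, j, edits)
    return best


# Illustrative read with a made-up flank on each side of the example adapter.
example_adapter = "AGATCGGAAGAGC"
example_read = "ACGTACGTAC" + example_adapter + "TTGCA"
hit = find_adapter(example_read, example_adapter)
if hit is not None:
    start, end, edits = hit
    print(f"adapter-like sequence at read[{start}:{end}] with {edits} edit(s)")
```

This is quadratic in read length times adapter length, so it is only meant to show the idea; for real data a dedicated tool such as the one above is the better choice.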

Hello, does your script also remove the reverse complement? And can I find it among the BBMap scripts?

You can include the reverse-complement (RC) sequence in the adapter file or in the command line above:

removesmartbell in=reads.fq out=clean.fq split=t adapter=ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT,RC_Sequence
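
If you need to generate the reverse complement to put in place of `RC_Sequence`, a small Python helper like the one below would do it. The function name `reverse_complement` is just an illustrative choice, and the sketch assumes the adapter contains only A, C, G and T (no ambiguity codes).

```python
# Translation table for plain DNA bases; ambiguity codes are not handled.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


adapter = "ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT"
# Prints the comma-separated value in the same form as the adapter= argument above.
print(f"adapter={adapter},{reverse_complement(adapter)}")
```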

4.9 years ago

If the adapter sequence you give as input is long enough, say > 15 nt, it's unlikely that you will throw away informative sequence (roughly speaking, of course).
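
To put a rough number on that rule of thumb, here is a back-of-the-envelope Python sketch. It assumes uniformly random, independent bases and exact matching only, which are simplifying assumptions of mine; real genomes are not random, and allowing mismatches or indels raises the chance of a spurious hit. The function name and the 5 Mb genome size are illustrative.

```python
def expected_chance_hits(k, genome_size, both_strands=True):
    """Expected number of exact occurrences of one fixed k-mer in random sequence.

    Assumes each base is A, C, G or T with probability 1/4, independently.
    """
    positions = max(genome_size - k + 1, 0)  # possible start positions
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** k


# Compare a few adapter lengths for a hypothetical 5 Mb genome.
for k in (10, 15, 20):
    print(k, expected_chance_hits(k, 5_000_000))
```

Under these assumptions the expected number of chance hits drops by a factor of 4 with every extra base, which is why longer adapter sequences are much safer to search for.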