Question: Remove hundreds of probes
0
gravatar for nbhardwaj
10 months ago by
nbhardwaj110
United States
nbhardwaj110 wrote:

Hi, I have paired-end fastq file which contain the sequences of the target regions along with probes sequences on either side (in entirety or partially). I have sequences of all the 3,200 probes. I want to quickly trim the probe sequences from the fastq files if they occur on either side of the target region (only if they start with or end with the probes).

Does anyone have a suggestion for a tool to do the above? I have tried Trimmomatic and Cutadapt but they are too slow as they are designed with only a handful of probes (or adapters) in mind?

Is there anyway to remove 1000's of probes that occur at the end of the reads in paired-end fastq files? Thanks!

trimming probes • 393 views
ADD COMMENTlink modified 10 months ago by chen1.7k • written 10 months ago by nbhardwaj110
0
gravatar for genomax
10 months ago by
genomax55k
United States
genomax55k wrote:

You can use bbduk.sh from BBMap suite. Provide the sequences of the probes in multi-fasta format in a file. You will have to do ktrim=lr if the sequences are on either side (or may have to do a two-pass trim).

ADD COMMENTlink written 10 months ago by genomax55k

Thanks, genomax. I did try bbduk.sh but it did not give me what I expected. I ran it to only look for probes on the 5' end with exact matches in the first 31 bp as follows:

bbduk.sh in1=L001_R1_001.fastq.gz in2=L001_R2_001.fastq.gz out1=bbmap_R1.fastq.gz out2=bbmap_R2.fastq.gz ref=RC_DLSO_and_ULSO.fa rcomp=f restrictleft=31 hdist=0 minkmerfraction=1.0 tbo

Added 8630 kmers; time:     0.225 seconds.
Memory: max=1457m, free=1396m, used=61m
Input is being processed as paired
Started output streams: 0.121 seconds.
Processing time:        14.583 seconds.
Input:                      1089512 reads       164516312 bases.
Contaminants:               0 reads (0.00%)     0 bases (0.00%)
Trimmed by overlap:         51452 reads (4.72%)     2452840 bases (1.49%)
Total Removed:              0 reads (0.00%)     2452840 bases (1.49%)
Result:                     1089512 reads (100.00%)     162063472 bases (98.51%)

As you can see, it did not remove any bases due to matches with the probes (although I checked that they are there) but only due to overlap. Moreover, my ref file (RC_DLSO_and_ULSO.fa) contains ~6,500 probes but the info above says that it added 8630 k-mers. So, I think that my ref file is not being read.

Then, I ran it with half the k-mers (only for R1):

bbduk.sh in1=R1_001.fastq.gz in2=R2_001.fastq.gz out1=bbmap_R1.fastq.gz out2=bbmap_R2.fastq.gz ref=RC_DLSO_without_start.fa rcomp=f restrictleft=31 hdist=0 minkmerfraction=1.0 tbo skipr2=t

Added 7164 kmers; time:     0.162 seconds.
Memory: max=1459m, free=1359m, used=100m
Input is being processed as paired
Started output streams: 0.123 seconds.
Processing time:        14.258 seconds.
Input:                      1089512 reads       164516312 bases.
Contaminants:               0 reads (0.00%)     0 bases (0.00%)
Trimmed by overlap:         51452 reads (4.72%)     2452840 bases (1.49%)
Total Removed:              0 reads (0.00%)     2452840 bases (1.49%)
Result:                     1089512 reads (100.00%)     162063472 bases (98.51%)

Above, it said it added 7164 k-mers intead of ~3250 in my ref file.

And it only took 14 secs for a file of 3,250 probes.

Can you help me figuring out with what is going on?

Thanks a lot in advance!

ADD REPLYlink modified 10 months ago • written 10 months ago by nbhardwaj110

Can you add k=13 ktrim=l and remove minkmerfraction=1.0 in your command line and re-try? Are your probe sequences in fasta format in that file?

ADD REPLYlink written 10 months ago by genomax55k

hi genomax, yes, that worked. What does k=13 mean?

Also, the reason why I had minmerfraction=1.0 is because I wanted to remove only those 5' parts that were 100% identical to the probe sequences. I hope that it still does that.

Yes, the probe sequences are in multi-fasta format. Thanks!

ADD REPLYlink written 10 months ago by nbhardwaj110

That is the k-mer size used for initial/seed matches. What do the stats look like?

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax55k
0
gravatar for chen
10 months ago by
chen1.7k
OpenGene
chen1.7k wrote:

If you have a file contains all the probes, you can use buduk.sh from BBMap suite to remove them, as @genomax recommended.

But if you don't have a file contains all the probes, you can use fastp to remove all the probes, without the need to provide the probe sequences. fastp trims adapters (probes) by finding the insert size for paired end data, and treating the read tails beyond insert size as adapters. So fastp can do this without adapter sequences provided.

The command is simple:

fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz

This tool is very fast (written in C++, with multithreading supported), you can get it from: https://github.com/OpenGene/fastp

ADD COMMENTlink written 10 months ago by chen1.7k

fastp trims adapters (probes) by finding the insert size for paired end data, and treating the read tails beyond insert size as adapters

@chen: Would you mind clarifying how this feature works if the PE reads do not overlap? They should not, in majority of cases with well made libraries. You do not appear to be using a reference in the above command line either.

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax55k

The trick is:

PE is not overlapped --> insert DNA is longer than sequencing length --> no adapters are sequenced.

The only issue is sequencing errors will cause mismatches in the overlapped region, but it can be handled by error tolerant algorithms.

ADD REPLYlink written 10 months ago by chen1.7k

Hi Chen, I do have the sequences of the probes but I still tried it nonetheless - it worked but it did not remove all the probes (only about half of them were removed). I think its because it does not have all the info (the exact sequences of all the probes).

ADD REPLYlink written 10 months ago by nbhardwaj110

How did you know a half of probes were removed and rest were kept? Can you post the table of adapter trimming result?

ADD REPLYlink written 10 months ago by chen1.7k

Hi chen, actually, it does work - I loaded it in a browser and it looked good. Thanks for the suggestion!

ADD REPLYlink written 10 months ago by nbhardwaj110

I am glad that it helps:)

ADD REPLYlink written 10 months ago by chen1.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1440 users visited in the last hour