Question: Remove hundreds of probes
0
gravatar for nbhardwaj
12 days ago by
nbhardwaj80
United States
nbhardwaj80 wrote:

Hi, I have paired-end fastq file which contain the sequences of the target regions along with probes sequences on either side (in entirety or partially). I have sequences of all the 3,200 probes. I want to quickly trim the probe sequences from the fastq files if they occur on either side of the target region (only if they start with or end with the probes).

Does anyone have a suggestion for a tool to do the above? I have tried Trimmomatic and Cutadapt but they are too slow as they are designed with only a handful of probes (or adapters) in mind?

Is there anyway to remove 1000's of probes that occur at the end of the reads in paired-end fastq files? Thanks!

trimming probes • 156 views
ADD COMMENTlink modified 12 days ago by chen1.1k • written 12 days ago by nbhardwaj80
0
gravatar for genomax
12 days ago by
genomax37k
United States
genomax37k wrote:

You can use bbduk.sh from BBMap suite. Provide the sequences of the probes in multi-fasta format in a file. You will have to do ktrim=lr if the sequences are on either side (or may have to do a two-pass trim).

ADD COMMENTlink written 12 days ago by genomax37k

Thanks, genomax. I did try bbduk.sh but it did not give me what I expected. I ran it to only look for probes on the 5' end with exact matches in the first 31 bp as follows:

bbduk.sh in1=L001_R1_001.fastq.gz in2=L001_R2_001.fastq.gz out1=bbmap_R1.fastq.gz out2=bbmap_R2.fastq.gz ref=RC_DLSO_and_ULSO.fa rcomp=f restrictleft=31 hdist=0 minkmerfraction=1.0 tbo

Added 8630 kmers; time:     0.225 seconds.
Memory: max=1457m, free=1396m, used=61m
Input is being processed as paired
Started output streams: 0.121 seconds.
Processing time:        14.583 seconds.
Input:                      1089512 reads       164516312 bases.
Contaminants:               0 reads (0.00%)     0 bases (0.00%)
Trimmed by overlap:         51452 reads (4.72%)     2452840 bases (1.49%)
Total Removed:              0 reads (0.00%)     2452840 bases (1.49%)
Result:                     1089512 reads (100.00%)     162063472 bases (98.51%)

As you can see, it did not remove any bases due to matches with the probes (although I checked that they are there) but only due to overlap. Moreover, my ref file (RC_DLSO_and_ULSO.fa) contains ~6,500 probes but the info above says that it added 8630 k-mers. So, I think that my ref file is not being read.

Then, I ran it with half the k-mers (only for R1):

bbduk.sh in1=R1_001.fastq.gz in2=R2_001.fastq.gz out1=bbmap_R1.fastq.gz out2=bbmap_R2.fastq.gz ref=RC_DLSO_without_start.fa rcomp=f restrictleft=31 hdist=0 minkmerfraction=1.0 tbo skipr2=t

Added 7164 kmers; time:     0.162 seconds.
Memory: max=1459m, free=1359m, used=100m
Input is being processed as paired
Started output streams: 0.123 seconds.
Processing time:        14.258 seconds.
Input:                      1089512 reads       164516312 bases.
Contaminants:               0 reads (0.00%)     0 bases (0.00%)
Trimmed by overlap:         51452 reads (4.72%)     2452840 bases (1.49%)
Total Removed:              0 reads (0.00%)     2452840 bases (1.49%)
Result:                     1089512 reads (100.00%)     162063472 bases (98.51%)

Above, it said it added 7164 k-mers intead of ~3250 in my ref file.

And it only took 14 secs for a file of 3,250 probes.

Can you help me figuring out with what is going on?

Thanks a lot in advance!

ADD REPLYlink modified 9 days ago • written 9 days ago by nbhardwaj80

Can you add k=13 ktrim=l and remove minkmerfraction=1.0 in your command line and re-try? Are your probe sequences in fasta format in that file?

ADD REPLYlink written 9 days ago by genomax37k

hi genomax, yes, that worked. What does k=13 mean?

Also, the reason why I had minmerfraction=1.0 is because I wanted to remove only those 5' parts that were 100% identical to the probe sequences. I hope that it still does that.

Yes, the probe sequences are in multi-fasta format. Thanks!

ADD REPLYlink written 7 days ago by nbhardwaj80

That is the k-mer size used for initial/seed matches. What do the stats look like?

ADD REPLYlink modified 7 days ago • written 7 days ago by genomax37k
0
gravatar for chen
12 days ago by
chen1.1k
OpenGene
chen1.1k wrote:

If you have a file contains all the probes, you can use buduk.sh from BBMap suite to remove them, as @genomax recommended.

But if you don't have a file contains all the probes, you can use fastp to remove all the probes, without the need to provide the probe sequences. fastp trims adapters (probes) by finding the insert size for paired end data, and treating the read tails beyond insert size as adapters. So fastp can do this without adapter sequences provided.

The command is simple:

fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz

This tool is very fast (written in C++, with multithreading supported), you can get it from: https://github.com/OpenGene/fastp

ADD COMMENTlink written 12 days ago by chen1.1k

fastp trims adapters (probes) by finding the insert size for paired end data, and treating the read tails beyond insert size as adapters

@chen: Would you mind clarifying how this feature works if the PE reads do not overlap? They should not, in majority of cases with well made libraries. You do not appear to be using a reference in the above command line either.

ADD REPLYlink modified 12 days ago • written 12 days ago by genomax37k

The trick is:

PE is not overlapped --> insert DNA is longer than sequencing length --> no adapters are sequenced.

The only issue is sequencing errors will cause mismatches in the overlapped region, but it can be handled by error tolerant algorithms.

ADD REPLYlink written 12 days ago by chen1.1k

Hi Chen, I do have the sequences of the probes but I still tried it nonetheless - it worked but it did not remove all the probes (only about half of them were removed). I think its because it does not have all the info (the exact sequences of all the probes).

ADD REPLYlink written 9 days ago by nbhardwaj80

How did you know a half of probes were removed and rest were kept? Can you post the table of adapter trimming result?

ADD REPLYlink written 9 days ago by chen1.1k

Hi chen, actually, it does work - I loaded it in a browser and it looked good. Thanks for the suggestion!

ADD REPLYlink written 7 days ago by nbhardwaj80

I am glad that it helps:)

ADD REPLYlink written 7 days ago by chen1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1370 users visited in the last hour