Question: Have bbduk retain quality scores in output after kmer trimming
0
gravatar for caverill
3 months ago by
caverill40
caverill40 wrote:

I am trimming primers off of some .fastq files before processing them through the dada2 pipeline. dada2 wants primers already removed, but also wants quality scores. To trim primers I am using kmer trimming via bbduk and the following command:

bbmap/bbduk.sh in=input.fastq out=output.fastq literal=GTGYCAGCMGCCGCGGTAA ktrim=l k=10

The head of the input file looks like this:

@ERR1873185.1 AV103___MISEQ:377:000000000-ABUYP:1:1101:15575:1896/1
GTGTCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGTGTAGGCGGGTCTTTTATTCAGGGGGGAAATGCCCAGGCTCAACCTTGGAACTGCCTTTGATACTGGAGATCTTGAGTCCGGGGGGGGTGAGTGGATTTTCGAGTGTAGGGGTGAAATTCGTAGATATTCGCAAGAAAACCAGTGGCGAAGGCGGCTCCCTGGCCCGGGACTGACGCTGGGACGCGAAAGCGGGGGGGGCAACGAGGCTTAGCACCCCCTGTGG
+
4FFGGGFFAFGH?EE2BEGHHGHG2EGGGG?EGHF?FA0FHFGGGCFE4FFGF1?/EEDHHHDE/>/EFD3B@////0<11111<<1?F@@--;A9/B99:AE?FFEBF.//9///BF//99B//9999;//9;/9/:B////;FD?B----@-9../B.B/;B/;9.:/9//;/;..-;BB/;;.99..//99/.9..-.//.;.../:;.-;-9..@@>-9.;./.;A9B---;A./9....-;A.9@=;--./.-;==----;A.-..--.//////;:---.//.
@ERR1873185.2 AV103___MISEQ:377:000000000-ABUYP:1:1101:17496:1920/1
GTGCCAGCCGCCGCGGTAAGACGAACCGTGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCCCGTAGGCGGGCTGTCAAGTCTGGGGTGAAATCCCGCGGCTCAACCGTGGAACTGCCTTAGATACTGACGGCCTCGAGGGAGGTAGGGGCGAGCGGAACTGTGAGTGGAGCGGTGAAATGCGTTGATATTCACAGGAACTCCGGTGGCGAAGGCGGCTCGCTGGCCCTCTTCTGACGCTGATGCGCGAAAGCTAGGGGAGCAAACGGGATTAGATACCCTGGTAG
+
EG5FGG4FFEEG2E?2E?FDBFEEGEGEFHEEE0EEHF1FGHFCEEFHGHH4B1FEFBBGGH?F/>/EEAEBECGGGGHHBBBGHHFFHCGCHHHHHHHFGG?D@FHGHGBAEGGGGGGGGGG0BCBFGFGBFD@BBFFFAA->.A-.//.AD?BB-;??9BBFBBB/;FFFFFFB<.;9FFF/BD..99BFFFFBFBBEFFFFFFFBBB.9;>B??A>;>9A--..-.AEFFFFFFF/B?DFD;FFF;.9-@-ABF/F???A-:;FFFF...;B//;//9F99BEFF:

The head of the output file looks like this:

>ERR1873185.3001 AV103___MISEQ:377:000000000-ABUYP:1:1101:6839:18042/1
TACGAAGGGGGCTAGCGTTGCTCGGAATCACTGGGCGTAAAGCGCACGTAGGCGGCTTTTTAAGTCAGGG
GTGAAATCCTGGAGCTCAACTCCAGAACTGCCTTTGATACTGGGAAGCTTGAGTCCGGGAGAGGTGAGTG
GAACTGCGAGTGTAGAGGTGAAATTCGTAGATATTCGCAAGAACACCAGTGGCGAAGGCGGCTCACTGGC
CCGGTACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCTAGTAG
>ERR1873185.3002 AV103___MISEQ:377:000000000-ABUYP:1:1101:13081:18045/1
CACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGAGTAAAGAGCTCGTAGGCGGTCCGTCACGTCTGTT
GTGAAAATCCAGGGCTCAACCCTGGACTTGCTGCGGATACGGGCGGACTAGAGGTAGGTAGGGGAGAATG
GAATTCCCGGTGTAGCGGTGAAATGCGCAGATATCGGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGG
CCTTACCTGCCGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGAAACCCGTGTAG

how can I tell bbduk to just trim the primer sequence and retain the quality scores? Furthermore, The output file sequences are in a different order, anyway to have them stay in the same order? I'm worried my current bbduk call is also quality filtering/discarding which I do not want. Its difficult to tell from the file line counts because the sequences in the output of bbduk take up multiple lines in the output, rather than a single line.

primer trim bbmap • 189 views
ADD COMMENTlink written 3 months ago by caverill40
4
gravatar for genomax
3 months ago by
genomax59k
United States
genomax59k wrote:

That does not look correct.

BBTools will reformat output sequences/adjust input parameters based on file names (.fa is interpreted as fasta format needed/provided). It appears that your trimmed output is getting converted to fasta format even though the command you show above is for fastq format output. Can you address this discrepancy?

Note: You will need to include N's in literal= option (instead of IUPAC code) to cover all different possibilities.
Note 2: Add ordered=t to keep the output reads in same order as input.

ADD COMMENTlink modified 3 months ago • written 3 months ago by genomax59k

This helps! Yes, my output line originally was seqs.fna, I changed this when i was shortening the filepaths for this example, I didn't realize bbduk would use that as input to the function.

Where can I find that nuance in the bbduk documentation? I don't see it here: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/

ADD REPLYlink written 3 months ago by caverill40

It is up some items: Usage Guide.

File Formats and Extensions

BBTools support most standard sequence formats, including fastq, fasta, fasta+qual, scarf, sam, and (if samtools is installed) bam. They also support gzip and (if bzip2 or pbzip2 is installed) bzip2. The tools are sensitive to file extensions.

ADD REPLYlink written 3 months ago by h.mon22k

Also- bbduk can handle other degenerate base names than N. From bbduk documentation:

Matching degenerate sequences such as primers: bbduk.sh in=reads.fq out=matching.fq literal=ACGTTNNNNNGTC copyundefined k=13 mm=f

This will clone the reference sequences to represent every possibility for the degenerate bases (Ns and other non-ACGT IUPAC symbols).

ADD REPLYlink written 3 months ago by caverill40

One last question- the number of lines in the output file is fewer than the number of lines in the input file, implying some sequences have been removed. Specifically what are the criteria for removing a sequence from the output? As far as I can understand from the documentation this command trims primers, I have no idea under what conditions it removes sequences- if the k-mer sequence is not found within a read? something else?

ADD REPLYlink written 3 months ago by caverill40
1

If you had a primer dimer in the read then that would be almost certainly be thrown out. Other possibility is minlen=10 by default. If your read becomes shorter 10 bp after it gets trimmed then it will be removed. If a k-mer is not found in the read nothing should happen to it.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax59k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1734 users visited in the last hour