Have bbduk retain quality scores in output after kmer trimming
1
0
Entering edit mode
3.1 years ago
caverill ▴ 40

I am trimming primers off of some .fastq files before processing them through the dada2 pipeline. dada2 wants primers already removed, but also wants quality scores. To trim primers I am using kmer trimming via bbduk and the following command:

bbmap/bbduk.sh in=input.fastq out=output.fastq literal=GTGYCAGCMGCCGCGGTAA ktrim=l k=10


The head of the input file looks like this:

@ERR1873185.1 AV103___MISEQ:377:000000000-ABUYP:1:1101:15575:1896/1
GTGTCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGTGTAGGCGGGTCTTTTATTCAGGGGGGAAATGCCCAGGCTCAACCTTGGAACTGCCTTTGATACTGGAGATCTTGAGTCCGGGGGGGGTGAGTGGATTTTCGAGTGTAGGGGTGAAATTCGTAGATATTCGCAAGAAAACCAGTGGCGAAGGCGGCTCCCTGGCCCGGGACTGACGCTGGGACGCGAAAGCGGGGGGGGCAACGAGGCTTAGCACCCCCTGTGG
+
4FFGGGFFAFGH?EE2BEGHHGHG2EGGGG?EGHF?FA0FHFGGGCFE4FFGF1?/EEDHHHDE/>/EFD3B@////0<11111<<1?F@@--;A9/B99:AE?FFEBF.//9///BF//99B//9999;//9;/9/:B////;FD?B----@-9../B.B/;B/;9.:/9//;/;..-;BB/;;.99..//99/.9..-.//.;.../:;.-;-9..@@>-9.;./.;A9B---;A./9....-;A.9@=;--./.-;==----;A.-..--.//////;:---.//.
GTGCCAGCCGCCGCGGTAAGACGAACCGTGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCCCGTAGGCGGGCTGTCAAGTCTGGGGTGAAATCCCGCGGCTCAACCGTGGAACTGCCTTAGATACTGACGGCCTCGAGGGAGGTAGGGGCGAGCGGAACTGTGAGTGGAGCGGTGAAATGCGTTGATATTCACAGGAACTCCGGTGGCGAAGGCGGCTCGCTGGCCCTCTTCTGACGCTGATGCGCGAAAGCTAGGGGAGCAAACGGGATTAGATACCCTGGTAG
+


The head of the output file looks like this:

>ERR1873185.3001 AV103___MISEQ:377:000000000-ABUYP:1:1101:6839:18042/1
TACGAAGGGGGCTAGCGTTGCTCGGAATCACTGGGCGTAAAGCGCACGTAGGCGGCTTTTTAAGTCAGGG
GTGAAATCCTGGAGCTCAACTCCAGAACTGCCTTTGATACTGGGAAGCTTGAGTCCGGGAGAGGTGAGTG
GAACTGCGAGTGTAGAGGTGAAATTCGTAGATATTCGCAAGAACACCAGTGGCGAAGGCGGCTCACTGGC
CCGGTACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCTAGTAG
CACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGAGTAAAGAGCTCGTAGGCGGTCCGTCACGTCTGTT
GTGAAAATCCAGGGCTCAACCCTGGACTTGCTGCGGATACGGGCGGACTAGAGGTAGGTAGGGGAGAATG
GAATTCCCGGTGTAGCGGTGAAATGCGCAGATATCGGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGG
CCTTACCTGCCGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGAAACCCGTGTAG


how can I tell bbduk to just trim the primer sequence and retain the quality scores? Furthermore, The output file sequences are in a different order, anyway to have them stay in the same order? I'm worried my current bbduk call is also quality filtering/discarding which I do not want. Its difficult to tell from the file line counts because the sequences in the output of bbduk take up multiple lines in the output, rather than a single line.

trim primer bbmap • 996 views
4
Entering edit mode
3.1 years ago
GenoMax 107k

That does not look correct.

BBTools will reformat output sequences/adjust input parameters based on file names (.fa is interpreted as fasta format needed/provided). It appears that your trimmed output is getting converted to fasta format even though the command you show above is for fastq format output. Can you address this discrepancy?

Note: You will need to include N's in literal= option (instead of IUPAC code) to cover all different possibilities.
Note 2: Add ordered=t to keep the output reads in same order as input.

0
Entering edit mode

This helps! Yes, my output line originally was seqs.fna, I changed this when i was shortening the filepaths for this example, I didn't realize bbduk would use that as input to the function.

Where can I find that nuance in the bbduk documentation? I don't see it here: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/

0
Entering edit mode

It is up some items: Usage Guide.

File Formats and Extensions

BBTools support most standard sequence formats, including fastq, fasta, fasta+qual, scarf, sam, and (if samtools is installed) bam. They also support gzip and (if bzip2 or pbzip2 is installed) bzip2. The tools are sensitive to file extensions.

0
Entering edit mode

Also- bbduk can handle other degenerate base names than N. From bbduk documentation:

Matching degenerate sequences such as primers: bbduk.sh in=reads.fq out=matching.fq literal=ACGTTNNNNNGTC copyundefined k=13 mm=f

This will clone the reference sequences to represent every possibility for the degenerate bases (Ns and other non-ACGT IUPAC symbols).

0
Entering edit mode

One last question- the number of lines in the output file is fewer than the number of lines in the input file, implying some sequences have been removed. Specifically what are the criteria for removing a sequence from the output? As far as I can understand from the documentation this command trims primers, I have no idea under what conditions it removes sequences- if the k-mer sequence is not found within a read? something else?

1
Entering edit mode

If you had a primer dimer in the read then that would be almost certainly be thrown out. Other possibility is minlen=10 by default. If your read becomes shorter 10 bp after it gets trimmed then it will be removed. If a k-mer is not found in the read nothing should happen to it.