Hi everyone. I have paired-end transcriptome fastq files from 44 samples and I'm hoping to build de novo assemblies for each sample. The problem is there seems to be a fair amount of contamination in many of our samples that represents bacterial, human, and rodent sequence (our species is a teleost fish, btw).
I discovered this contamination by first mapping to an (incomplete) reference genome for DE analysis, then building a de novo transcriptome with Trinity from the pooled unmapped reads to find genes in genome gaps. Contaminant sequences were enriched in this assembly to the point where we accidentally built a partial mouse transcriptome.
For a follow-up analysis, I want to build de novo transcriptomes for each sample to compare certain gene variants between experimental groups, but I want to remove contamination first so I don't get weird hybrid contigs from orthologous genes across species. At this point I know exactly which contigs from the unmapped-read assembly represent contamination, and I know which reads from the unmapped read pool were used to build these contigs. Now I want to remove these reads from their respective samples before building sample-specific transcriptome assemblies.
I'm trying to use the filterbyname.sh script from BBMap with the command below, but it's running extremely slowly (4 days and counting for a single sample).
filterbyname.sh in=read1.fastq in2=read2.fastq out=filt_read1.fastq out2=filt_read2.fastq names=non-fish_reads.txt include=f substring=name
The file "non-fish_reads.txt" is a list of about 22 million unique fastq headers from across all 44 samples that represent apparent contaminants. I wish I had sample-specific contaminant lists, but the headers came from the pooled assembly, and I'm assuming that working out which sample each one belongs to would take almost as long as filtering them all at once.
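To make the goal concrete, here is a minimal Python sketch of the filtering I'm after. It is not what filterbyname.sh does internally. It assumes that exact matching on the read name is enough (the part of the header before the first space, with any trailing /1 or /2 removed), that both fastq files are uncompressed, use four lines per record, and list mates in the same order, and that the full name list fits in memory as a set (I haven't measured that for 22 million headers). The function names are my own and purely illustrative.

```python
from itertools import islice


def read_id(header):
    """Read name: drop the leading '@', anything after the first space,
    and a trailing /1 or /2 mate suffix (assumed header layout)."""
    name = header.rstrip("\n")[1:].split(" ", 1)[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def load_names(path):
    """Load the contaminant list into a set for constant-time lookups.
    Lines may or may not start with '@'; blank lines are skipped."""
    with open(path) as fh:
        return {
            read_id(line if line.startswith("@") else "@" + line)
            for line in fh
            if line.strip()
        }


def records(fh):
    """Yield fastq records as lists of 4 lines (assumes no wrapped sequences)."""
    while True:
        rec = list(islice(fh, 4))
        if len(rec) < 4:
            break
        yield rec


def filter_pairs(in1, in2, out1, out2, bad):
    """Write only the pairs where neither mate's name is in `bad`.
    Returns (pairs kept, pairs dropped)."""
    kept = dropped = 0
    with open(in1) as r1, open(in2) as r2, \
            open(out1, "w") as w1, open(out2, "w") as w2:
        for rec1, rec2 in zip(records(r1), records(r2)):
            if read_id(rec1[0]) in bad or read_id(rec2[0]) in bad:
                dropped += 1
                continue
            w1.writelines(rec1)
            w2.writelines(rec2)
            kept += 1
    return kept, dropped
```

With the file names from my command, this would be called as `bad = load_names("non-fish_reads.txt")` followed by `filter_pairs("read1.fastq", "read2.fastq", "filt_read1.fastq", "filt_read2.fastq", bad)`. Dropping the whole pair when either mate is listed keeps the two output files in sync, which is what I'd want going into a paired-end assembly.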
Can anyone recommend faster workarounds or alternative strategies to BBMap filtering? I was thinking of aligning the raw fastq files to the contaminant contigs at low stringency and exporting the unmapped reads for de novo assembly, but I worry that this method might still miss contaminant reads with overhangs, base miscalls, etc. It would also work if I could feed this read list directly to an assembler as reads to ignore, but I can't find such an option for Trinity.
Thanks in advance.