Hi all! I am new to the world of bioinformatics. I just received my first set of sequences and am having trouble removing primers and adapters with cutadapt. For reference, I used the 515F (Parada) and 806R (Apprill) primers to amplify the V4 region. The sequencing facility told me that the adapter used was CTGTCTCTTATACACATCT. Below is the code I used to run cutadapt on my samples. When I run FastQC on the trimmed sequences, I do not see any adapter removal besides polyA removal. The adapter content is <5%, but I was under the impression that it should be as close to zero as possible. Does anyone know what I am doing wrong?
for r1 in *_R1.fastq; do
    r2="${r1/_R1/_R2}"
    cutadapt --cores=4 \
        -g ^GTGYCAGCMGCCGCGGTAA \
        -G ^GGACTACNVGGGTWTCTAAT \
        -a CTGTCTCTTATACACATCT \
        -A CTGTCTCTTATACACATCT \
        -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
        -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
        -a A{10} \
        -A A{10} \
        --match-read-wildcards \
        --overlap 5 \
        --times 2 \
        --minimum-length 50 \
        --trim-n \
        --report=full \
        -o "cutadapt_results/trimmed-${r1}" \
        -p "cutadapt_results/trimmed-${r2}" \
        "${r1}" "${r2}" > "cutadapt_results/cutadapt_report-${r1%.fastq}.txt" 2>&1
done
Is that part of a pipeline or something you came up with yourself? If the former, can you post a link to it?
A little bit of both. I used the cutadapt website (https://cutadapt.readthedocs.io/en/stable/recipes.html) to learn the syntax and parameters. The -g and -G are the primers I used (515F Parada and 806R Apprill). The first -a and -A are for the adapter provided to me by the sequencing facility. I added the second, longer adapter because I saw "illumina_universal_adapter" in the adapter content section of my FastQC report. I found the sequence online, but I do not think it is correct, or that I used it correctly, because it failed to remove the "illumina_universal_adapter". I also had polyA artifacts, which were removed successfully with -a A{10} -A A{10}. I used --match-read-wildcards because I have degenerate bases in my primers. I'm not sure whether I need to use the reverse complement for the 3' end, though.
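On the reverse-complement question: when an amplicon is shorter than the read length, read 1 can run through into the reverse complement of the reverse primer, and read 2 into the reverse complement of the forward primer. The cutadapt recipes page linked above describes a linked-adapter form (`-a ^FWD...RCREV -A ^REV...RCFWD`) for this case. Also, as far as the cutadapt documentation goes, IUPAC wildcards in adapter sequences are matched by default, and `--match-read-wildcards` controls wildcards in the reads instead, so it should not be needed for degenerate primers. Below is a minimal sketch that computes the reverse complements and prints such a command without running it. The file names are placeholders, and `--discard-untrimmed` is taken from the recipe rather than from the original script.

```shell
# Reverse-complement a primer, including IUPAC ambiguity codes.
# S, W and N are their own complements, so tr leaves them unchanged.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTRYMKBVDH' 'TGCAYRKMVBHD'
}

fwd_primer=GTGYCAGCMGCCGCGGTAA     # 515F (Parada)
rev_primer=GGACTACNVGGGTWTCTAAT    # 806R (Apprill)

fwd_rc=$(revcomp "$fwd_primer")
rev_rc=$(revcomp "$rev_primer")

# Linked adapters: anchored 5' primer, then the reverse complement of the
# opposite primer in case the read runs through the whole amplicon.
r1_adapter="^${fwd_primer}...${rev_rc}"
r2_adapter="^${rev_primer}...${fwd_rc}"

# Dry run: print the command instead of executing it.
# sample_R1.fastq / sample_R2.fastq are placeholder file names.
echo cutadapt -a "$r1_adapter" -A "$r2_adapter" --discard-untrimmed \
    -o trimmed_R1.fastq -p trimmed_R2.fastq sample_R1.fastq sample_R2.fastq
```

For these two primers the reverse complements come out as TTACCGCGGCKGCTGRCAC (515F) and ATTAGAWACCCBNGTAGTCC (806R). Adapter trimming with the facility's sequence could then be done in a separate cutadapt pass, which avoids relying on `--times` to handle primers and adapters in one run.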
Update: I found the adapter list from FastQC (https://github.com/golharam/FastQC/blob/master/Configuration/adapter_list.txt) and used the sequences listed there to remove adapters using cutadapt with the updated script shown below:
I am still seeing overrepresented sequences in my post-cutadapt FastQC report. Could this mean the primers are not being removed? I set --times 2 in an attempt to remove primers after adapter removal, but I am not sure that is the right approach.
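One way to check directly whether primers remain, rather than inferring it from FastQC, is to count how many trimmed reads still start with the primer. The sketch below is a rough check under a few assumptions: the FASTQ files are uncompressed with four lines per record, the primer sits at the very start of the read, and no mismatches are allowed. The helper names `iupac_to_regex` and `count_primer_starts` are made up for illustration.

```shell
# Turn a primer with IUPAC ambiguity codes into an extended regular expression.
iupac_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Count reads whose sequence line starts with the given primer.
# $1 = primer, $2 = FASTQ file (plain text, four lines per record).
count_primer_starts() {
    # The sequence is the second line of every four-line record.
    # "|| true" keeps the exit status at 0 when grep finds no matches.
    awk 'NR % 4 == 2' "$2" | grep -c -E "^$(iupac_to_regex "$1")" || true
}
```

For example, `count_primer_starts GTGYCAGCMGCCGCGGTAA some_trimmed_R1.fastq` (placeholder file name) prints the number of reads that still begin with 515F. A count near zero after trimming, compared with a high count on the raw file, would suggest the primers are being removed and the overrepresented sequences come from something else.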
Isn't that expected, since you are working with 16S? Because the sequences are identical, they would be flagged as "overrepresented".
Don't be concerned with the FastQC report. The thresholds for its various tests are set for genomic sequence, which is not your case. Move on with the analysis, and if a notable issue comes up, come back and try to diagnose it.