1
0
Entering edit mode
6 weeks ago

Hi Guys,

I am evaluating quality of multiple lane paired end FASTQ file using FastQC tool. It is showing Adapter content graph as Illumina Universal Adapter (screenshot attached). But in the read sequence itself (150 length), I could not find any adapter sequence.

What should I do in that case? I will highly appreciate your suggestions.

0
Entering edit mode

Hi Guys,

Thank you so much for the nice answers. I attempted to cut adapters via cutadapt using command below

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -o cut_Trueseq.fastq.gz input_L001_R1_001.fastq.gz

The fastq was successfully generated and evaluated using FASTQC tool again. I noticed that the reads before trimming were having sequence length of 151 but after trimming length varied from read to read based on position at which adapter was find and trimmed. This is reflected from the graph of sequence length distribution (trim.fastq; attached) which is no longer same as the original graph (original.fastq; attached). Therefore, I am wondering about the quality of my fastq file as FASTQC is giving red X sign on sequence length distribution.

1
Entering edit mode

FastQC Sequence Length Distribution module is set to assume all your sequences should have the same length. You can ignore the red X since you know you trimmed the sequences and therefore your sequence length is no longer uniform.

0
Entering edit mode

Agreed. The fastqc now looks perfectly fine and you can proceed with downstram analysis.

0
Entering edit mode

Hi Guys,

Thank you so much for the nice suggestions.

I also noticed that in some of my FASTQ files, FASTQC is showing presence of overrepresented sequences. How to handle this, whether they need to be removed ? If yes, how can I remove these sequences from my FASTQ files.

Thank You!!

0
Entering edit mode

Poly-G reads represent cluster producing no signal in two-color chemistry Illumina sequencers like NovaSeq,/NextSeq. This reads can be removed at the time of trimming. e.g. trimpolyg=0 with bbduk.sh.

3
Entering edit mode
6 weeks ago
GenoMax 111k

This is an aggregate plot of (millions) of reads. There are reads in your dataset that contain the adapter sequence. You are not going to "find" adapter sequences unless you use a dedicated program designed to find (and then trim) the adapters. I recommend bbduk (GUIDE) , fastp, cutadapt or trimmomatic.

0
Entering edit mode

Hi,

I just checked via terminal using below command for the presence of adapter sequence CTGTCTCTTATACACATCT

@NB501767:70:HTGTHBGXH:1:11101:24875:1041 1:N:0:GTCTGTCA GTATTNGAGTAAATAAGCCACTCTAATGCTTCANNTNNTGTNTTNAGATANGGNTNNNGGNCAGCAGAAAATTCAGTCTCAGCAGAAAAATGCCAAAAAGCAAGCTGGNNANAAGAAGAAACANGGANATGACCAAAAGAGATCGGAAGAG + AA6AA#EEAEEEEEEEEEEEE6EEAEAAAEE//##E##EEE#EE#EEEEE#EE#E###EE#A/EEEEE6EEEEEEEEEEEEEEEAEEEEEEA/EEEEEEEEEEEEEEA##A#EEEEEEEEAEE#/E<#EEAEAAEEEAEEEEEEEEEA<<A

@NB501767:70:HTGTHBGXH:1:11101:4941:1041 1:N:0:GTCTGTCA AACCCNTAAAATATGGGGTTGACACAAGTGGATNNCNNTGCNGTNAGGTGNCANANNNGGAATAACAGATTGTGGTTGCAGGTAGCAATGATCTGATGATTCCAATCANNCNCAGTGTTAAAGNTGGNAAGAGGGAGCCAGCAGACTGCAA + AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEE##E##EEE#EE#EEEEE#EE#E###EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEAEEE##A#AEEEEEEEEEE#AEE#/EEEEEAEAEEEAEEAAEAEE</

@NB501767:70:HTGTHBGXH:1:11101:2306:1042 1:N:0:GTCTGTCA GCCTGNGGTGGAGATCGGCATCAGTCAGCGCCANNATGCTGNGCNGCATGNCACCNANCTCATGTCCACCGAGCCTCTGCCCCGCACTCTGTCCCCGACTCCAGCTTCNNCNACAGCTCCACCCTCCNAGGGCATCCCGACATCAGATGAG + AAAAA#EE/EEAEAEEEEEA/EEAE/6EEEEEE##EEEEEE#/E#/E/EE#E/EE#/#E/EEAE/EE/EEE//EEEEEEEEAAEEEEEEEEEEEEEA/EEEE//E//E##E#/EAEE/AE</EAEE<#/E<EE/<AEAE/<A/EE/E//A6

@NB501767:70:HTGTHBGXH:1:11101:16511:1042 1:N:0:GTCTGTCA GTGTTNGGCTGACCTCATGGGGTGCAGGTGGAANNGAGGGGNCTNCAGCTNGCCCNGNTCAATGGCAGTGATGCCCAGGGCCCAGAGATCGGAAGAGCACACGTCTGAANTNCAGTCACGTCTGTCANTCTCGTATGCCGTCTTCTGCTTT + AAAAA#AAAEAEEEAEEEEEEEE/EEE/E/AEE##AEEAA6#EA#EE/E<#<E66#/#E<EE/<A<AE/<E<EEAEA/EEAAEEAEEEEEAEEEE<EEEEEE/AEE//E#<#<AAEEAE<<A/AEAE#//A66//EE///E//E//66/</

@NB501767:70:HTGTHBGXH:1:11101:11415:1042 1:N:0:GTCTGTCA CCCTGNCTCTACTAAAAAATACAAAAAAAAAAANNAGCTAGGTGNGGTGGNGGGCNCNTTGGCATTGGGGTCTTGCTTGAGATGGGCATCCAGCACTGCATCACTGATCNGNTCACAGATCTTATCTNGACAACAGCAAATATGGGTCAGA + AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEE##EEEEEEEEE#<EEEE#EEEA#E#A/EE//A/EEAA<AEAEAE/E/E/EEEEE/<EE/E<///EE/AE/AEEAE/#/#EEEA/E</</6/6AE#AEE/AE/EE/A/EEEE/E<AAA6

But, I could not find any adapter sequence at 3' or 5' end in these read sequences. In fact, I did not find any similar sequence at the end of these sequences (which could have been adapter sequence).

0
Entering edit mode

Adapters are in about 35% of your reads, so checking 20 out of millions is likely to not pick them up. There is also no need to do that manually. Just run the fastq files through any of the mentioned tools, this is super common and established, like using a pipet in the lab. Don't overthink it.

0
Entering edit mode

Examining a few reads is not enough. If you look at the plot above ~40% of your reads are going to have adapter. So trim the data using one of the programs above. Re-examine the trimmed data with FastQC. If you use the correct adapter sequence then that trace for Illumina adapter should disappear after trimming.

0
Entering edit mode

If you seriously want to look for your sequence, you need to look for it with grep.

zgrep CTGTCTCTTATACACATCT sample.L001_R1_001.fastq.gz | head

0
Entering edit mode

Hi,

Thank you so much for nice response. I tried trimming adaptor sequence and overrepresented sequence (PolyG) via bbduk using following command. But the adapter does not seem to be completely removed (graph attached). The overrepresented sequences seems to be removed. Is this command appropriate?

bbduk.sh -Xmx1g in1=S8_R2.fastq out1=S8_R2-trimmed-over-removed.fastq literal="AGATCGGAAGAGCGTCGTGTAGGGA" qtrim=rl trimq=20 ktrim=r k=25 filterpolyg=5

Executing jgi.BBDuk [-Xmx1g, in1=S8_R2.fastq, out1=S8_R2-trimmed-over-removed.fastq, literal=AGATCGGAAGAGCGTCGTGTAGGGA, qtrim=rl, trimq=20, ktrim=r, k=25, filterpolyg=5] Version 38.94

0.037 seconds. Initial: Memory: max=1073m, total=1073m, free=1044m, used=29m

Added 1 kmers; time: 0.004 seconds. Memory: max=1073m, total=1073m, free=1041m, used=32m

Input is being processed as unpaired Started output streams: 0.010 seconds. Processing time: 60.032 seconds.

Input: 66911845 reads 10103688595 bases. QTrimmed: 34321051 reads (51.29%) 1136130760 bases (11.24%) Polymer-trimmed: 460683 reads (0.69%) 68394918 bases (0.68%) KTrimmed: 16270705 reads (24.32%) 780566967 bases (7.73%) Total Removed: 1665051 reads (2.49%) 1985092645 bases (19.65%) Result: 65246794 reads (97.51%) 8118595950 bases (80.35%)

Time: 60.051 seconds. Reads Processed: 66911k 1114.26k reads/sec Bases Processed: 10103m 168.25m bases/sec

0
Entering edit mode
bbduk.sh -Xmx1g in1=S8_R2.fastq out1=S8_R2-trimmed-over-removed.fastq literal="AGATCGGAAGAGCGTCGTGTAGGGA" qtrim=rl trimq=20 ktrim=r k=25 filterpolyg=5


If you have paired-end data then you should not be trimming R1/R2 files individually. Doing so can get the reads out of sync which will cause problems with alignments. Try the following command instead.

bbduk.sh -Xmx1g in1=S8_R1.fastq in2=S8_R2.fastq out1=S8_R1-trimmed-over-removed.fastq out2=S8_R2-trimmed-over-removed.fastq literal=AGATCGGAAGAGCGTCGTGTAGGGA qtrim=rl trimq=20 ktrim=r k=13 filterpolyg=5 tbo tpe


Value of k should be set to less than half the length of the sequence you are trying to identify/trim. With PE data using options tbo tpe will remove residual extraneous bases down to a single base bp. There is not need to use quotes around sequence in literal option.

0
Entering edit mode

Hi Genomax,

That is great. I tried this over some of the FASTQ files and it is working perfectly fine. Thank you so much.

In order to proceed further, I need to align read1 and read2 fastq files to the reference genome. I was wondering which of the alignment method present in bwa should I refer? https://hcc.unl.edu/docs/applications/app_specific/bioinformatics_tools/alignment_tools/bwa/running_bwa_commands/

0
Entering edit mode

For most purposes bwa mem should be all you need after you create the necessary reference index.

0
Entering edit mode

Hi Genomax,

I run alignment using bwa mem using following command:

bwa mem GRCh37_latest_genomic.fa R1.fastq R2.fastq | samtools view -h -b -S -o aln_trimmed.bam

Then, I checked bam file using Picard tool as per following command:

java -jar \$Picard ValidateSamFile I=aln_trimmed.bam MODE=SUMMARY

Read groups are important for downstream analysis, e.g in running GATK. What should I do to get bam file with read group added? Please suggest.

0
Entering edit mode
0
Entering edit mode

samtools addreplacerg -r '@RG\tID:2021_501\tLB:2021_501\tPL:ILLUMINA\tSM:2021_501' -m overwrite_all -o aln_2021-501_trimmed-RG.bam aln_2021-501_trimmed.bam

Then, checked it using command as:

samtools view -H aln_2021-501_trimmed-RG.bam | grep '^@RG'

@RG ID:2021_501 LB:2021_501 PL:ILLUMINA SM:2021_501

Is this right? I mean I am not sure about LB and SM parameters.

Also, when I am creating bam index using samtools as:

samtools index aln_2021-501_trimmed-RG.bam aln_2021-501_trimmed-RG.bai

I got error as:

*[E::hts_idx_push] NO_COOR reads not in a single block at the end 9 -1

[E::sam_index] Read 'NB501767:73:HTGMTBGXH:1:11101:18855:1059' with ref_name='NC_000006.11', ref_length=171115067, flags=121, pos=67325270 cannot be indexed

samtools index: failed to create index for "aln_2021-501_trimmed-RG.bam"*

0
Entering edit mode

Hi,

samtools index aln_2021-502_trimmed.bam aln_2021-502_trimmed.bam.bai

[E::hts_idx_push] NO_COOR reads not in a single block at the end 27 -1 [E::sam_index] Read 'NB501767:75:HTHG2BGXH:1:11101:21020:1056' with ref_name='NC_000016.9', ref_length=90354753, flags=99, pos=28669300 cannot be indexed

samtools index: failed to create index for "aln_2021-502_trimmed.bam"

0
Entering edit mode