Question: Weird base qualities and sequences from FastQ file?
2.4 years ago by
germelcar20
Mexico/Ensenada/CICESE
germelcar20 wrote:

Hi everyone:

I am having some difficulties with something that is new and strange to me, and I would appreciate any help or recommendations.

I noticed that the FASTQ files contain a large number of reads whose qualities are all "#" and whose bases are all "N" for the entire read. In some other reads, the first few bases are called fine, but the rest of the read ends in a long run of "N". Here are the head and tail (2 spots each) and an extract from the middle of the /1 read file.

head

@SRR1812885.1 HWUSI-ES1807:12:FC:7:1:7609:1000/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
######################################################################
@SRR1812885.2 HWUSI-ES1807:12:FC:7:1:10872:1000/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
######################################################################

tail

@SRR1812885.70669746 HWUSI-ES1807:12:FC:7:120:13034:23950/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
######################################################################
@SRR1812885.70669747 HWUSI-ES1807:12:FC:7:120:3941:23950/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
######################################################################

some part of the /1.fastq file

@SRR1812885.415516 HWUSI-ES1807:12:FC:7:2:4966:1022/1
NCTCAAGTCATCATGTTCTTGATGTTTACGACGATAGTCTTGTTCAAACCTATCCAATGCATGCAATTCT
+
######################################################################
@SRR1812885.415517 HWUSI-ES1807:12:FC:7:2:16684:1022/1
NGACTGGCACACCCGCTACAAAATTATCAAGGGAACCTGCGAGGGCCTAAAATATCTTCATGAGTTGATG
+
#*((*,-,-+@@@@@@@8@@################################################## 
@SRR1812885.415518 HWUSI-ES1807:12:FC:7:2:1971:1022/1
NGTCTTTGTACAATCTCTTCCACCAATACACAGCATCCATATAATGTAGGATCATCAGCAACCTGTAAAC
+
#*++)-//-/@@@@@@C@@@:<:<<25777@@@@@837997745598979:::::<<8802222211433
@SRR1812885.415519 HWUSI-ES1807:12:FC:7:2:19373:1022/1
NTGGGCATAGGTTATATCTATTTTGCCAGTCAGCATGTTGCAGCTATTTCAAGGCATGGTGTTCTATGCT
+
#)).(+,-+*+22210,-0,:::22@@0@@8::::@@5@@##############################
@SRR1812885.415520 HWUSI-ES1807:12:FC:7:2:5742:1022/1
NCGCCATCTGAGAAAAGCACGCCTTGCCACAAGCTCCTTTCCATTGCGTTCTCTGCGTGCAGCATCTGCT
+
######################################################################
@SRR1812885.415521 HWUSI-ES1807:12:FC:7:2:1502:1023/1
NAATTCCATACTTTGAATACTAGTTATGAGGTGATACTTAGGGACAAAGCAGTCTTTTCAAAAATCCAAG
+
######################################################################

Is everything OK? It just looks strange to me; I have never seen anything like it before. Thanks in advance.

rna-seq next-gen fastq assembly • 895 views
modified 2.4 years ago by mastal5112.0k • written 2.4 years ago by germelcar20

How was the data obtained? Which processing steps were performed?

written 2.4 years ago by WouterDeCoster38k

Hi WouterDeCoster, thanks for your reply.

Here is a verbatim copy of the "library preparation and Illumina sequencing" section of the paper:

The quality of total RNA was checked using a NanoDrop 2000 (Thermo Fisher, USA) and an Agilent 2100 Bioanalyzer (Agilent, USA). The RNA Integrity Number of RNA obtained was greater than 8. mRNA was purified from total RNA by RNA purification beads, then fragmented and primed for cDNA synthesis according to the manufacturer’s instructions (Illumina, USA). Double-stranded cDNA was synthesized using the SuperScript Double-Stranded cDNA Synthesis kit (Invitrogen, USA) and then purified using Agencourt AMPure XP beads (Beckman Coulter, Inc, USA). End repairing and 3′-ends adenylation were performed following the RNA adapters ligation. After enrichment of DNA fragments library templates were validated using the Agilent 2100 Bioanalyzer (Agilent, USA). Using TruSeq PE Cluster Kit v2 and cBot automated system (Illumina, USA) clonal clusters were created from DNA library templates. Clusters obtained were finally used to perform paired-end runs by Genome Analyzer IIx (Illumina, USA).

I am not sure if that is what you are asking for. Thanks in advance.

written 2.4 years ago by germelcar20
2.4 years ago by
genomax65k
United States
genomax65k wrote:

It is possible that your sequence provider included reads that had failed Illumina's internal quality control in this output. You can use bbduk.sh from BBMap with qtrim=rl trimq=1 to remove those N calls.

@mastal511 makes a perfect observation, which is the likely cause.
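
To make the idea of quality trimming at both ends concrete, here is a minimal Python sketch. It is not BBDuk's implementation: the function names are hypothetical, Phred+33 encoding is assumed (the excerpts above contain characters such as `*` and `)`, which fall below ASCII 64), and treating every N call as quality 0 is an assumption of this sketch so that a threshold of 1 removes N calls.

```python
def effective_quals(seq, qual, offset=33):
    """Phred scores per base; N calls are forced to 0 (sketch assumption)."""
    return [0 if base == "N" else ord(ch) - offset
            for base, ch in zip(seq, qual)]


def trim_ends(seq, qual, trimq=1):
    """Drop bases from both ends while their effective quality is below trimq."""
    quals = effective_quals(seq, qual)
    left, right = 0, len(quals)
    while left < right and quals[left] < trimq:
        left += 1
    while right > left and quals[right - 1] < trimq:
        right -= 1
    return seq[left:right], qual[left:right]


# A read made only of N calls is trimmed to nothing.
print(trim_ends("N" * 70, "#" * 70))
# A single leading N is removed; the rest of the read is kept.
print(trim_ends("NACGT", "#IIII"))
```

With a threshold this low, only N calls are removed; bases that were called but scored `#` (Q2) stay in the read, so a stricter trimmer or threshold would still be needed afterwards.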

modified 2.4 years ago • written 2.4 years ago by genomax65k
2.4 years ago by
mastal5112.0k
mastal5112.0k wrote:

I notice that the read headers start with @SRR1812885, indicating that this is data from the SRA. This looks like older sequencing data, where it was typical to find low quality reads at the edges of the flowcell, which ended up at the start and the end of the fastq files. It was quite typical to have reads made up of Ns and very low base qualities like '#' at the beginning and end of the fastq files. As long as most of the data in the middle of the file is OK, it looks normal.
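
One way to check that "most of the data in the middle of the file is OK" is to count how many reads are all N or carry only `#` qualities. The following is a rough Python sketch, not a replacement for a proper QC tool: it assumes plain four-line FASTQ records and Phred+33 encoding, under which `#` decodes to Q2, and the helper names are made up for illustration. The three example records are taken from the excerpts in the question.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield header, seq, qual


def phred33(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in qual]


def summarize(handle):
    """Count total reads, reads that are all N, and reads with every base <= Q2."""
    total = all_n = all_q2 = 0
    for _, seq, qual in read_fastq(handle):
        total += 1
        if seq and set(seq) == {"N"}:
            all_n += 1
        if qual and all(q <= 2 for q in phred33(qual)):
            all_q2 += 1
    return {"total": total, "all_N": all_n, "all_Q2": all_q2}


records = [
    ("@SRR1812885.1 HWUSI-ES1807:12:FC:7:1:7609:1000/1", "N" * 70, "#" * 70),
    ("@SRR1812885.415516 HWUSI-ES1807:12:FC:7:2:4966:1022/1",
     "NCTCAAGTCATCATGTTCTTGATGTTTACGACGATAGTCTTGTTCAAACCTATCCAATGCATGCAATTCT",
     "#" * 70),
    ("@SRR1812885.415518 HWUSI-ES1807:12:FC:7:2:1971:1022/1",
     "NGTCTTTGTACAATCTCTTCCACCAATACACAGCATCCATATAATGTAGGATCATCAGCAACCTGTAAAC",
     "#*++)-//-/@@@@@@C@@@:<:<<25777@@@@@837997745598979:::::<<8802222211433"),
]
example = "".join(f"{h}\n{s}\n+\n{q}\n" for h, s, q in records)
print(summarize(StringIO(example)))  # {'total': 3, 'all_N': 1, 'all_Q2': 2}
```

On a real file, passing `open("reads_1.fastq")` (a placeholder file name) instead of the `StringIO` object would give the same counts for the whole run.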

written 2.4 years ago by mastal5112.0k

Hi mastal511, thanks for your reply.

I do not understand what you mean by older sequencing data. What does newer sequencing data look like? As I understand it, the data was sequenced on a HiSeq 2000, which I did not think was very old.

Do you recommend using the reads without any processing or filtering (such as quality trimming)?

Thanks in advance.

modified 2.4 years ago • written 2.4 years ago by germelcar20

No, you should definitely use something like Trimmomatic or another quality/adapter trimmer to get rid of low-quality reads or parts of reads.

I looked at the SRA page, too. In one section it said HiSeq2000, in another section it said GAII. It looks much more like what I used to see from GAII data, but it may just depend on to what extent the data has been filtered by the sequencer software.

modified 2.4 years ago • written 2.4 years ago by mastal5112.0k

Thanks for your help. The authors say the following about how the reads were filtered:

70 bp paired-end reads were prepared for assembly by Q15 filtering, removal of library adapter sequences, removal of A/T stretches and short reads (less than 60 bp). The percent of rejected short reads was 0.6 %.

My understanding is that the recommended quality threshold is >= 30 (keeping only reads at or above that quality). Should I apply quality trimming with that threshold?
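
For reference when comparing thresholds such as Q15 and Q30: a Phred score Q corresponds to an error probability of 10^(-Q/10). The short Python sketch below (function names are illustrative, and Phred+33 encoding is assumed for the quality string) converts scores to error probabilities and sums them over a read.

```python
def phred_to_error(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def expected_errors(qual, offset=33):
    """Expected number of wrong base calls in a read, from its quality string."""
    return sum(phred_to_error(ord(ch) - offset) for ch in qual)


for q in (2, 15, 30):
    print(f"Q{q}: error probability {phred_to_error(q):.4f}")

# Expected errors in a 70-base read whose qualities are all '#'.
print(f"{expected_errors('#' * 70):.1f}")
```

So Q15 means roughly a 3% chance of error per base and Q30 means 0.1%, while a 70-base read scored entirely as `#` (Q2) would be expected to contain about 44 wrong calls.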

Many thanks.

modified 2.4 years ago • written 2.4 years ago by germelcar20

Try using Trimmomatic with some of the default threshold values. I'm not sure whether it has a threshold for the quality of the whole read.
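
As a rough illustration of the two ideas mentioned here, trimming with a sliding window and filtering on the quality of the whole read, the Python sketch below shows one simplified way to do each. It is not Trimmomatic's code, and the window size of 4, the mean-quality cutoff of 15, and the function names are assumptions chosen for the example.

```python
def sliding_window_cut(quals, window=4, min_mean=15):
    """Return how many leading bases to keep.

    Scans from the 5' end and cuts at the start of the first window whose
    mean quality is below min_mean. Reads shorter than the window are kept whole.
    """
    for start in range(max(len(quals) - window + 1, 0)):
        win = quals[start:start + window]
        if sum(win) / window < min_mean:
            return start
    return len(quals)


def mean_quality(quals):
    """Average Phred score of a read (0.0 for an empty read)."""
    return sum(quals) / len(quals) if quals else 0.0


def keep_read(quals, min_len=60, min_read_mean=15):
    """Whole-read filter: long enough after trimming and good on average."""
    kept = quals[:sliding_window_cut(quals)]
    return len(kept) >= min_len and mean_quality(kept) >= min_read_mean


good_then_bad = [30] * 10 + [2] * 10
print(sliding_window_cut(good_then_bad))  # 9
print(keep_read([2] * 70))                # False
print(keep_read([40] * 70))               # True
```

The minimum length of 60 mirrors the "short reads (less than 60 bp)" cutoff quoted from the paper; any other value could be substituted.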

modified 2.4 years ago • written 2.4 years ago by mastal5112.0k

Thanks. I will try Trimmomatic and also BBMap, as @genomax2 suggested.

Many thanks for the help.

written 2.4 years ago by germelcar20