Question

I downloaded FASTQ files from a repository and ran FastQC on them. How can the average sequence length be only 8 bp?

2.6 years ago

lapower • 0

I downloaded sequencing files from 2 patients from here: https://www.ebi.ac.uk/ena/browser/view/PRJNA588461?show=reads

There is one FASTQ file each for the forward (1) and reverse (2) reads.

I wanted to look at the quality of the data using fastqc, which I ran simply with fastqc *fastq.gz

The result was that the average forward sequence length was 8 bp, while the average reverse sequence length was 76 bp. How is this possible?

I am new to analyzing scRNA-seq data; perhaps there's a huge step I am missing?

Thanks!

scRNA-seq fastqc fastq • 1.1k views

updated 2.6 years ago by GenoMax 141k • written 2.6 years ago by lapower • 0

score 1 · Answer 1 · 2021-09-15

2.6 years ago

GenoMax 141k

how can the average sequence length be only 8 bp

Because that file contains only the index sequence used to label that sample. Did you look inside the file? Every read will be essentially the same index sequence, so there is no point in running FastQC on it.
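A quick way to check this without reading through the whole file is to tally the most common read sequences. The sketch below assumes standard four-line FASTQ records; `top_sequences` is a hypothetical helper name, not part of FastQC or any other tool.

```shell
# Tally the most frequent read sequences in a FASTQ stream on stdin.
# Assumes four-line records, so sequences sit on lines 2, 6, 10, ...
# top_sequences is a hypothetical helper; its argument is the number
# of rows to show (default 5).
top_sequences() {
  awk 'NR % 4 == 2' | sort | uniq -c | sort -k1,1nr -k2,2 | head -n "${1:-5}"
}

# Usage (file name taken from this thread; first 100,000 reads only):
# zcat SRR10419623_1.fastq.gz | head -n 400000 | top_sequences 5
```

If one sequence, or a small handful of 8 bp sequences, accounts for nearly all reads, the file holds sample index reads rather than cDNA.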

2.6 years ago by GenoMax 141k

GenoMax

Here are the first few lines from each file. How could the first file, ending in "_1", be the index file? Shouldn't paired-end sequencing produce a forward read and a reverse read? Or, if the first file is the index file, do I disregard it and analyze only the second file (the next step being to align it to a reference genome)? Thanks!

zcat SRR10419623_1.fastq.gz | head

@SRR10419623.1 1/1
NTTCATGA
+
#A<AFJJF
@SRR10419623.2 2/1
NTTCATGA
+
#AAAFFJJ
@SRR10419623.3 3/1
NTTCATGA


zcat SRR10419623_2.fastq.gz | head

@SRR10419623.1 1/2
NNNAGAAACATACAATTCTTAAGTTATGCCTCTTAAACACATGAAGCACCAATTTTGTTAAAGACTGCCTAGATTT
+
###-<7--A7-7AAJJJAFJJJJFJJFJJAJJJJJJJJJJFA<AAAJA7AA<AJAJAAFJFJ<FJAJFJJ-FAJJJ
@SRR10419623.2 2/2
NNNTTGGCTGACTAGACTCATTATCTCTGTGAAGTTAGCAACTCTTAACCTCAATTTTGAATTTGAACTTATAATA
+
###-<7-7FJ-<FFJAJAJJJJJJJJJJJJJFFJJJJJJJJJJJJJ<JFJJ<AJFJFJJJJJJFJJAJJJJJJFJF
@SRR10419623.3 3/2
NNNTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTAAAAGGGGTGAAATCCCGT

2.6 years ago by lapower • 0

10x sequencing uses paired-end reads, but only one of the two reads contains transcriptome data. One read generally holds the cell barcode + UMI (~28 bp), while the second read represents the insert. Both reads are generally needed for analysis.
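To make the barcode + UMI layout concrete, here is a minimal sketch that splits each sequence of a barcode read into its two parts. The default lengths (16 bp cell barcode followed by a 10 bp UMI) are an assumption for illustration and should be replaced with the values for the kit actually used; `split_barcode_umi` is a hypothetical helper name.

```shell
# Split each sequence of a barcode read into cell barcode and UMI (tab-separated).
# Reads a FASTQ stream on stdin; assumes four-line records.
# Defaults (16 bp barcode, 10 bp UMI) are assumptions; pass other lengths as arguments.
split_barcode_umi() {
  awk -v bc="${1:-16}" -v umi="${2:-10}" \
    'NR % 4 == 2 { print substr($0, 1, bc) "\t" substr($0, bc + 1, umi) }'
}

# Usage (R1.fastq.gz is a placeholder file name):
# zcat R1.fastq.gz | head -n 40 | split_barcode_umi 16 10
```

This is only for inspecting the read layout. Dedicated single-cell pipelines parse the barcode and UMI themselves, which is why they need the barcode read alongside the insert read.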

This dataset appears to have been generated with v2 of the 10x kit, so the sequencing requirements are 26 x 8 x 98 bp (LINK). Either the submitters did not submit the dataset correctly or the repositories did something wrong, and now you have this odd 8 x 76 bp read structure. If you look at the SRA run browser for one of these samples (LINK), the Data Access tab lists the originally submitted R1, I1 and R2 data files (which presumably will be R1 = 26 bp, I1 = 8 bp, R2 = 98 bp). You will need a Google Cloud account to access those, but that should be the data in the right format.
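Whichever copy of the data you end up with, tabulating read lengths per file shows which read structure it has. A minimal sketch, again assuming four-line FASTQ records; `read_length_hist` is a hypothetical helper name.

```shell
# Print a table of read length and read count for a FASTQ stream on stdin.
# Assumes four-line records, so sequences sit on lines 2, 6, 10, ...
read_length_hist() {
  awk 'NR % 4 == 2 { n[length($0)]++ }
       END { for (len in n) print len "\t" n[len] }' | sort -n
}

# Usage (file names taken from this thread; first 10,000 reads of each file):
# for f in SRR10419623_1.fastq.gz SRR10419623_2.fastq.gz; do
#   echo "$f"; zcat "$f" | head -n 40000 | read_length_hist
# done
```

If the originally submitted files match the presumed layout, R1, I1 and R2 should come out at 26, 8 and 98 bp respectively.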

2.6 years ago by GenoMax 141k