I downloaded sequencing files from 2 patients from here: https://www.ebi.ac.uk/ena/browser/view/PRJNA588461?show=reads
There is one FASTQ file each for the forward (_1) and reverse (_2) reads.
I wanted to look at the quality of the data using fastqc, which I ran simply with fastqc *fastq.gz
The result was that the average forward sequence length was 8 bp, while the average reverse sequence length was 76 bp. How is this possible?
I am new to analyzing scRNA-seq data; perhaps there's a major step I am missing?
Thanks!
GenoMax
Here are the first few lines from each file. How could the first file, ending in "_1", be the index file? Shouldn't paired-end sequencing produce a forward read and a reverse read? Or, if this first file is the index file, do I disregard it and analyze only the second file (the next step being to align it to a reference genome)? Thanks!
10x sequencing uses paired-end reads, but only one of the two reads contains transcriptome data. One read generally holds the cell barcode + UMI (it is ~28 bp), while the second read represents the insert. One generally needs both reads for analysis.
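To make the barcode read layout concrete, here is a minimal sketch of how such a read could be split into its two parts. It assumes the cell barcode comes first (16 bp) and the UMI follows (10 bp for a 26 bp read, 12 bp for a 28 bp read); the function name, the default lengths, and the example sequence are illustrative only and are not taken from this dataset. In practice the pipeline does this for you.

```python
from typing import NamedTuple


class BarcodeRead(NamedTuple):
    cell_barcode: str
    umi: str


def split_barcode_read(seq: str, barcode_len: int = 16, umi_len: int = 10) -> BarcodeRead:
    """Split a barcode read into (cell barcode, UMI).

    Assumption: the barcode occupies the first `barcode_len` bases and the
    UMI the next `umi_len` bases. Any extra trailing bases are ignored.
    """
    needed = barcode_len + umi_len
    if len(seq) < needed:
        raise ValueError(
            f"read is {len(seq)} bp, but barcode + UMI need {needed} bp"
        )
    return BarcodeRead(seq[:barcode_len], seq[barcode_len:needed])


# Made-up 26 bp read: 16 bp barcode followed by a 10 bp UMI.
example = "ACGTACGTACGTACGT" + "TTTTGGGGCC"
print(split_barcode_read(example))
```

An 8 bp read, like the one in the "_1" file here, is too short to hold this information, which is why it cannot be the barcode read.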
This dataset appears to have been generated with v2 of the 10x kit, so the sequencing requirements are 26 x 8 x 98 bp (LINK). Either the submitters did not submit the dataset correctly or the repositories did something wrong, and now you have this odd 8 x 76 bp read structure. If you look at the SRA run browser for one of these samples (LINK), the Data Access tab lists the originally submitted R1, I1, R2 data files (which presumably will be R1 = 26 bp, I1 = 8 bp, R2 = 98 bp). You will need a Google Cloud account to access those, but that should be the data in the right format.
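If you want to check which file is which before (or after) fetching the original submission, a quick look at the read lengths is usually enough. The sketch below is only a rough heuristic: it assumes plain four-line FASTQ records, the length thresholds in guess_role are my own guesses based on the 26 x 8 x 98 bp layout above, and the function names are made up for illustration.

```python
import gzip
import sys
from collections import Counter
from itertools import islice


def read_length_counts(path, max_reads=10000):
    """Count read lengths over the first `max_reads` records of a FASTQ file.

    Assumption: each record is exactly four lines (no wrapped sequences).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        sequences = islice(handle, 1, None, 4)
        for seq in islice(sequences, max_reads):
            counts[len(seq.strip())] += 1
    return counts


def guess_role(length):
    """Heuristic guess at what a read of this length is (an assumption)."""
    if length <= 10:
        return "sample index (I1)"
    if length in (26, 28):
        return "cell barcode + UMI (R1)"
    return "insert / cDNA (R2)"


if __name__ == "__main__":
    # Usage: python check_lengths.py file_1.fastq.gz file_2.fastq.gz
    for fastq in sys.argv[1:]:
        counts = read_length_counts(fastq)
        if not counts:
            print(f"{fastq}: no reads found")
            continue
        length, _ = counts.most_common(1)[0]
        print(f"{fastq}: most common length {length} bp -> {guess_role(length)}")
```

With the files from ENA, this would be expected to label the "_1" file as the sample index, which matches the 8 bp average that FastQC reported.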