Problem splitting SRA files into paired-end reads
3.8 years ago
c910816946 • 0

I downloaded data from ENA; the BioProject is PRJNA545730.

I split SRA files with:

fastq-dump --split-files SRR9167437

It generates 4 files:

SRR9167437_1.fastq
SRR9167437_2.fastq
SRR9167437_3.fastq
SRR9167437_4.fastq

Running head on each of these files gives:

head SRR9167437_1.fastq

@SRR9167437.1 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:48224 length=50
GATGCAGATTAAGCAAGCACCACACACCACCCCCAACAACCGCCCCGGGG
+SRR9167437.1 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:48224 length=50
<BB###############################################
@SRR9167437.2 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:6480 length=50
AAGTTTAAGGTACTGCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+SRR9167437.2 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:6480 length=50
/BBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBF##
@SRR9167437.3 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10001:25061 length=50
TTCCGGTTGATCGCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

head SRR9167437_2.fastq

@SRR9167437.1 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:48224 length=50
GTTCCTCTCACCATAAAATGAGGAATCCAGATTGTTTCAAAGGATGGTGC
+SRR9167437.1 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:48224 length=50
BBBBBFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR9167437.2 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:6480 length=50
TCCCAGGGGTTCGATAGAAGGAGGATTTCAGCTTTGCCCAAGAATGTCTA
+SRR9167437.2 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:6480 length=50
BBBBBFFFBFFFF/FFFFFFFFFBFF<FFFFFF<BFFFFFFFFFB<BFF<
@SRR9167437.3 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10001:25061 length=50
GTAAACATATTTTTAATGCATACTTAAGTAATATTTAAGAAACTAAACAA

head SRR9167437_3.fastq

@SRR9167437.1 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:48224 length=10
ACCAGGCGCA
+SRR9167437.1 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:48224 length=10
BBBBBFFFFF
@SRR9167437.2 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:6480 length=10
ACCAGGCGCA
+SRR9167437.2 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:6480 length=10
BBBBBFFFFF
@SRR9167437.3 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10001:25061 length=10
ACCAGGCGCA

head SRR9167437_4.fastq

@SRR9167437.1 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:48224 length=10
GATGCAGTTC
+SRR9167437.1 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:48224 length=10
BBBBBFFFFF
@SRR9167437.2 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:6480 length=10
GATGCAGTTC
+SRR9167437.2 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:6480 length=10
BBBBBFFFF<
@SRR9167437.3 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10001:25061 length=10
GATGCAGTTC

I also tried dumping without splitting:

fastq-dump SRR9167437
head SRR9167437.fastq

@SRR9167437.1 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:48224 length=120
GATGCAGATTAAGCAAGCACCACACACCACCCCCAACAACCGCCCCGGGGGTTCCTCTCACCATAAAATGAGGAATCCAGATTGTTTCAAAGGATGGTGCACCAGGCGCAGATGCAGTTC
+SRR9167437.1 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:48224 length=120
<BB###############################################BBBBBFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBBFFFFFBBBBBFFFFF
@SRR9167437.2 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:6480 length=120
AAGTTTAAGGTACTGCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCCAGGGGTTCGATAGAAGGAGGATTTCAGCTTTGCCCAAGAATGTCTAACCAGGCGCAGATGCAGTTC
+SRR9167437.2 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10000:6480 length=120
/BBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBF##BBBBBFFFBFFFF/FFFFFFFFFBFF<FFFFFF<BFFFFFFFFFB<BFF<BBBBBFFFFFBBBBBFFFF<
@SRR9167437.3 700175F:CAPTEANXX170817:CAPTEANXX:1:1101:10001:25061 length=120
TTCCGGTTGATCGCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTAAACATATTTTTAATGCATACTTAAGTAATATTTAAGAAACTAAACAAACCAGGCGCAGATGCAGTTC

I'm wondering why it generates 4 files.

Is there any way to split the paired-end data into read 1 and read 2?

Thanks!


Judging by the length of reads in fastqs 3 and 4, my guess would be that they had barcodes and/or UMIs present in the forward and/or reverse adapters that got sequenced. In order to definitively answer this you would need to know the library prep kit or sequencing adapter structure. That info is sometimes included in the GEO submission, and should be in the paper.
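
One quick way to check the read-length guess is to tabulate the read lengths in each file. This is only a rough sketch: it assumes plain 4-line FASTQ records, and `read_length_table` is a made-up helper name, not part of the SRA Toolkit.

```shell
# Hypothetical helper: count how often each read length occurs among the
# first N records (default 1000) of a 4-line-per-record FASTQ file.
# Output: one line per length, "<length><TAB><count>", sorted by length.
read_length_table() {
  fastq_file="$1"
  max_records="${2:-1000}"
  awk -v n="$max_records" '
    NR % 4 == 2 {                 # sequence line of each record
      count[length($0)]++
      if (++seen >= n) exit       # stop after n records (END still runs)
    }
    END {
      for (len in count) print len "\t" count[len]
    }
  ' "$fastq_file" | sort -n
}
```

It could be called as `for f in SRR9167437_*.fastq; do echo "$f"; read_length_table "$f"; done`. Going by the head output in the question, files 1 and 2 should show 50-base reads and files 3 and 4 should show 10-base reads, which would fit short barcode or UMI segments.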


Thanks rpolicastro. I read the SRA page, where the authors of the paper annotated the data: _3.fastq and _4.fastq are actually barcodes.

Thanks again! Baoqiang.
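
Since the extra files are barcodes, SRR9167437_1.fastq and SRR9167437_2.fastq from `--split-files` can be used directly as read 1 and read 2. If only the unsplit file is at hand, the sketch below cuts each record into two mates by fixed length. It assumes the layout seen in the unsplit output above (50 bases of read 1, then 50 bases of read 2, then the barcode segments) and plain 4-line FASTQ records; `split_mates` and its arguments are hypothetical names.

```shell
# Hypothetical helper: split a concatenated FASTQ (read 1 + read 2 + extras)
# into two mate files by fixed read lengths. Anything after len1 + len2
# (here, the barcode segments) is dropped.
# Usage: split_mates in.fastq out_1.fastq out_2.fastq [len1] [len2]
split_mates() {
  src="$1"; mate1="$2"; mate2="$3"
  len1="${4:-50}"; len2="${5:-50}"
  awk -v l1="$len1" -v l2="$len2" -v o1="$mate1" -v o2="$mate2" '
    NR % 2 == 1 {                       # header (@...) and separator (+...) lines
      print > o1; print > o2; next      # copied unchanged to both files
    }
    {                                   # sequence and quality lines
      print substr($0, 1, l1)      > o1
      print substr($0, l1 + 1, l2) > o2
    }
  ' "$src"
}
```

For example, `split_mates SRR9167437.fastq R1.fastq R2.fastq 50 50`. Note that header lines are copied as they are, so a `length=120` annotation will no longer match the trimmed reads. With the SRA Toolkit itself, `fastq-dump --split-files --skip-technical` may avoid writing the barcode files, but only if the submitter flagged those segments as technical reads, which I have not checked for this run.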


A couple of points:

  1. Please don't use bold for text that doesn't really need emphasis - it takes away from the effect.
  2. Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
