Question

how do I split fastq from SRA?

0

5.6 years ago

jaqx008 ▴ 110

Hello everyone. I was going to run an analysis on sequencing data available on NCBI for various developmental stages of an organism. I downloaded the data and, for some reason, some of the data occurred in twos, for instance blastula stage data 1 and blastula data 2. I don't know whether both files are the same thing, although they differ slightly in size (e.g. 3 GB and 3.5 GB respectively). Secondly, when I download the files they come as FASTQ rather than SRA, and I assume fastq-dump takes SRA and not FASTQ. Is this normal, and what command can I use to split the FASTQ into forward and reverse reads so I can run Bowtie2 on them? Below is the command I'm using for the split; I hope it is the correct command. Thanks

fastq-dump --split-3 blastula2.fastq

RNA-Seq fastq-dump SRA split-fastq • 5.6k views

ADD COMMENT • link updated 5.6 years ago by piyushjo ▴ 700 • written 5.6 years ago by jaqx008 ▴ 110

0

for some reason, some of the data occurred in twos, for instance, Blastula stage data 1 blastula data two.

That probably has nothing to do with the sequence data. It may have to do with the actual experiment, since those two names seem to refer to two blastula samples.

Where possible, search EBI-ENA with the accession IDs so you can download the fastq files directly without having to worry about SRA and fastq-dump.

ADD REPLY • link 5.6 years ago by GenoMax 142k

0

But the files I downloaded are already in fastq. I just need to split it into forward reads and reverse reads file to use as input for bowtie 2. Any ideas on how I can achieve this?

ADD REPLY • link 5.6 years ago by jaqx008 ▴ 110

0

I just need to split it into forward reads and reverse reads file to use as input for bowtie 2.

Are you sure about that? If you have reads in interleaved format, then reformat.sh from the BBMap suite can be used to separate the R1 and R2 reads.

reformat.sh in=interleaved.fq out1=R1.fq out2=R2.fq

ADD REPLY • link 5.6 years ago by GenoMax 142k

0

So I ran this and only R2.fq was made. Does that mean my reads are not paired-end? How do I know whether the original fastq is paired-end or not? And if generating only R2.fq means the reads are not paired-end, could the second fastq file (named blastula1) on NCBI be the second read? It looks like this on the NCBI SRA site.

Submitted by: UT-BS
Study: EXPANDE project
PRJDB3785 • DRP003810 • All experiments • All runs
show Abstract
Sample: Bf_blastula_1
SAMD00028076 • DRS049884 • All experiments • All runs
Organism: Branchiostoma floridae
Library:
Name: Bf_blastula_1
Instrument: Illumina HiSeq 2000
Strategy: RNA-Seq
Source: TRANSCRIPTOMIC
Selection: other
Layout: SINGLE
Construction protocol: Total RNA (QIAGEN RNeasy) followed by TruSeq
Spot descriptor:
1  forward

Runs: 1 run, 56.1M spots, 5.7G bases, 3.3Gb

ADD REPLY • link 5.6 years ago by jaqx008 ▴ 110

0

This is indeed a single-end dataset. Confirmed by a single fastq available from ENA.
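
For a file that is already on disk, a rough check is to compare the names of consecutive records: in an interleaved file every odd/even pair of records shares a read name (optionally ending in /1 and /2), while a single-end file has a different name on every record. A minimal sketch, assuming an uncompressed FASTQ with 4-line records; `fastq_layout` is a hypothetical helper written for this thread, not part of any toolkit, and it only inspects the first 1000 records.

```shell
# Guess whether a FASTQ file is interleaved by comparing the names of
# consecutive records. Hypothetical helper; a heuristic, not a validator.
fastq_layout() {
    # $1: path to an uncompressed FASTQ file with 4-line records
    head -n 4000 "$1" | awk '
        NR % 4 == 1 {
            name = $1                  # read ID without the comment field
            sub(/\/[12]$/, "", name)   # drop a trailing /1 or /2 if present
            n++
            if (n % 2 == 0 && name == prev) pairs++
            prev = name
        }
        END {
            # interleaved only if every odd/even pair of records matched
            if (n >= 2 && n % 2 == 0 && pairs == n / 2) print "interleaved"
            else print "single"
        }'
}
```

Calling `fastq_layout blastula2.fastq` prints either `interleaved` or `single`. The Layout field on the SRA run page remains the authoritative answer; this only inspects read names.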

ADD REPLY • link 5.6 years ago by GenoMax 142k

0

I am at roughly an intermediate level in this field. So I assume the second file, blastula_2, would be the second (reverse) read and blastula_1 would be the forward read?

ADD REPLY • link 5.6 years ago by jaqx008 ▴ 110

0

Post the run accession (SRR #) for the blastula_2 data so I can check.

Looking at the project listing on ENA they all appear to be single-end datasets. blastula_1 and blastula_2 could be biological replicates but I doubt they are two parts of paired-end reads.

ADD REPLY • link 5.6 years ago by GenoMax 142k

0

It is true that they might be replicates, because some of the developmental stages have more than two files (X_1, X_2, X_3 ...), which suggests replicates. If that is the case, can I use the single-end reads for bowtie2 mapping? Below is the second file page:

Submitted by: UT-BS
Study: EXPANDE project
PRJDB3785 • DRP003810 • All experiments • All runs
show Abstract
Sample: Bf_blastula_1
SAMD00028076 • DRS049884 • All experiments • All runs
Organism: Branchiostoma floridae
Library:
Name: Bf_blastula_1
Instrument: Illumina HiSeq 2000
Strategy: RNA-Seq
Source: TRANSCRIPTOMIC
Selection: other
Layout: SINGLE
Construction protocol: Total RNA (QIAGEN RNeasy) followed by TruSeq
Spot descriptor:
1  forward

Runs: 1 run, 56.1M spots, 5.7G bases, 3.3Gb
Run         # of Spots   # of Bases   Size    Published
DRR032679   56,083,328   5.7G         3.3Gb   2017-09-20

ADD REPLY • link 5.6 years ago by jaqx008 ▴ 110

0

This is the same sample page as you posted before.

ADD REPLY • link 5.6 years ago by GenoMax 142k

0

My bad

Submitted by: UT-BS
Study: EXPANDE project
PRJDB3785 • DRP003810 • All experiments • All runs
show Abstract
Sample: Bf_blastula_2
SAMD00028077 • DRS049885 • All experiments • All runs
Organism: Branchiostoma floridae
Library:
Name: Bf_blastula_2
Instrument: Illumina HiSeq 2000
Strategy: RNA-Seq
Source: TRANSCRIPTOMIC
Selection: other
Layout: SINGLE
Construction protocol: Total RNA (QIAGEN RNeasy) followed by TruSeq
Spot descriptor:
1  forward

Runs: 1 run, 57.1M spots, 5.8G bases, 3.4Gb
Run         # of Spots   # of Bases   Size    Published
DRR032680   57,063,175   5.8G         3.4Gb   2017-09-20
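
Both run pages list Layout: SINGLE, so each file can be aligned on its own as unpaired reads; bowtie2 takes unpaired input through -U rather than the -1/-2 pair. A minimal sketch, assuming a hypothetical index prefix `bf_index` and FASTQ files named after the run accessions; `map_single_end` and the `DRY_RUN` switch are illustrative names, not bowtie2 features.

```shell
# Align one single-end FASTQ per call; replicates are mapped separately.
# ASSUMPTIONS: the index prefix "bf_index" and the file names are hypothetical.
INDEX=${INDEX:-bf_index}

map_single_end() {
    # $1: FASTQ with unpaired reads; the SAM output is named after the input
    fq=$1
    out="${fq%.fastq}.sam"
    # -x: index prefix, -U: unpaired reads, -S: SAM output file
    # With DRY_RUN set, the command is printed instead of executed.
    ${DRY_RUN:+echo} bowtie2 -x "$INDEX" -U "$fq" -S "$out"
}

# Usage (uncomment once bowtie2 and the index are in place):
# for fq in DRR032679.fastq DRR032680.fastq; do map_single_end "$fq"; done
```

Setting `DRY_RUN=1` before the loop prints the commands so they can be checked before anything is run.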

ADD REPLY • link 5.6 years ago by jaqx008 ▴ 110

score 3 · Answer 1 · 2018-10-13

3

5.6 years ago

piyushjo ▴ 700

Use the following format; I have used it and it works for me.

fastq-dump --origfmt --gzip --split-files SRR565555

(SRR565555 is only an example; substitute your own SRR or ERR accession.)

ADD COMMENT • link 5.6 years ago by piyushjo ▴ 700

0

But the files I downloaded are already in fastq. I just need to split it into forward reads and reverse reads file to use as input for bowtie 2

ADD REPLY • link 5.6 years ago by jaqx008 ▴ 110

2

If the data were paired-end, the reads would be split into two files; if the reads aren't paired, they won't split.
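
One way to see which case applies after running fastq-dump with --split-files is to look at the files it wrote. A minimal sketch, assuming the outputs sit in the current directory and follow the usual `<accession>_1.fastq` / `<accession>_2.fastq` naming (with `.gz` appended when --gzip is used); `split_layout` is a hypothetical helper, not part of the SRA Toolkit.

```shell
# Report the layout implied by the files that --split-files produced.
# ASSUMPTION: output names are <accession>_1.fastq(.gz) and, for paired
# runs only, <accession>_2.fastq(.gz), in the current directory.
split_layout() {
    acc=$1
    if [ -e "${acc}_2.fastq" ] || [ -e "${acc}_2.fastq.gz" ]; then
        echo "paired"      # a second mate file exists
    elif [ -e "${acc}_1.fastq" ] || [ -e "${acc}_1.fastq.gz" ]; then
        echo "single"      # only one file was written
    else
        echo "missing"     # nothing found for this accession
    fi
}
```

For example, `split_layout DRR032680` would be expected to print `single` for the runs discussed in this thread.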

ADD REPLY • link 5.6 years ago by piyushjo ▴ 700