Question: how do I split fastq from SRA?
19 months ago, jaqx008 wrote:

Hello everyone. I was going to run an analysis on sequencing data available on NCBI for various developmental stages of an organism. When I downloaded the data, some of it came in twos: for instance, blastula stage data 1 and blastula data 2. I don't know whether both files are the same thing, even though they differ slightly in size (e.g. 3 GB and 3.5 GB respectively). Secondly, when I download the files, they don't come as SRA but as fastq, and I assume fastq-dump takes SRA and not fastq. Is this normal, and what command can I use to split the fastq into forward and reverse reads so I can run Bowtie2 on them? Below is the command I'm using for the split, and I hope it is the correct command. Thanks.

fastq-dump --split-3 blastula2.fastq

modified 19 months ago by piyushjo • written 19 months ago by jaqx008

for some reason, some of the data occurred in twos, for instance, Blastula stage data 1 blastula data two.

That probably has nothing to do with the sequence data. It may have to do with the actual experiment, since those two things seem to refer to two stages of blastula.

Where possible, search EBI-ENA with accession IDs so you can download the fastq files directly without having to worry about SRA and fastq-dump.
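
For anyone who wants to script the ENA lookup, here is a minimal sketch. It assumes the ENA Portal API `filereport` endpoint and its `run_accession`, `library_layout` and `fastq_ftp` fields; `ena_filereport_url` is a made-up helper name, not part of any tool, and the accession in the usage line is only an example to replace with your own run ID.

```shell
# Build the ENA Portal API "filereport" URL for one run accession.
# Assumption: the endpoint and field names below are the ones ENA exposes.
ena_filereport_url() {
  # $1 = run accession (SRR, ERR or DRR)
  printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=run_accession,library_layout,fastq_ftp\n' "$1"
}

# Usage (needs network access, so it is left commented out):
# curl -s "$(ena_filereport_url DRR032679)"
```

If the assumptions hold, the response should be a small tab-separated table giving the library layout and the FTP path(s) of the fastq file(s), which also tells you whether there is one file or two.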

modified 19 months ago • written 19 months ago by genomax

But the files I downloaded are already in fastq. I just need to split them into forward-read and reverse-read files to use as input for Bowtie 2. Any ideas on how I can achieve this?

written 19 months ago by jaqx008

I just need to split it into forward reads and reverse reads file to use as input for bowtie 2.

Are you sure about that? If you have reads in interleaved format, then reformat.sh from the BBMap suite can be used to separate the R1 and R2 reads.

reformat.sh in=interleaved.fq out1=R1.fq out2=R2.fq

written 19 months ago by genomax

So I ran this and only R2.fq was made. Does that mean my reads are not paired-end? How do I know whether the original fastq is paired-end or not? And if generating only R2.fq means the reads are not paired-end, could the second fastq file (named blastula1) on NCBI be the second read? It looks like this on the NCBI SRA site:

Submitted by: UT-BS
Study: EXPANDE project
PRJDB3785 • DRP003810 • All experiments • All runs
Sample: Bf_blastula_1
SAMD00028076 • DRS049884 • All experiments • All runs
Organism: Branchiostoma floridae
Library:
Name: Bf_blastula_1
Instrument: Illumina HiSeq 2000
Strategy: RNA-Seq
Source: TRANSCRIPTOMIC
Selection: other
Layout: SINGLE
Construction protocol: Total RNA (QIAGEN RNeasy) followed by TruSeq
Spot descriptor:
1  forward

Runs: 1 run, 56.1M spots, 5.7G bases, 3.3Gb
modified 19 months ago • written 19 months ago by jaqx008

This is indeed a single-end dataset. Confirmed by a single fastq available from ENA.
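
If you want to check a fastq file yourself, here is a rough heuristic sketch. It only looks at the first 1000 records and calls the file interleaved when every pair of consecutive records shares a read name (ignoring a trailing /1 or /2). `looks_interleaved` is a hypothetical helper name, the input is assumed to be an uncompressed four-line-per-record fastq, and header conventions vary between sources, so treat the answer as a hint rather than proof.

```shell
# Heuristic: does a FASTQ file look interleaved (R1 and R2 records alternating)?
looks_interleaved() {
  # $1 = uncompressed FASTQ file; only the first 1000 records are inspected
  head -n 4000 "$1" | awk '
    NR % 4 == 1 {                    # header line of each 4-line record
      name = $1                      # keep the name, drop any comment after it
      sub(/^@/, "", name)
      sub(/\/[12]$/, "", name)       # drop a trailing /1 or /2 mate suffix
      rec++
      if (rec % 2 == 1) first = name # odd record: remember its name
      else if (name != first) { mismatch = 1; exit }
    }
    END {
      if (rec == 0 || rec % 2 == 1 || mismatch)
        print "single-end or not interleaved"
      else
        print "interleaved"
    }
  '
}
```

Usage would be along the lines of `looks_interleaved blastula2.fastq`. The library layout recorded in the SRA/ENA metadata remains the more reliable source.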

written 19 months ago by genomax

I am at roughly an intermediate level in this field. So I assume the second file, the one that says blastula_2, would be the second or reverse read, and blastula_1 would be the forward read?

written 19 months ago by jaqx008

Post the run accession (SRR #) for the blastula_2 data so I can check.

Looking at the project listing on ENA they all appear to be single-end datasets. blastula_1 and blastula_2 could be biological replicates but I doubt they are two parts of paired-end reads.

written 19 months ago by genomax

It is true that they might be replicates, because some of the developmental stages have more than two files (X_1, X_2, X_3, ...), which suggests replicates. If that is the case, can I use the single-end reads for Bowtie2 mapping? Below is the second file's page:

Submitted by: UT-BS
Study: EXPANDE project
PRJDB3785 • DRP003810 • All experiments • All runs
Sample: Bf_blastula_1
SAMD00028076 • DRS049884 • All experiments • All runs
Organism: Branchiostoma floridae
Library:
Name: Bf_blastula_1
Instrument: Illumina HiSeq 2000
Strategy: RNA-Seq
Source: TRANSCRIPTOMIC
Selection: other
Layout: SINGLE
Construction protocol: Total RNA (QIAGEN RNeasy) followed by TruSeq
Spot descriptor:
1  forward

Runs: 1 run, 56.1M spots, 5.7G bases, 3.3Gb
Run         # of Spots   # of Bases   Size    Published
DRR032679   56,083,328   5.7G         3.3Gb   2017-09-20

written 19 months ago by jaqx008

This is the same sample page as you posted before.

written 19 months ago by genomax

My bad

Submitted by: UT-BS
Study: EXPANDE project
PRJDB3785 • DRP003810 • All experiments • All runs
Sample: Bf_blastula_2
SAMD00028077 • DRS049885 • All experiments • All runs
Organism: Branchiostoma floridae
Library:
Name: Bf_blastula_2
Instrument: Illumina HiSeq 2000
Strategy: RNA-Seq
Source: TRANSCRIPTOMIC
Selection: other
Layout: SINGLE
Construction protocol: Total RNA (QIAGEN RNeasy) followed by TruSeq
Spot descriptor:
1  forward

Runs: 1 run, 57.1M spots, 5.8G bases, 3.4Gb
Run         # of Spots   # of Bases   Size    Published
DRR032680   57,063,175   5.8G         3.4Gb   2017-09-20

written 19 months ago by jaqx008

19 months ago, piyushjo wrote:

Use the following command; I have used it and it works for me (SRR565555 is an example accession, substitute your own SRR or ERR):

fastq-dump --origfmt --gzip --split-files SRR565555
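
A rough way to read the result afterwards, assuming the usual fastq-dump output naming with these options (ACC_1.fastq.gz, plus ACC_2.fastq.gz only when the run has a second read). `layout_from_split_files` is a hypothetical helper name, and runs that carry extra technical reads may not follow this simple pattern.

```shell
# Guess the layout of a run from the files that --split-files --gzip left behind.
layout_from_split_files() {
  # $1 = run accession, $2 = directory holding the output (default: current dir)
  if [ -e "${2:-.}/${1}_2.fastq.gz" ]; then
    echo "paired"
  elif [ -e "${2:-.}/${1}_1.fastq.gz" ]; then
    echo "single"
  else
    echo "no split output found"
  fi
}
```

For example, `layout_from_split_files SRR565555` run in the download directory prints `single` when only the `_1` file exists.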

written 19 months ago by piyushjo

But the files I downloaded are already in fastq. I just need to split them into forward-read and reverse-read files to use as input for Bowtie 2.

written 19 months ago by jaqx008

If the data were paired-end, the reads would be split into two files; if the reads aren't paired, they won't be split.
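
Single-end reads need no splitting before alignment: Bowtie 2 takes unpaired reads through its `-U` option. A minimal sketch follows, where `align_single_end` is a hypothetical wrapper name and the index prefix is assumed to have been built beforehand with bowtie2-build.

```shell
# Align one single-end FASTQ with Bowtie 2.
align_single_end() {
  # $1 = index prefix (from bowtie2-build), $2 = FASTQ file, $3 = output SAM
  bowtie2 -x "$1" -U "$2" -S "$3"
}
```

Usage would look like `align_single_end bf_index blastula_1.fastq blastula_1.sam`, with all three names being placeholders. If blastula_1 and blastula_2 are replicates, each file would be aligned on its own in the same way.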

written 19 months ago by piyushjo