Question

How to determine if SRA file is single or paired end?

17

9.4 years ago

davedeto ▴ 250

I want to run a batch script to align reads from a bunch of different samples in a GEO accession. Some are single-end and some are paired-end, but the metadata in the series matrix file does not indicate which is which. I can manually convert to FASTQ and inspect the files to determine it, but I'd like to find an automated way to do this. I know that the SRA file must store metadata describing where the split should occur, but I can't figure out how to get at it. The only thing that looks like it might be what I want is the sra-stat program in the SRA Toolkit; however, I can't find any documentation on its output, and the default text output is just a cryptic series of numbers separated by colons and pipes.

I could always run sra-stat with the -s option, output as XML, and find the answer there, but this requires scanning the entire file, which takes a while. I could also just run fastq-dump with the --split-files option and check whether I get one or two files as a result, but this also seems like a bit of a hack. Is there a better way?

It feels like there should be some header information in the file that I could quickly access.

sequencing • 18k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.4 years ago by davedeto ▴ 250

score 10 · Answer 1 · 2015-04-23

10

9.4 years ago

Kamil ★ 2.3k

You might be interested to try my script:

ADD COMMENT • link 9.4 years ago by Kamil ★ 2.3k

4

Using the --split-spot option of fastq-dump is a great idea. Your approach above is definitely good, but I think davedeto has a slightly simpler solution, which I incorporate here:

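srr="SRR3184279"
# Dump only the first spot (-X 1) to stdout (-Z). --split-spot separates the mates,
# so a single-end run prints 4 FASTQ lines and a paired-end run prints 8.
numLines=$(fastq-dump -X 1 -Z --split-spot "$srr" | wc -l)
if [ "$numLines" -eq 4 ]
then
  echo "$srr is single-end"
elif [ "$numLines" -eq 8 ]
then
  echo "$srr is paired-end"
else
  # Anything else (e.g. 0 lines after a failed download) is not a valid answer
  echo "could not determine layout of $srr (got $numLines lines)" >&2
fi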

ADD REPLY • link 7.3 years ago by jabelsky ▴ 40

0

cool! very simplified, thanks ! :)

ADD REPLY • link 5.6 years ago by Geparada ★ 1.5k

score 6 · Answer 2 · 2015-04-24

6

9.4 years ago

davedeto ▴ 250

Kamil's suggestion to just use -X 1 and look at the first read was great! Thanks.

I made this into a Python function and thought I'd share it in case anyone else wants to use it.
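The function itself is not reproduced in this copy of the thread. What follows is only a minimal sketch of the idea, not the original code. It assumes fastq-dump from the SRA Toolkit is on the PATH, and the names `layout_from_line_count` and `sra_layout` are made up for illustration. It dumps the first spot with --split-spot and counts FASTQ lines: 4 lines means one read per spot, 8 means two.

```python
import subprocess


def layout_from_line_count(num_lines):
    """Map the FASTQ line count of one split spot to a layout name."""
    if num_lines == 4:   # one read  = 4 FASTQ lines
        return "single"
    if num_lines == 8:   # two reads = 8 FASTQ lines
        return "paired"
    # 0 lines (failed dump) or extra technical reads end up here
    raise ValueError(f"unexpected line count for one spot: {num_lines}")


def sra_layout(accession_or_path):
    """Return 'single' or 'paired' for an SRA accession or a local .sra file."""
    # -X 1: first spot only; -Z: write to stdout; --split-spot: separate the mates
    result = subprocess.run(
        ["fastq-dump", "-X", "1", "-Z", "--split-spot", accession_or_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if fastq-dump exits non-zero
    )
    return layout_from_line_count(len(result.stdout.splitlines()))
```

Calling `sra_layout("SRR3184279")` should then return one of the two strings. Treating any other line count as an error, rather than defaulting to paired-end, keeps a failed download from being misreported.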

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.4 years ago by davedeto ▴ 250

0

Hi @davedeto,

I liked your Python script for determining whether a run is single-end or paired-end. I am very new to Python, and in my case I have FASTQ files generated by Illumina sequencing. The paired-end read files are named samplename_1_sequence.txt.gz and the single-end read files are named samplename_sequence.txt.gz. If I want to apply the above script to my filenames, how would I change it?

Kindly guide me

ADD REPLY • link 5.2 years ago by EVR ▴ 610
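For the naming convention described in this reply, the layout can probably be read straight from the filename without any SRA tooling. The sketch below is illustrative only. It assumes the second mate of a pair is named samplename_2_sequence.txt.gz, which the reply does not state, and the function name `layout_from_filename` is hypothetical.

```python
import os
import re

# Assumed convention: <sample>_1_sequence.txt.gz / <sample>_2_sequence.txt.gz are mates
PAIRED_RE = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])_sequence\.txt\.gz$")
# <sample>_sequence.txt.gz is a single-end file
SINGLE_RE = re.compile(r"^(?P<sample>.+)_sequence\.txt\.gz$")


def layout_from_filename(filename):
    """Return (sample_name, 'paired' or 'single') based only on the file name."""
    name = os.path.basename(filename)
    # Check the paired pattern first: it is the more specific of the two.
    # Caveat: a single-end sample whose own name ends in _1 or _2 would be misread.
    match = PAIRED_RE.match(name)
    if match:
        return match.group("sample"), "paired"
    match = SINGLE_RE.match(name)
    if match:
        return match.group("sample"), "single"
    raise ValueError(f"file name does not follow the expected convention: {name}")
```

A batch script could call this on each file and, for paired samples, confirm that both mate files exist before starting the alignment.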

score 3 · Answer 3 · 2016-09-09

3

8.0 years ago

zpliu ▴ 60

Another simple way is to look up the SRR ID of your sample in the SRA Run Browser: http://trace.ncbi.nlm.nih.gov/Traces/sra/

"Browse" -> "Run Browser" -> then input your ID

The LAYOUT field will tell you. Also, the 'Reads' label shows 1 read for single-end and 2 reads for paired-end.
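For many runs, the same layout information can be read from a run metadata table instead of the browser. The sketch below assumes you already have a "runinfo" CSV saved as text, for example from the SRA Run Selector or from NCBI EDirect (`esearch -db sra -query <accession> | efetch -format runinfo`), and that it has `Run` and `LibraryLayout` columns. The function name `layouts_from_runinfo` is made up for illustration.

```python
import csv
import io


def layouts_from_runinfo(csv_text):
    """Map each run accession to 'single' or 'paired' from runinfo CSV text."""
    layouts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        run = row.get("Run")
        layout = row.get("LibraryLayout")
        # Concatenated runinfo output can repeat the header line; skip those rows
        if not run or run == "Run" or not layout:
            continue
        layouts[run] = layout.lower()  # values are reported as SINGLE / PAIRED
    return layouts
```

Because this relies on the submitter-provided metadata rather than the reads themselves, it may be worth spot-checking a few runs with the fastq-dump approach from the other answers.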

ADD COMMENT • link 8.0 years ago by zpliu ▴ 60