How to determine if SRA file is single or paired end?
7.4 years ago
davedeto ▴ 180

I want to run a batch script to align reads from a number of different samples in a GEO accession. Some are single-end and some are paired-end, but the metadata in the series matrix file does not indicate which is which. I can manually convert to fastq and inspect the files to determine it, but I'd like to find an automated way to do this. I know that the SRA file must have metadata stored in it to explain where the split should occur, but I can't figure out how to get at it. The only thing that looks like it might be what I want is the sra-stat program in the SRA Toolkit; however, I can't find any documentation on its output, and the default text output is just a cryptic series of numbers separated by colons and pipes.

I could always run sra-stat with the -s option, output as XML, and find the answer there, but this requires the routine to go through the entire file, which takes a while. I could also just run fastq-dump with the --split-files option and look to see if I get one or two files as a result, but this also seems like a bit of a hack. Is there a better way?

It feels like there should be some header information in the file that I could quickly access.

7.4 years ago
Kamil ★ 2.1k

You might be interested to try my script:

Using the --split-spot option of fastq-dump is a great idea. Your approach above is definitely good, but I think davedeto has a slightly simpler solution, which I incorporate here:

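srr="SRR3184279"
# Dump only the first spot (-X 1) to stdout (-Z), split into its individual reads.
numLines=$(fastq-dump -X 1 -Z --split-spot "$srr" | wc -l)
# One FASTQ record is 4 lines, so 4 lines means one read per spot.
if [ "$numLines" -eq 4 ]; then
  echo "$srr is single-end"
else
  echo "$srr is paired-end"
fi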

Cool! Much simpler, thanks! :)

7.4 years ago
davedeto ▴ 180

Kamil's suggestion to just use -X 1 and look at the first read was great! Thanks

I made this into a Python function and thought I'd share it in case anyone else wants to use it.
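The function itself does not appear in this copy of the thread. As a rough sketch only (not davedeto's original code), the same check could look like the following in Python. The function names `layout_from_fastq` and `sra_layout` are hypothetical, and the sketch assumes `fastq-dump` is on the PATH and is called with the same `-X 1 -Z --split-spot` options as the shell snippet above.

```python
import subprocess


def layout_from_fastq(fastq_text):
    """Classify the FASTQ text of one dumped spot by its line count.

    One FASTQ record is 4 lines, so 4 lines = one read, 8 lines = two reads.
    """
    n_lines = len(fastq_text.splitlines())
    if n_lines == 4:
        return "single"
    if n_lines == 8:
        return "paired"
    raise ValueError(f"unexpected FASTQ line count: {n_lines}")


def sra_layout(srr):
    """Dump only the first spot of a run and classify it (hypothetical helper)."""
    result = subprocess.run(
        ["fastq-dump", "-X", "1", "-Z", "--split-spot", srr],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if fastq-dump fails
    )
    return layout_from_fastq(result.stdout)
```

Something like `sra_layout("SRR3184279")` would then return `"single"` or `"paired"`. Unlike the shell version, which treats anything other than 4 lines as paired-end, this sketch raises an error on any line count other than 4 or 8 rather than guessing.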

Hi @davedeto,

I liked your Python script for determining whether a run is single-end or paired-end. I am very new to Python, and in my case I have fastq files generated from Illumina sequencing. The paired-end read files are named samplename_1_sequence.txt.gz and the single-end read files are named samplename_sequence.txt.gz. If I want to apply the above script to my filenames, how would I change it?

Kindly guide me
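For files that are already in fastq format, the layout can be read from the filenames alone, without fastq-dump. The following is a minimal sketch, assuming exactly the naming scheme described in the question; the second mate name `samplename_2_sequence.txt.gz` is an assumption, and `PATTERN` and `layouts_by_sample` are hypothetical names.

```python
import re

# Assumed naming scheme (from the question):
#   paired-end: <sample>_1_sequence.txt.gz (and, presumably, <sample>_2_sequence.txt.gz)
#   single-end: <sample>_sequence.txt.gz
PATTERN = re.compile(r"^(?P<sample>.+?)(?:_(?P<mate>[12]))?_sequence\.txt\.gz$")


def layouts_by_sample(filenames):
    """Map each sample name to "single" or "paired" based on its filenames."""
    mates = {}
    for name in filenames:
        match = PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        # "mate" is None when the filename carries no _1/_2 suffix
        mates.setdefault(match.group("sample"), set()).add(match.group("mate"))
    return {
        sample: "single" if found == {None} else "paired"
        for sample, found in mates.items()
    }
```

For example, `layouts_by_sample(os.listdir("."))` (after `import os`) would classify every matching file in the current directory. One caveat: a single-end sample whose own name ends in `_1` or `_2` would be misread as paired-end, so the pattern would need adjusting for such names.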

6.1 years ago
zpliu ▴ 50

Another simple way is to check the SRR ID of your sample in SRA Run Browser: http://trace.ncbi.nlm.nih.gov/Traces/sra/

"Browse" -> "Run Browser" -> then input your ID

The LAYOUT result will tell you. Also, the 'Reads' label shows 1 read for single end, and 2 reads for paired end.