How to determine if SRA file is single or paired end?
7.6 years ago
davedeto ▴ 200

I want to run a batch script to align reads from a number of different samples in a GEO accession. Some are single-end and some are paired-end, but the metadata in the series matrix file does not indicate which is which. I can manually convert to fastq and inspect the files to find out, but I'd like an automated way to do this. I know the SRA file must store metadata describing where the split should occur, but I can't figure out how to get at it. The only thing that looks like it might do what I want is the sra-stat program in the SRA Toolkit. However, I can't find any documentation on its output, and the default text output is just a cryptic series of numbers separated by colons and pipes.

I could always run sra-stat with the -s option, output as XML, and find the answer there, but this requires the routine to go through the entire file, which takes a while. I could also just run fastq-dump with the --split-files option and look to see if I get one or two files as a result, but this also seems like a bit of a hack. Is there a better way?

It feels like there should be some header information in the file that I could quickly access.

sequencing • 14k views
7.6 years ago
Kamil ★ 2.1k

You might be interested in trying my script:


Using the --split-spot option of fastq-dump is a great idea. Your approach above definitely works, but I think davedeto has a slightly simpler solution, which I have incorporated here:

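srr="SRR3184279"
# Dump only the first spot (-X 1) to stdout (-Z), split into its individual reads
numLines=$(fastq-dump -X 1 -Z --split-spot "$srr" | wc -l)
# A single read yields 4 FASTQ lines; two reads yield 8
if [ "$numLines" -eq 4 ]
then
  echo "$srr is single-end"
else
  echo "$srr is paired-end"
fi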

Cool! Much simpler, thanks! :)

7.6 years ago
davedeto ▴ 200

Kamil's suggestion to just use -X 1 and look at the first read was great! Thanks.

I made this into a Python function and thought I'd share it in case anyone else wants to use it.
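
A minimal sketch of that idea in Python (not the original function) might look like the following. It assumes fastq-dump from the SRA Toolkit is on your PATH. The names layout_from_fastq and sra_layout are made up for illustration. Runs that carry extra technical reads (barcodes, for example) may produce more than two reads per spot, so this version raises an error instead of guessing.

```python
import subprocess


def layout_from_fastq(fastq_text):
    """Classify one spot dumped with --split-spot by its FASTQ line count."""
    n_lines = len(fastq_text.splitlines())
    if n_lines == 4:   # one read: header, sequence, '+', qualities
        return "single"
    if n_lines == 8:   # two reads of four lines each
        return "paired"
    raise ValueError(f"unexpected FASTQ line count: {n_lines}")


def sra_layout(accession):
    """Dump only the first spot of a run and report its layout.

    Assumption: fastq-dump is installed and can resolve the accession.
    """
    result = subprocess.run(
        ["fastq-dump", "-X", "1", "-Z", "--split-spot", accession],
        check=True,
        capture_output=True,
        text=True,
    )
    return layout_from_fastq(result.stdout)
```

Calling sra_layout("SRR3184279") would run the same command as the shell snippet earlier in the thread. Keeping the line counting in its own function makes it easy to check without network access.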


Hi @davedeto,

I liked your Python script for telling single-end from paired-end runs. I am very new to Python, and in my case I have fastq files generated by Illumina sequencing. The paired-end read files are named samplename_1_sequence.txt.gz and the single-end read files are named samplename_sequence.txt.gz. If I want to apply the above script to my filenames, how should I change it?

Kindly guide me.

6.2 years ago
zpliu ▴ 50

Another simple way is to check the SRR ID of your sample in SRA Run Browser: http://trace.ncbi.nlm.nih.gov/Traces/sra/

"Browse" -> "Run Browser" -> then input your ID

The LAYOUT result will tell you. Also, the 'Reads' label shows 1 read for single end, and 2 reads for paired end.
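
For a whole batch of runs, the same layout information is also available in the SRA "runinfo" CSV, which has Run and LibraryLayout columns (values such as SINGLE or PAIRED). Here is a rough sketch, assuming you have already saved that CSV text, for example with Entrez Direct's efetch -format runinfo. The function name layouts_from_runinfo is hypothetical.

```python
import csv
import io


def layouts_from_runinfo(runinfo_csv):
    """Map each run accession to its LibraryLayout from runinfo CSV text."""
    reader = csv.DictReader(io.StringIO(runinfo_csv))
    layouts = {}
    for row in reader:
        run = row.get("Run")
        # Skip blank rows and header lines repeated in concatenated exports
        if not run or run == "Run":
            continue
        layouts[run] = row["LibraryLayout"]
    return layouts
```

This avoids downloading any reads at all, but it relies on the submitter's metadata being right, so the fastq-dump check is still useful as a sanity test.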
