Is there a way to know sequence lengths when querying ENA programmatically?
6.4 years ago
Pavel Senin ★ 1.9k

Hi folks:

I am trying to get raw RNA-Seq data from ENA and wonder how to find out the length of the sequences in those FASTQ files without downloading them.

I built the query following their guide and used "nominal_length>XXX" in it, but this filtering mechanism doesn't work...

Thanks!

RNA-Seq ENA • 1.8k views
6.4 years ago
Renesh ★ 2.0k

Get the total base count and the number of reads from ENA, then do the calculation below:

Read length = total bases / number of reads

That works, but you also need to divide by 2 for paired-end libraries :)
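
The calculation, including the paired-end correction, could be sketched roughly like this. It is only an illustration: the function name and arguments are made up here, and it assumes the read count reported for a paired-end run counts pairs rather than individual mates, which is what the divide-by-2 advice implies. The result is an average, so trimmed or variable-length reads will not show up as such.

```python
def mean_read_length(base_count, read_count, paired=False):
    """Estimate the average read length from run-level totals.

    base_count: total number of bases reported for the run
    read_count: number of reads reported for the run
                (assumed to count pairs when the library is paired-end)
    paired:     True for paired-end libraries
    """
    if read_count <= 0:
        raise ValueError("read_count must be positive")
    # Each paired-end entry contributes two mates, so divide by 2 as well.
    mates_per_read = 2 if paired else 1
    return base_count / (read_count * mates_per_read)


if __name__ == "__main__":
    # Made-up totals, only to show the call.
    print(mean_read_length(1000, 10))
    print(mean_read_length(2000, 10, paired=True))
```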

6.4 years ago
Chris S. ▴ 310

The paired nominal length option does not seem to work in the ENA query builder, or at least it doesn't autofill into the query box, but you can add it to the URL string. For example, in the query builder, select domain = read and library layout = Paired, and optionally a taxon name like Bacteria; you then get this query:

tax_eq(2) AND library_layout="PAIRED"

Hit search, then add the nominal length condition to the URL string to get this query:

tax_eq(2) AND library_layout="PAIRED" AND nominal_length>300
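
If you are doing this from a script, the query text above can be assembled and URL-encoded along these lines. This is a sketch under assumptions: the helper names are made up, and the "query" parameter name is a guess that should be checked against the URL the query builder actually produces. No base URL is included because it is not given here.

```python
from urllib.parse import urlencode


def build_query(taxon_id, layout="PAIRED", min_nominal_length=None):
    """Assemble the query text in the same form as shown above."""
    parts = [f"tax_eq({taxon_id})", f'library_layout="{layout}"']
    if min_nominal_length is not None:
        parts.append(f"nominal_length>{min_nominal_length}")
    return " AND ".join(parts)


def build_query_string(query, **extra_params):
    """URL-encode the query plus any extra parameters.

    The parameter name "query" is an assumption; adjust it to match
    the URL you see after hitting search.
    """
    return urlencode({"query": query, **extra_params})


if __name__ == "__main__":
    q = build_query(2, min_nominal_length=300)
    print(q)
    print(build_query_string(q, result="read_experiment"))
```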

The nominal_length is just a filter, so you may need to check the read experiment XML for the actual length and standard deviation.

UPDATE: add &result=read_experiment&display=xml&download=txt to the URL and parse NOMINAL_LENGTH from the summary XML.
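
Parsing NOMINAL_LENGTH out of the XML might look something like the sketch below. The sample document is hand-written for illustration (the accession and values are made up, and the root element of a real download may differ); it assumes the usual experiment layout where NOMINAL_LENGTH and NOMINAL_SDEV are attributes of the PAIRED element under LIBRARY_LAYOUT.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, not real ENA output.
SAMPLE_XML = """<ROOT>
  <EXPERIMENT accession="EXAMPLE1">
    <DESIGN>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_LAYOUT>
          <PAIRED NOMINAL_LENGTH="300" NOMINAL_SDEV="25"/>
        </LIBRARY_LAYOUT>
      </LIBRARY_DESCRIPTOR>
    </DESIGN>
  </EXPERIMENT>
</ROOT>"""


def _to_float(value):
    """Convert an attribute value to float, keeping missing values as None."""
    return float(value) if value not in (None, "") else None


def nominal_lengths(xml_text):
    """Map experiment accession -> (nominal length, nominal stdev)."""
    root = ET.fromstring(xml_text)
    result = {}
    for experiment in root.iter("EXPERIMENT"):
        paired = experiment.find(".//LIBRARY_LAYOUT/PAIRED")
        if paired is None:
            continue  # single-end or layout not described
        result[experiment.get("accession")] = (
            _to_float(paired.get("NOMINAL_LENGTH")),
            _to_float(paired.get("NOMINAL_SDEV")),
        )
    return result


if __name__ == "__main__":
    print(nominal_lengths(SAMPLE_XML))
```

Experiments that omit the attributes come back as None rather than raising, since submitters do not always fill them in.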

 

 


Also accepted. XML is a better way to get the data, though parsing and debugging it is a pain.
