Question

command to extract SRA fastq data summary

0

Entering edit mode

8 months ago

Nelo ▴ 20

Hi,

I was trying to calculate the total read length of all the sample present in bioproject in command utility as:

esearch -db bioproject -query "PRJNA438426" | efetch -format docsum | xtract -pattern DocumentSummary -element RunTotalBases

which is giving me no output.

In another way

grep -c "^@" SRR6943205_1.fastq SRR6943205_2.fastq 
SRR6943205_1.fastq:18361521
SRR6943205_2.fastq:18361521

This is giving me the desired output but only when they are downloaded in fastq format in my system.

I have downloaded the SRA data in fastq format long ago, but they are already compressed (.fastq.gz). So in order to calculate the read length using second code snippet, I have to gunzip them all which is time as well as resource consuming for me now.

third option is to check every single fastqc html output of each sample individually which is also time taking.

I would greatly appreciate if someone could help me get me design a linux command to get my desired output.

Many thanks!

edirect NCBI SRA • 693 views

ADD COMMENT • link updated 8 months ago by GenoMax 141k • written 8 months ago by Nelo ▴ 20

0

Entering edit mode

8 months ago

GenoMax 141k

You need to do the following:

$ esearch -db bioproject -query "PRJNA438426" | elink -target sra | efetch -format runinfo | cut -d "," -f1,5
Run,bases
SRR6943215,6590673000
SRR6943216,6714951000
SRR6943213,7042552800
SRR6943214,6215158500
SRR6943219,7075158000
SRR6943220,6898378200

ADD COMMENT • link 8 months ago by GenoMax 141k

score 2 · Accepted Answer · 2023-08-24

2

Entering edit mode

8 months ago

bk11 ★ 2.4k

I have downloaded the SRA data in fastq format long ago, but they are already compressed (.fastq.gz). So in order to calculate the read length using second code snippet, I have to gunzip them all which is time as well as resource consuming for me now.

You can use your gz file and find number of reads and read length in your *fastq.gz file as follows-

zcat Your_SRR_R1.fastq.gz |awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > R1_length.txt
cat R1_length.txt #(Will show you number of reads in *R1.fastq.gz in first column and read length in the second column)
20299247 28

zcat Your_SRR_R1.fastq.gz |awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > R2_length.txt
cat R2_length.txt #(Will show you number of reads in *R2.fastq.gz in first column and read length in the second column)
20299247 91

ADD COMMENT • link 8 months ago by bk11 ★ 2.4k

0

Entering edit mode

Thank you for providing this code. Its really help me. Is there any way to count GC% of the sample using its accession?

ADD REPLY • link 8 months ago by Nelo ▴ 20

0

Entering edit mode

To calculate GC%, you can do-

zcat Your_SRR_R1.fastq.gz |awk '(NR%4==2) {A1+=length($0);gsub(/[AT]/,"");A2+=length($0);}END{print A2/A1;}' >r1_gc-pct.txt
cat r1_gc-pct.txt
0.482669 #(ie 48.26%)

ADD REPLY • link 8 months ago by bk11 ★ 2.4k