4.0 years ago
MAPK
I am trying to download dbGaP SRA samples. I have been using fastq-dump with the command below, but for this project it is very slow because the datasets are larger. I wanted to switch to fasterq-dump, but I couldn't figure out how to split reads by RG tag. I tried fasterq-dump with the same options, but it doesn't seem to have a defline option. Any suggestions?
This is the command I use with fastq-dump:
# Download the run using the dbGaP repository key
prefetch --ngc /dbGaP/prj_222.ngc -X 9999999999999 ${SRR}
# Collect the @RG header lines, one array element per line
IFS=$'\n'
RGLINES=($(sam-dump --ngc /dbGaP/prj_222.ngc ./${SRR} | sed -n '/^[^@]/!p;//q' | grep ^@RG))
# Build a tee command with one gzipped output per read group
args=(tee)
for RGLINE in ${RGLINES[@]}; do
    unset IFS
    RG=(${RGLINE})
    args+=(\>\(grep -A3 --no-group-separator \"\\.${RG[1]#ID:}/[12]$\" \| gzip \> "./${SRR}.${RG[1]#ID:}.fastq-dump.split.defline.z.tee.fq.gz"\))
done
args+=(\>/dev/null)
echo "Splitting ${SRR}.sra into ${#RGLINES[@]} ReadGroups"
# Stream FASTQ with the read group in the defline and fan it out to the grep/gzip outputs
fastq-dump-orig.2.10.8 --ngc /dbGaP/prj_222.ngc --split-3 --defline-seq '@$ac.$si.$sg/$ri' --defline-qual '+' -Z "./${SRR}" | eval ${args[@]}
You should use prefetch to first download the SRA file and then use fastq-dump on that file. I am almost certain that fastq-dump alone will not manage to download large files without at least one connection error. prefetch is much more stable. See the last section of Fast download of FASTQ files from the European Nucleotide Archive (ENA).

I actually downloaded the SRA file with prefetch first and then used that in fastq-dump -Z "./${SRR}". Not sure if this is the correct way to use the downloaded SRA folder.
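The tee/grep fan-out in the question scans the full FASTQ stream once per read group. A single-pass alternative can be sketched with awk, under these assumptions: every record is exactly four lines, and the defline follows the '@$ac.$si.$sg/$ri' pattern from the question, with no dots in the accession or spot id. The function name split_fastq_by_rg, the output naming, and the "nogroup" fallback are hypothetical, not part of sra-tools.

```shell
# Hypothetical helper: split a FASTQ stream on stdin into one file per read group.
# Usage: <fastq producer> | split_fastq_by_rg OUTPUT_PREFIX
# Assumes 4-line records and deflines of the form @<accession>.<spot-id>.<read-group>/<read>
split_fastq_by_rg() {
    awk -v prefix="$1" '
        NR % 4 == 1 {
            rg = $0
            # drop "@accession.spotid." from the front
            sub(/^@[^.]*\.[^.]*\./, "", rg)
            # drop the trailing /1 or /2 read number
            sub(/\/[12]$/, "", rg)
            # reads without a spot group go to a catch-all file (assumed name)
            if (rg == "") rg = "nogroup"
            out = prefix "." rg ".fq"
        }
        # all four lines of the record go to the file chosen from the defline
        { print > out }
    '
}
```

It could replace the eval step, for example by piping the fastq-dump command from the question into split_fastq_by_rg "./${SRR}". The same function should work on any FASTQ stream whose deflines follow that pattern, so it may also apply to fasterq-dump output if your version lets you put the spot group in the defline (check fasterq-dump --help; this is an assumption, not something confirmed in this thread). Output is uncompressed here to keep the sketch short, and awk keeps one file open per read group, so runs with very many read groups may hit the open-file limit.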