2.5 years ago
melissachua90
I want to download the dataset PRJNA281410 from SRA, along with the corresponding reference genome (FASTA format). My code:
esearch -db sra -query PRJNA281410 \
| elink -target assembly \
| efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath GenBank \
| cut -d ',' -f 1 \
| grep SRR \
| xargs -n 1 -P 4 fastq-dump --split-files --gzip --skip-technical SRR18781516
This produces many lines of the following warning about skipped spots:
fastq-dump warn: too many reads at spot id XXX, maximum YY supported, skipped
References:
If you are using xargs to pass values, why do you have a fixed SRR18781516 at the end of your command? Additionally, your command as posted does not work past the first search step. Many of these datasets are PacBio, so I don't think a blanket fastq-dump command will work.
Finally, I am not sure what you mean by "References:". Are you looking to get the reference genomes for the two bacteria that are part of the data?
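For what it's worth, here is a rough sketch of how the run accessions could be passed through xargs without hard-coding one at the end. It assumes the accession sits in the first column of the `efetch -format runinfo` CSV; the helper name `runs_from_runinfo` is made up for this example. As noted above, the PacBio runs may still not work with these fastq-dump options.

```shell
# Print unique run accessions (SRR/ERR/DRR) from SRA runinfo CSV read on stdin.
# Assumption: the accession is the first comma-separated column, and header
# or blank lines should be dropped.
runs_from_runinfo() {
    cut -d ',' -f 1 | grep -E '^[SED]RR[0-9]+$' | sort -u
}

# Usage (needs EDirect and sra-tools installed, plus network access):
#   esearch -db sra -query PRJNA281410 \
#     | efetch -format runinfo \
#     | runs_from_runinfo \
#     | xargs -n 1 -P 4 fastq-dump --split-files --gzip --skip-technical
```

With `-n 1`, xargs appends each accession as the last argument of its own fastq-dump call, so nothing needs to be typed after `--skip-technical`.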
fastq-dump threw an error without specifying the SRR.
You can check out ENA, which also provides FASTQ files.
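A minimal sketch of the ENA route, in case it helps. It assumes the ENA Portal API filereport endpoint returns a TSV with `run_accession` and `fastq_ftp` columns in that order, and that `fastq_ftp` holds semicolon-separated paths without a scheme prefix; the helper name `ena_fastq_urls` is hypothetical. Runs with no FASTQ listed are skipped.

```shell
# Turn an ENA filereport TSV (stdin) into one download URL per line.
# Assumption: column 1 is run_accession, column 2 is fastq_ftp, and
# fastq_ftp is a semicolon-separated list of paths lacking "ftp://".
ena_fastq_urls() {
    awk -F '\t' 'NR > 1 && $2 != "" {
        n = split($2, parts, ";")
        for (i = 1; i <= n; i++) print "ftp://" parts[i]
    }'
}

# Usage (needs curl and network access):
#   curl -s 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJNA281410&result=read_run&fields=run_accession,fastq_ftp&format=tsv' \
#     | ena_fastq_urls \
#     | xargs -n 1 -P 4 curl -s -O
```

It is worth looking at the raw TSV first to see which runs actually have FASTQ files before starting the downloads.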