Extracting single chromosome from WGS cram/bam files before downloading
1
0
Entering edit mode
7 weeks ago
berndmann • 0

For a study of VNTR copies I'm looking for high-coverage, publicly accessible WGS data such as 1KG, HGDP and SGDP. Downloading a whole CRAM file takes a long time when only one chromosome is needed, so I wonder whether it is possible to subset the remote CRAM file to a single-chromosome BAM file and download only that part of the data. If that is not possible, what is the fastest way to achieve this? The worst case would be to download the whole file and subset it locally, but I would like to avoid that.

Best, Bernd

VNTR 1KG HGDP SGDP WGS • 442 views

Great. Fantastic that samtools supports streaming remote files!

bernd:~$ samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam 17:7512445-7513455
[E::hts_open_format] Failed to open file "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam" : Protocol not supported
samtools view: failed to open "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam" for reading: Protocol not supported
bernd:~$ samtools --help

Program: samtools (Tools for alignments in the SAM format)
Version: 1.13 (using htslib 1.13+ds)

Usage:   samtools <command> [options]

When I try to run the example from their webpage, it fails with "Protocol not supported".

Make sure you are using a newer version of samtools. It works with v1.21, which is the latest; your samtools (1.13) is from 2021.

So it looks like it was just the samtools version. Thanks for the hint.

Subsetting a streamed CRAM file still seems to take a very long time on my end.

I do this: samtools view ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239482/NA12775.final.cram -r chrX --reference GRCh38_full_analysis_set_plus_decoy_hla.fa

This has already been running for roughly 20 minutes. I would assume it should be done in less than 10?

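Wrong parameter: -r selects a read group, not a region.

-r, --read-group STR ...are in read group STR

You want to pass the region as a positional argument after the file name: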

 samtools view ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239482/NA12775.final.cram  chrX --reference GRCh38_full_analysis_set_plus_decoy_hla.fa
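To actually keep the subset as a single-chromosome BAM, as asked in the original question, the corrected command can be wrapped roughly as below. This is only a sketch: the function name extract_region, the DRY_RUN switch and the output file name are made up for illustration, and it assumes a .crai index sits next to the remote CRAM so that samtools can fetch just the requested region.

```shell
# Sketch: pull one region from a remote CRAM into a local, indexed BAM.
# extract_region and DRY_RUN are hypothetical helpers, not part of samtools.
# Set DRY_RUN=1 to print the samtools commands instead of running them.
extract_region() (
    url=$1; region=$2; ref=$3; out=$4
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi
    # -b: write BAM, -T: reference FASTA used to decode the CRAM, -o: output file.
    # The region is a positional argument after the input, not a value for -r.
    $run samtools view -b -T "$ref" -o "$out" "$url" "$region" || exit 1
    # Index the local BAM so it can be queried by region later.
    $run samtools index "$out"
)

# Example call with the file from this thread (output name is an assumption):
# extract_region "ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239482/NA12775.final.cram" \
#     chrX GRCh38_full_analysis_set_plus_decoy_hla.fa NA12775.chrX.bam
```

Running it once with DRY_RUN=1 is a cheap way to check the argument order before starting a long transfer.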

Use 'https://' instead of 'ftp://'.
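If there are many URLs to convert, the scheme swap can be scripted. A minimal sketch, assuming the host serves the same path over HTTPS (true for the EBI URL in this thread as far as I can tell, but worth checking for other servers); to_https is a made-up helper name:

```shell
# Sketch: rewrite an ftp:// URL to https://, leaving other URLs untouched.
# to_https is a hypothetical helper; it only edits the string and does not
# check that the server really offers the file over HTTPS.
to_https() {
    case "$1" in
        ftp://*) printf 'https://%s\n' "${1#ftp://}" ;;
        *)       printf '%s\n' "$1" ;;
    esac
}

# Example:
# to_https "ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239482/NA12775.final.cram"
```

The result can then be passed to samtools view in place of the original URL.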
