Question: Using SRA Toolkit and Samtools to look at reads mapping to one region in public database
gravatar for garbuzov
12 weeks ago by
garbuzov0 wrote:

Hi everyone, I'm new to querying public data, so bear with me. I am interested in chip-seq reads mapping to a specific region of the genome and I want to quickly query lots of different data sets for this region.

From googling around this is the method I came up with. I tried it first on a positive control region which should have lots of mapped reads according to the paper. I get zero reads mapped.

Here are my commands:

/#for prdm1 on H3K27ac in day 6 pgc:

sam-dump --aligned-region 10:44170511-44216948 --output-file Prdm1_SRR1539456.sam SRR1539456

/#convert sam to bam:

samtools view -S -b Prdm1_SRR1539456.sam > Prdm1_SRR1539456.bam

/#count the number of mapped reads:

samtools flagstat Prdm1_SRR1539456.bam

When I call flagstat this is what I get: 115516081 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

0 + 0 mapped (0.00% : N/A)

0 + 0 paired in sequencing

0 + 0 read1

0 + 0 read2

0 + 0 properly paired (N/A : N/A)

0 + 0 with itself and mate mapped

0 + 0 singletons (N/A : N/A)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

The file Prdm1_SRR1539456.bam is 4.2G so it is not empty... what am I doing wrong? I'm really stumped.

Thanks in advance!

chip-seq alignment sra • 239 views
ADD COMMENTlink written 12 weeks ago by garbuzov0
gravatar for ATPoint
12 weeks ago by
ATPoint2.6k wrote:

The data you download are simply not aligned. You will have to download them as SRA, e.g. using the prefetch application, followed by fastq-dump or directly as fastq via the European Nucleotide Archive), which mirrors most NCBI data. The ENA offers download via Aspera (check the manual on the ENA pages), which allow downloads with up to 100Mb/s.

ADD COMMENTlink written 12 weeks ago by ATPoint2.6k

So, to clarify, you are saying the SRA file I am working with is not aligned so I have to do the alignment myself. I have to download the entire SRA file (That's what I'm trying to avoid.) From looking at the GEO how can I tell if the SRA is aligned?

ADD REPLYlink written 12 weeks ago by garbuzov0

Yes, the SRA archive typically contains the raw data from an experiment. To be honest, I've never seen that SRA contains aligned BAM or SAM files. This apparently exists sometimes (according to some brief google), but I have no recommendation on how to check that other than looking at the summary page for your accession number. I personally would also not trust any alignment I did not run myself, because you never know what exactly they aligned it to, so if you want to build a project on your data, I would recommend the following:

Sort out the accession codes you want and check the download paths on the ENA, e.g. for your file, the path would be:

Download the Aspera Connect client and follow the install manual. For download with aspera, you have to modify the path from ENA: becomes

Download with the following command:

 ASPERA_ENA="ascp -QT -l 300m -P33001 -i /full/path/to/your/asperaweb_id_dsa.openssh"
$ASPERA_ENA**vol1/fastq/SRR153/006/SRR1539456/SRR1539456.fastq.gz /target/dir

Check the manual on ENA for more details. From there, you will have to align the stuff and manually filter for the regions you are interested in.

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by ATPoint2.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1722 users visited in the last hour