Question: Sra Format Questions: How To Get A Sam File That Can Be Converted To Bam?
2
gravatar for bede.portz
5.0 years ago by
bede.portz470
United States
bede.portz470 wrote:

I'm trying to obtain some published chip-seq data from another lab that is stored in the SRA. I have downloaded and installed the SRA toolkit. I am having some problems obtaining a SAM file, that I can convert to BAM, and ultimately, BED. I was hoping Biostars could clarify some things, I found the SRA toolkit documentation to be difficult to understand.

First of all, do all SRA files contain information that can be ouput as a sam file using sam-dump? Is there a standard set of data types that are included in an SRA file, or can it vary from dataset to dataset? If so, how does one know what type of data is included in a given SRA file?

I may well be misunderstanding many things, but I will include the steps I have attempted that ultimately result in a .bam file that cannot be converted to a .bed file. Any help in learning to access SRA files and transfer them to sam and then bam would be much appreciated.

  1. First I try to obtain a sam file using sam-dump

    $sam-dump SRR490207.sra > SRR490207.sam

Here is the first few lines of the output:

 HWUSI-EAS1694_0008:5:1:1061:20521    4    *    0    0    *    *    0    0    TGTGATCTGACCTTACCAATCTTTNCNNNNNNNNNN    DFFFBFDDFFFFFFFFFFFB@=B@############
HWUSI-EAS1694_0008:5:1:1061:19640    4    *    0    0    *    *    0    0    AATTCTACAACATCTCCAACAAATNTNNNNNNNNNN    DDDFFFFFFFFFEFEFFFFFCDC#############
 HWUSI-EAS1694_0008:5:1:1061:6084    4    *    0    0    *    *    0    0    TATCCCAAGCTACTCCGGGCCTGCNTNNNNNNNNNN    GGGGGEFFBFDGGFGFF?EECAAB############
  1. Next I try to convert the .sam file to a .bam file using samtools, using the code:

    $samtools view -S -b SRR490207.sam > SRR490207.bam

I get the following message:

[samopen] no @SQ lines in the header.

I do get a .bam file, which I try to sort using samtools, but again I get an error regarding the header:

samtools sort SRR490207.bam sortedSRR490207 
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_sort_core] truncated file. Continue anyway.
[bam_sort_core] merging from 3 files...

Finally, When I attempt to convert the .bam file to a bed file, using bamToBed I get an empty .bed file.

Can anyone help me take an SRA file to a .bam file?

Thanks

sra samtools sam chip-seq • 10k views
ADD COMMENTlink modified 12 months ago by Biostar ♦♦ 20 • written 5.0 years ago by bede.portz470
6
gravatar for Matt Shirley
5.0 years ago by
Matt Shirley8.8k
Cambridge, MA
Matt Shirley8.8k wrote:

The first thing you want to understand is that not every SRA entry has aligned reads. Since you want a BED file, which contains features, you will be interested in the aligned data. There is not a great way to check on this. First way is to use the SRA trace browser and look at the summary page (see the page for the accession in your example). There should be an "alignment" tab if there are aligned data in the SRA accession. In your example there is not.

Just for completeness I'll continue though. The second way to check for aligned data is to run the vdb-dump tool on your .sra archive and look for ALIGNMENT_COUNT in the resulting output.

There are no @SQ lines in the header because the @SQ lines are the "sequence library" which is a representation of the contigs in the reference that the reads were aligned to. Since your reads are not aligned this information is not present.

Finally your .bed file is empty because there are no alignments to extract features from.

As a possible solution you might try the fastq-dump tool to extract the reads in FASTQ format, and since the data are annotated as human ChIP-seq data you can use either bowtie2 or bwa to align the reads to a human reference genome.

ADD COMMENTlink written 5.0 years ago by Matt Shirley8.8k

Subsequent to my post I began to suspect something to the effect of what you described. fastq-dump does get result in a fastq file for another dataset from the same set of experiments, so I may just have to do the mapping myself as you suggest. Thanks.

ADD REPLYlink written 5.0 years ago by bede.portz470
1
gravatar for Gabriel R.
5.0 years ago by
Gabriel R.2.5k
Center for Geogenetik Københavns Universitet
Gabriel R.2.5k wrote:

wait... that has nothing to do with the SRA, the convertion to bam has had a problem. Can you do the following for me post the output of:

cat SRR490207.sam | wc -l

and

samtools view SRR490207.bam | wc -l

They should be the same, I suspect that they are not...

ADD COMMENTlink modified 5.0 years ago • written 5.0 years ago by Gabriel R.2.5k
$wc -l SRR490207.sam
29915343 SRR490207.sam

$samtools view SRR490207.bam | wc -l
[bam_header_read] EOF marker is absent. The input is probably truncated.
[main_samview] truncated file.
8897161
ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by bede.portz470

well there you go, something went awry during the sam->bam conversion. reconvert to bam and make sure both files tally to the same number. good luck !

ADD REPLYlink written 5.0 years ago by Gabriel R.2.5k

There is definitely an issue with the bam file being incomplete, but I believe that even the complete bam file will be insufficient for the OP's analysis since it is all unmapped reads.

ADD REPLYlink written 5.0 years ago by Matt Shirley8.8k
1
gravatar for genomax
16 months ago by
genomax62k
United States
genomax62k wrote:

Since someone appears to have touched this thread today (up-votes) I will add that a new blast program magicblast available from NCBI can directly accept SRA accession ID's and generate a SAM file (that can be converted BAM following usual means).

Edit: Included here for information. Make your own decision about using it.

ADD COMMENTlink modified 16 months ago • written 16 months ago by genomax62k
0
gravatar for Ming Tang
4.8 years ago by
Ming Tang2.4k
Houston/MD Anderson Cancer Center
Ming Tang2.4k wrote:

those sra files are not aligned. use fastq-dump first to get fastq file and then map with bowtie or tophat(if it is RNA-seq)

ADD COMMENTlink written 4.8 years ago by Ming Tang2.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1546 users visited in the last hour