Question: SRA Toolkit and bowtie2: 0.00% overall alignment rate - what am I missing?
1
gravatar for Bilal Akil
4.2 years ago by
Bilal Akil30
Sydney, Australia
Bilal Akil30 wrote:

Hi guys. I'm just getting started with bioinformatics and this is my first post on Biostars. Thanks in advance for all your help.

I've been trying to successfully acquire sample reads and align them for quite some time now, but things still seem to be going wrong. I'm using NCBI's Sequence Read Archive (SRA) to find sample data. I've been downloading and formatting that data using their SRA Toolkit, with tools like `prefetch` and `fastq-dump`. Then I've tried to align them using `bowtie2`.

I'll list the exact commands I've used:

prefetch SRR390728
sra-stat -x --quick SRR390728

This is what the statistics look like:

<Run accession="SRR390728" spot_count="7178576" base_count="516857472" base_count_bio="516857472" cmp_base_count="49610016">
  <Size value="193606607" units="bytes"/>
  <AlignInfo>
    <Ref seqId="GPC_000000394.1" name="5" .../>
    <Ref seqId="GPC_000000395.1" name="Y" .../>
    <Ref seqId="NC_000001.9" name="1" .../>
    <Ref seqId="NC_000002.10" name="2" .../>
    ...
    <Ref seqId="NC_000022.9" name="22" .../>
    <Ref seqId="NC_000023.9" name="X" .../>
    <Ref seqId="NC_001807.4" name="M" .../>
  </AlignInfo>
  <QualityCount>
    <Quality value="4" count="1285575"/>
    <Quality value="5" count="6274952"/>
    ...
    <Quality value="30" count="1110"/>
    <Quality value="33" count="4722"/>
  </QualityCount>
  <Databases>
    <Database>
      <Table name="PRIMARY_ALIGNMENT">
        <Statistics source="meta">
          <Rows count="12979096"/>
          <Elements count="467247456"/>
        </Statistics>
      </Table>
      <Table name="REFERENCE">
        <Statistics source="meta">
          <Rows count="616097"/>
          <Elements count="3080436051"/>
        </Statistics>
      </Table>
      <Table name="SEQUENCE">
        <Statistics source="meta">
          <Rows count="7178576"/>
          <Elements count="516857472"/>
        </Statistics>
      </Table>
    </Database>
  </Databases>
</Run>

To me this means there is alignment data in there, particularly in table: `PRIMARY_ALIGNMENT`. I guess raw sequencing data is also in the `SEQUENCE` table. `fastq-dump` documentation states that the `SEQUENCE` table is used by default, so the `--table` option is omitted from the first call.

fastq-dump -O ../output SRR390728
fastq-dump --table PRIMARY_ALIGNMENT -O ../output --fasta SRR390728

This leaves us with a couple of files in the `output` directory:

  • `SRR390728.fastq` - a `fastq` file containing the raw reads from the SRA file; and
  • `SRR390728.fasta` - a `fasta` file containing reads from the SRA file's `PRIMARY_ALIGNMENT` table.

Next, I move onto using `bowtie2` - first by indexing the aligned reads so we can use it as a reference:

bowtie2-build output/SRR390728.fasta output/bowtie/SRR390728

This outputs a whole bunch of files into the `output/bowtie` directory. Finally, I can try to align the raw reads:

bowtie2 -x output/bowtie/SRR390728 -U output/SRR390728.fastq -S output/SRR390728.sam

30 minutes later, I've got the following depressing output:

7178576 reads; of these:
  7178576 (100.00%) were unpaired; of these:
    7178576 (100.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
0.00% overall alignment rate

The output `sam` file looks like this (without the starting line numbers):

       1 @HD VN:1.0  SO:unsorted
       2 @SQ SN:SRR390728.1  LN:36
       3 @SQ SN:SRR390728.2  LN:36
       4 @SQ SN:SRR390728.3  LN:36
...
...
...
12979095 @SQ SN:SRR390728.12979094   LN:36
12979096 @SQ SN:SRR390728.12979095   LN:36
12979097 @SQ SN:SRR390728.12979096   LN:36
12979098 @PG ID:bowtie2  PN:bowtie2  VN:2.2.5    CL:".../bowtie2-2.2.5/bowtie2-align-s --wrapper basic-0 -x output/bowtie/SRR390728 -S output/SRR390728.sam -U output/SRR39         0728.fastq"
12979099 SRR390728.1 4   *   0   0   *   *   0   0   CATTCTTCACGTAGTTCTCGAGCCTTGGTTTTCAGCGATGGAGAATGACTTTGACAAGCTGAGAGAAGNTNC    ;;;;;;;;;;;;;;;;;;;;;;;;;;;9;;665142;;;;;;;;;;;;;;;;;;;;;;;;;;;;;96&&&&(             YT:Z:UU
12979100 SRR390728.2 4   *   0   0   *   *   0   0   AAGTAGGTCTCGTCTGTGTTTTCTACGAGCTTGTGTTCCAGCTGACCCACTCCCTGGGTGGGGGGACTGGGT    ;;;;;;;;;;;;;;;;;4;;;;3;393.1+4&&5&&;;;;;;;;;;;;;;;;;;;;;<9;<;;;;;464262             YT:Z:UU
12979101 SRR390728.3 4   *   0   0   *   *   0   0   CCAGCCTGGCCAACAGAGTGTTACCCCGTTTTTACTTATTTATTATTATTATTTTGAGACAGAGCATTGGTC    -;;;8;;;;;;;,*;;';-4,44;,:&,1,4'./&19;;;;;;669;;99;;;;;-;3;2;0;+;7442&2/   
...
...
...
20157672 SRR390728.7178574   4   *   0   0   *   *   0   0   AGAAAAAGGATGAATNNNNNNNNNNNNNNNANNNNNNNNNNNATNNCTTCNNNNGTNNANNNNNNNNNNNNT    ;;;;;;-;;;;;;*)%%%%%%%%%%%%%%%6%%%%%&&&&&&;4&&;;;;&&&&.;&&;&&&&&&&&         &&&&3    YT:Z:UU YF:Z:NS
20157673 SRR390728.7178575   4   *   0   0   *   *   0   0   AGTTTTAATTTTTNATATTTTACTTCATAGTCTTTTACACATTTTAAAATGACCTAAATTAACGACATATCA    ;;;;;;;8;;;;;%;&3;;,;&+;;1:)8+504&5/0776;;;16/10&1/.1;4.;;**4;0&7&&         6&*-&    YT:Z:UU
20157674 SRR390728.7178576   4   *   0   0   *   *   0   0   AATATCACAGCGANCGCTATAGATCGGAAGATCGGTATAGCGGTCGCTGTGATATTAGATCGGAAGAGCGTC    ;;;;;;1;;;;;1%;;;75;55:;:::%5720+,2/;;;;;;;;;;;;<;:4;;;;;:9;;;:;;,4         42&42    YT:Z:UU

Note that some of the pasted text (particularly surrounding the read qualities) might look messed up thanks to the online editor.

It seems that all of those reads have `4` as the value for their flag. The bowtie2 manual describes `4` as: "The read has no reported alignments".

Why not? What am I missing? Why was nothing aligned?

sra sra-toolkit alignment bowtie2 • 2.4k views
ADD COMMENTlink modified 4.2 years ago by Devon Ryan91k • written 4.2 years ago by Bilal Akil30
4
gravatar for Devon Ryan
4.2 years ago by
Devon Ryan91k
Freiburg, Germany
Devon Ryan91k wrote:
  1. Try to rely as little as possible on SRA. The fastq files from fastq-dump are fine to use, but I would strongly encourage you to never use anything else.
  2. This dataset is paired-end, you need to use the --split-files option.
  3. Download the version of the human genome you would like to align against and build the bowtie2 indices with that. Alternatively, you can download premade indices from iGenomes (the files are huge though, it's often faster to just build things yourself).
ADD COMMENTlink written 4.2 years ago by Devon Ryan91k

Why do you recommend not relying on SRA? By first impression it looks like everything I need in terms of DNA samples can be found there.

I'm not sure which genome I'd like to align against. At the moment I'm just trying to figure out a good procedure for finding good DNA samples, downloading and formatting the data and getting alignment started.

Also, how did you figure that the dataset in the question was paired-ended? Was it somewhere in the output statistics?

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Bilal Akil30

Like I said, the fastq files are fine, but trying to get reference sequences and such from SRA entries is likely to go badly, as you've already experienced. The reality is that typically people only include fastq files in their entries and even if they include more the question becomes whether you should really trust that the person uploading things did everything in a way that's useful for you.

You can see that the dataset is paired-end from looking at its entry on SRA's website. It's actually best to simply always specify --split-files.

ADD REPLYlink written 4.2 years ago by Devon Ryan91k

Worked after using the `--split-files` option as you suggested, however the results have led to a second question: Paired bowtie2: 91% overall alignment but 0.01% concordant - should I be worried?

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Bilal Akil30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 877 users visited in the last hour