programmatic fast search of 100-1000 short reads to a public server and obtain list of nearby genes
Entering edit mode
9.7 years ago

What are the options for programmatic fast searching of 100-1000 short reads to a public server and obtain list of nearby genes where the reads map?

Input: ~100-1000 short reads

Output: GFF list of genes it maps to or nearby genes

Genome UCSC is restrictive to the number of searches and AFAICS it won't allow programmatic use.

blat • 3.2k views
Entering edit mode

what is the target database that you want to search

Entering edit mode

Starting with human but ideally including common contaminants (viruses, bacteria) from a human blood sample.

Entering edit mode
9.7 years ago

I don't know of any online tool that supports rapid online querying.

For any situation where you want to match reads fast and efficiently the best solution is to download the genome and annotations and run the alignment and matching yourself. It is probably a five line script that people here could easily advise you on.

If that is not an option then your best bet is to use Galaxy and create a workflow that chains together the actions that you need.

Entering edit mode
9.7 years ago

This can be done offline and wont require too much of computational resource.

What you will need:

  • A fast short read aligner such as STAR or even bowtie (STAR is faster)
  • Genome sequence (you will have to build index for the genome for your aligner)
  • A GTF annotations file (get it from GENCODE or any other standard genome repository for your organism of interest)

Before you align remove redundant reads. Keep their counts if necessary.

Align the reads using any of these aligners and obtain the alignment co-ordinates. The default output is SAM format for STAR and a tabular format for bowtie (bowtie also gives SAM).

  • Column-3 of SAM shows the name of the reference sequence where alignment happened (chromosome)
  • Column-4 is sequence start
  • Column-10 is read sequence. Add the length of this to column 4 value to get stop site.

The columns are tab separated

Now define a window that you define as proximal/nearby (lets say 500nt).

Now all you have to do is find genes that lie ±500nt from your start/stop sites. In your reference GTF parse for the lines that have the feature "gene".

I am giving an example using awk. You can use any programming language you are comfortable with. Check the GTF format also.

Assuming that you made a file (reads.txt) from your SAM output in this format:

Chromosome <tab> Orientation (+/-) <tab> Start <tab> Stop

I am giving an example awk script:




        a[$1 FS $2][$3 FS $4] # store the co-ordinate information from your reads file

    $3=="gene" && ($4 FS $7) in a{ # quick parse for reference chromosome and orientation

        i=$4 FS $7
        for(j in a[i]){
            if(jj[1]>=$4 && jj[2]<=$5)
                print $0";contained"

            else if($4<=jj[2]+500 || $5>=jj[1]-500)
                print $0";partial overlap/proximal"

call it like this:

awk -f example.awk reads.txt annotations.gtf

NOTE: In the above script I have not considered antisense proximity. If you want to allow that then don't parse for orientation. Also gawk version < 4.0 doesn't allow multidimensional arrays. So install gawk>=4.0

The output is by default a GTF because you are printing selected lines of the reference GTF.

Entering edit mode
9.7 years ago

Answering my own question. Triggered by another user's request, there is now a "bwa shm" command in a dev branch of bwa:

git clone
cd bwa
git checkout -b dev remotes/origin/dev

./bwa shm $index # load the index into shared memory
./bwa mem ref.fa reads.fq
./bwa shm -l # list indices in shared memory
./bwa shm -d # unload all indices

The second part of the question can be adapted from another of the answers like this:

/opt/development/tests/bwa/bwa mem \
  /opt/scratch/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa \
  reads.fasta \
  2>/dev/null | \
  samtools view \
  -b \
  -t $index \
  - | \
  bedtools bamtobed -i - | \
  bedtools intersect \
  -b - \
  -a /opt/development/tests/annotations/gencode.v19.annotation.gene.gtf | \
  perl -lne 'print $1 if ($_ =~ /transcript_name\ \"(\S+)\"/)' | \
  sort | \
  uniq -c

      1 ERBB4
      2 PTPN11
      9 RAD50
Entering edit mode
9.7 years ago

Take a look at NCBI SRA-BLAST.


Login before adding your answer.

Traffic: 2226 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6