Question: What Is A "Spot" In SRA Format
10
7.4 years ago by
Daniel Standage3.8k
Davis, California, USA
Daniel Standage3.8k wrote:

I'm using the SRA toolkit to convert some SRA files to Fastq format. I've been looking at the documentation to make sure I'm doing things right, and the word spot keeps coming up. My question is twofold.

  1. What is a spot and how does it differ from a read?
  2. Where is this (officially) documented (or is it)?

The reason I've separated these two questions is that I think I know the answer to the first one, but I'm not sure and I can't find the answer in any of the documentation or online. Also, I expect more people will know the answer to question #1 than question #2.

sra fastq format • 9.4k views
ADD COMMENTlink modified 2.6 years ago by rrr40 • written 7.4 years ago by Daniel Standage3.8k
12
3.8 years ago by
doyourun120
United States
doyourun120 wrote:

This is the description I received from the SRA staff (Adam Stine).

The spot model is Illumina GA centric.  The flowcells have the locations where the adapters have stuck them to the glass of the lane.  There are X and Y coordinates that identify these 'spots'.   As the camera reads the fluorescent flashes during sequencing, the coordinates indicate which spot the new base is added to.   All of the bases for a single location constitute the spot.  There may be one or more divisions of those bases for technical reads (adapters, primers, barcodes, etc) and there will always be at least one biological read (forward, reverse).  I usually think of the technical reads as the "known" sequence and the biological as the "unknown".  When we store the data, the bases for a single spot are all stored as one string with the description of where the breaks occur as well as the type of read each segment represents.  The spot length is the expected total length for all reads (used as a check to make sure we have all the data).  As an example, a 2x150 run with a 6bp barcode and 12bp primer on the forward read would have 4 reads.

0 - barcode basecoord 1

1 - primer  basecoord 7

2 - forward basecoord 19

3 - reverse basecoord 151

---------------------------

But you only need to tell SRA about the barcode and primer if you submit sequences that contain them. In my case, a third party provided me with the BAM files and I do not have the untrimmed sequences.

So the spot data model is useful for supplying untrimmed BAM files, while still letting you specify where the biological reads begin.

In my case, I have 2x100 bp reads without an index and I am only supplying the application reads with the adapters trimmed, so I simply submit:

0 - forward basecoord 1 (Application read)

1 - reverse basecoord 101 (Application read)
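To make the base coordinates concrete, here is a minimal Python sketch of how a spot stored as one string could be cut into its reads. The function name `split_spot` and the placeholder bases are made up for illustration; this is not how the SRA toolkit itself is implemented, just the arithmetic implied by the two layouts above.

```python
def split_spot(bases, read_starts):
    """Split a spot's concatenated bases into its reads.

    read_starts holds the 1-based base coordinate at which each read
    begins, in ascending order; each read runs until the next one starts.
    """
    # Convert 1-based starts to 0-based offsets and append the spot end.
    offsets = [start - 1 for start in read_starts] + [len(bases)]
    return [bases[a:b] for a, b in zip(offsets, offsets[1:])]


# Hypothetical 300-base spot for the 2x150 example (placeholder bases).
spot = "N" * 300
reads = split_spot(spot, [1, 7, 19, 151])
read_lengths = [len(r) for r in reads]  # barcode, primer, forward, reverse

# The trimmed 2x100 layout with no technical reads.
trimmed_lengths = [len(r) for r in split_spot("N" * 200, [1, 101])]
```

With these coordinates the 2x150 spot splits into reads of 6, 12, 132 and 150 bases, and the trimmed 2x100 spot into two reads of 100 bases each.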

-----------------------------------------------------------

 

 

ADD COMMENTlink written 3.8 years ago by doyourun120

thanks for tracking this down and updating/posting here - I've always used a handwaving explanation for this

ADD REPLYlink written 3.8 years ago by Istvan Albert ♦♦ 79k
6
7.4 years ago by
Stefano Berri4.1k
Cambridge, UK
Stefano Berri4.1k wrote:

Hi.

I think a "spot" is where the read comes from. The spot might contain more than the read. The difference is that the "spot" could include all the "technical" information (adapters, tags, barcoding sequences), whereas the read is the actual biological sequence you are after. In many cases, however, spot and read coincide.

I don't know of any official documentation: the closest I could get is the description of how to make the XML files associated with the submission.

Good luck! If you discover anything in this regard, post it!

ADD COMMENTlink written 7.4 years ago by Stefano Berri4.1k
6
7.1 years ago by
Varun Gupta1.1k
United States
Varun Gupta1.1k wrote:

Hi, I agree with Stefano: a spot does contain more than a read. I didn't find any official document to confirm this, but when we use fastq-dump to convert an SRA file into FASTQ, it reports on completion "Written 38424688 spots for SRR032.sra". Now if we look at the FASTQ file, each read has 3 more lines attached to it, beginning with a header line that starts with @. Something like this:

@SRR032238.12186 HWI-EAS6:3:1:246:1981 length=50
GGCCAGCTCTACACCTTCAAGGCCGAGACGGAGGAGCTGAAGGGANGCTG
+SRR032238.12186 HWI-EAS6:3:1:246:1981 length=50
BBB@=@BBBABBBBBBBB>0>BBB@6@A?446/8+;AAA@=9(7-!817&

In total, each read takes up 4 lines. Now count the number of lines in your FASTQ file and divide by 4. That gives you the same number I mentioned above, i.e. 38424688 (it will of course be different for different files). So a spot corresponds to 4 lines in the FASTQ file, of which the read is one part.
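As a rough sketch of that check in Python: the helper name `count_fastq_records` is made up for illustration, and it assumes an unwrapped FASTQ file with exactly 4 lines per record. The result only matches the spot count when the file holds one record per spot; if the reads of a spot were written as separate records, the two numbers will differ.

```python
import io


def count_fastq_records(handle):
    """Count records in a FASTQ stream, assuming 4 lines per record."""
    line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    return line_count // 4


# In-memory stand-in for a FASTQ file, using the record shown above.
example = io.StringIO(
    "@SRR032238.12186 HWI-EAS6:3:1:246:1981 length=50\n"
    "GGCCAGCTCTACACCTTCAAGGCCGAGACGGAGGAGCTGAAGGGANGCTG\n"
    "+SRR032238.12186 HWI-EAS6:3:1:246:1981 length=50\n"
    "BBB@=@BBBABBBBBBBB>0>BBB@6@A?446/8+;AAA@=9(7-!817&\n"
)
n_records = count_fastq_records(example)
```

For a real file, pass an open file handle instead of the `io.StringIO` object.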

Hope this helps

ADD COMMENTlink written 7.1 years ago by Varun Gupta1.1k

Isn't your explanation a roundabout way of saying that the number of spots is exactly the same as the number of reads? That is not always true.

ADD REPLYlink written 3.3 years ago by rdbcasillas110
3
2.6 years ago by
rrr40
rrr40 wrote:

Official (but not as helpful as the above) explanations are here:

http://www.ncbi.nlm.nih.gov/books/NBK54984/

http://www.ncbi.nlm.nih.gov/books/NBK47533/

So a spot is all the info you got from one "spot" on the flow cell. This is not the same as a read. You get 4 reads per spot with today's Illumina sequencing: forward barcode, forward read, reverse barcode, reverse read. Straight out of the instrument these are 4 different files. Straight out of the SRA database, you specify which parts you want. SRA links all those reads together with a "spot" identifier, and you can use that to match up paired reads later. At least that is my interpretation of their and your descriptions.
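A minimal Python sketch of matching reads up by spot identifier, assuming the identifier is the first token of the FASTQ header (as in the SRR032238.12186 example earlier in the thread). The accession SRR000001, the sequences, and the helper names are placeholders for illustration only.

```python
from collections import defaultdict


def spot_id(header):
    """Return the first whitespace-delimited header token, minus '@'."""
    return header.lstrip("@").split()[0]


def group_by_spot(records):
    """Collect the sequences of all reads that share a spot identifier."""
    groups = defaultdict(list)
    for header, sequence in records:
        groups[spot_id(header)].append(sequence)
    return dict(groups)


# Hypothetical forward and reverse records, deliberately in different order.
forward = [("@SRR000001.1 1 length=4", "ACGT"),
           ("@SRR000001.2 2 length=4", "GGCC")]
reverse = [("@SRR000001.2 2 length=4", "TTAA"),
           ("@SRR000001.1 1 length=4", "CATG")]
pairs = group_by_spot(forward + reverse)
```

Each key in `pairs` is one spot, and its list holds the forward read followed by the reverse read, regardless of the order in which the records arrived.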

ADD COMMENTlink written 2.6 years ago by rrr40