Confused with 2 SRA runs for one sample
7 months ago

Hello, I am completely new to sequencing and programming (and I am blonde), so please bear with me.

I have already seen that there are some questions about this, but I could not really work out what I have to do now. I have the task of recreating some figures from a paper with RStudio. I chose the single-cell RNA-Seq results from this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8077299/

I have already downloaded all SRA files via the SRA Toolkit and am converting them to FastQ files with the split-3 option. I know I have to check them with FastQC afterwards. But there are two SRA runs for each sample (this sample, for example: https://www.ncbi.nlm.nih.gov/sra/SRX8998846[accn] ). Why are there 2 SRA files, which will ultimately result in 4 FastQ files for one sample? I have read somewhere about "technical duplicates", but there is also this huge difference in size (6.9 GB and 18.3 GB), and when I click on the runs to get more information I get lost.

Can someone please explain to me why there are 2 SRA files and how I should proceed with them?

runs Illumina RNASeq sra • 1.1k views
7 months ago
GenoMax 111k

With 10x data I recommend that you always try to get the original data as submitted. It can be found under the Data Access tab (of one of your SRA records) and, for a change, seems to be available at no cost via the cloud (not always the case). The larger file would be the actual read and the smaller one the cell barcode + UMI.
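One way to check which file is which is to look at the read lengths. The sketch below is only an illustration: the function name `read_lengths` and the file names in the usage note are made up, and it assumes standard four-line FASTQ records. For 10x 3' libraries the barcode + UMI read is typically short (26 or 28 bases), while the cDNA read is longer.

```python
import gzip
from itertools import islice


def read_lengths(fastq_path, n_records=1000):
    """Return the set of read lengths seen in the first n_records of a FASTQ file.

    Assumes plain four-line FASTQ records; files ending in .gz are
    opened with gzip, everything else as plain text.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    lengths = set()
    with opener(fastq_path, "rt") as handle:
        # Each record is 4 lines; the sequence is the 2nd line of the record.
        for i, line in enumerate(islice(handle, 4 * n_records)):
            if i % 4 == 1:
                lengths.add(len(line.rstrip("\n")))
    return lengths
```

Calling `read_lengths("sample_R1.fastq.gz")` and `read_lengths("sample_R2.fastq.gz")` (hypothetical file names) should then show one short, fixed length for the barcode + UMI file and a longer one for the cDNA file.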

Both runs are for the same biosample. So they could be biological or technical replicates. See if there is more information in the publication.

Edit: Looking at the sample entry in GEO, one could be the scRNA-seq sample and the other the TAP-seq sample.


Thank you. Then I will read the publication again. If the second SRR run contains only cell barcode + UMI, how should I proceed, roughly speaking? Is it something along the lines of aligning them and trimming the ends?

Yes, there are actually three SRR runs for one sample: two with the same SAMN and SRX numbers (the ones I am/was confused about) and one with different SAMN/SRX numbers. That much I was able to recognize. But everything beyond that is still like magic to me.


I was referring to the two SRA runs for the SRX number you linked above. Since both use the same number of cycles, I am not sure which one is TAP-seq (I am not familiar with that technique). You can link the SRR numbers you are referring to if you want us to take a look.


This link gives a general overview of all samples: https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=1&WebEnv=MCID_60d31b728740610d17105811&o=acc_s%3Aa We see that each sample is listed three times, with completely different sizes in bytes and bases.

The third one of each triplet contains the TAP-seq reads; that much they have stated (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR12508051). But they do not say what exactly the difference between the first two is (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR12508049 and https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR12508050).


I think you should email the submitters and ask. It is confusing since the top two entries seem to have no distinguishable metadata other than different run numbers.

Run,Assay Type,AvgSpotLen,Bases,BioProject,BioSample,Bytes,Center Name,Consent,DATASTORE filetype,DATASTORE provider,DATASTORE region,Experiment,GEO_Accession (exp),Instrument,LibraryLayout,LibrarySelection,LibrarySource,Organism,Platform,ReleaseDate,Sample Name,source_name,SRA Study,tissue,treatment
SRR12508049,RNA-Seq,128,37726134656,PRJNA658984,SAMN15894110,18575378741,GEO,public,"fastq,sra","gs,ncbi,s3","gs.US,ncbi.public,s3.us-east-1",SRX8998836,GSM4743592,NextSeq 500,PAIRED,cDNA,TRANSCRIPTOMIC,Homo sapiens,ILLUMINA,2021-04-13T00:00:00Z,GSM4743592,Colon Organoids,SRP278628,Colon Organoids,Mock
SRR12508050,RNA-Seq,128,13804061952,PRJNA658984,SAMN15894110,6915281395,GEO,public,"fastq,sra","gs,ncbi,s3","gs.US,ncbi.public,s3.us-east-1",SRX8998836,GSM4743592,NextSeq 500,PAIRED,cDNA,TRANSCRIPTOMIC,Homo sapiens,ILLUMINA,2021-04-13T00:00:00Z,GSM4743592,Colon Organoids,SRP278628,Colon Organoids,Mock
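To see at a glance which runs belong together, a run table like the one above can be grouped by its `Experiment` column. This is a minimal sketch, not an official SRA tool: the function name `runs_per_experiment` is made up, and only a few of the columns are copied inline so that the example is self-contained. With a full export you would read the CSV file from disk instead.

```python
import csv
import io
from collections import defaultdict

# A subset of the columns from the run table above, copied inline.
RUN_TABLE = """Run,Bases,Bytes,BioSample,Experiment
SRR12508049,37726134656,18575378741,SAMN15894110,SRX8998836
SRR12508050,13804061952,6915281395,SAMN15894110,SRX8998836
"""


def runs_per_experiment(csv_text):
    """Map each Experiment (SRX) accession to a list of (Run, Bytes) tuples."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["Experiment"]].append((row["Run"], int(row["Bytes"])))
    return dict(groups)


groups = runs_per_experiment(RUN_TABLE)
for experiment, runs in groups.items():
    sizes = [size for _, size in runs]
    # Ratio of the smallest to the largest run under the same experiment.
    ratio = min(sizes) / max(sizes)
    print(experiment, [run for run, _ in runs], f"size ratio {ratio:.2f}")
```

Several runs under one experiment and one BioSample suggest the same library sequenced more than once, but the table alone cannot confirm that.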


Seems I have to do this.

I really appreciate your help. Thank you very much :)


Please post their clarification here if you get it from them. Would be interesting to know.

6 months ago

Hey I am sorry for the long wait. Had/have quite a lot to do.

This was their one and only response: "Each sample/library was run in two lanes of Hiseq, one full and 1/3 of another, thats why they have different size. if you want to run cellranger with the reads to replicate our study you would need to use both runs as an input."


That solves that mystery then. This would not have been apparent without hearing back from submitters.
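If both runs are to be given to cellranger as one sample, the FASTQ files need names in the bcl2fastq style that Cell Ranger expects (`Sample_S1_L001_R1_001.fastq.gz` and so on). The sketch below only plans the renaming and touches no files. It rests on assumptions that need checking against your own files: that each run was dumped to `<run>_1.fastq.gz` and `<run>_2.fastq.gz`, that `_1` is the barcode + UMI read and `_2` the cDNA read, and that giving each run its own lane number is an acceptable way to keep them apart. The helper names are made up.

```python
def cellranger_name(sample, lane, read):
    """Build a bcl2fastq-style file name, e.g. sample_S1_L001_R1_001.fastq.gz."""
    return f"{sample}_S1_L{lane:03d}_{read}_001.fastq.gz"


def plan_renames(sample, runs):
    """Return {current_name: new_name} for every run of one sample.

    Assumption: <run>_1 holds barcode + UMI (R1), <run>_2 holds cDNA (R2).
    Each run is treated as a separate lane: L001, L002, ...
    """
    plan = {}
    for lane, run in enumerate(runs, start=1):
        plan[f"{run}_1.fastq.gz"] = cellranger_name(sample, lane, "R1")
        plan[f"{run}_2.fastq.gz"] = cellranger_name(sample, lane, "R2")
    return plan


plan = plan_renames("GSM4743592", ["SRR12508049", "SRR12508050"])
for old, new in plan.items():
    print(old, "->", new)
```

Once the plan looks right, the actual renaming could be done with `os.rename` or by creating symbolic links, so that the original downloads stay untouched.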


If I may ask another question related to that: I now have 2 SRR files, in other words 4 FastQ files, for each sample. Can I simply merge both forward FastQ files together, and do the same with the reverse reads?

I read somewhere that this is not really recommended, as there might be issues like fragment length penalties. Is this true? Can I somehow prevent/avoid this?


Based on what you wrote above, it appears that the same library was sequenced on two lanes, making this a technical sequencing replicate. So it should be OK to merge the two files, in the same order for both R1 and R2 reads.
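Merging in the same order can be sketched as a plain byte-wise concatenation. The function name `concatenate` and the file names in the comments are made up for illustration. The approach works for plain FASTQ and also for gzipped files, because several gzip streams written one after another still form a valid gzip file. The inputs should be either all plain or all gzipped, not a mix.

```python
import shutil


def concatenate(inputs, output):
    """Append the input files to the output file, byte for byte, in the given order."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Hypothetical usage: keep the run order identical for R1 and R2, so that
# read n in the merged R1 file still pairs with read n in the merged R2 file.
# concatenate(["SRR12508049_1.fastq.gz", "SRR12508050_1.fastq.gz"], "merged_R1.fastq.gz")
# concatenate(["SRR12508049_2.fastq.gz", "SRR12508050_2.fastq.gz"], "merged_R2.fastq.gz")
```

A quick sanity check afterwards is to confirm that the merged R1 and R2 files contain the same number of records.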


Perfect. Again, thank you very much!