Question

Unrecognized Sequence in NGS reads

3.7 years ago

jer364 • 0

Hi All,

I downloaded some SRA files from a paper that used single cell sequencing with unique barcodes for each cell. I took a closer look at some of the sequences that didn't map to a reference genome (the sample was a community standard so species were known) and hoped someone could help me interpret what I was looking at.

Example Sequence 1: (Sorry if this isn't intuitive; I wanted to label the known segments. The 5' end is at the top; read left to right and top to bottom; length = 151.)

Assuming this is genomic DNA 5' - GATTATGTCGCACTGTACCCGGAAAAATTAGCGGATATTAAG-

Nextera Adapter (unsure of the T) -T-CTGTCTCTTATACACATCTCCGA-

custom index sequence -GCCCACGAGACGTGTCGGGGCTGGCTTA-

barcode -TTAAACGGACCTAGA-

flow-cell adapter - CTATGCGGCATCAGAGCAGATTGTACTCGCTATTACGCCAGC - 3'

My understanding is that if the genomic fragment is small enough, sequencing may continue into the adapter, so no problems here. However, I don't know how to explain the next example sequence.
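To illustrate the read-through idea, here is a minimal Python sketch (not taken from the paper's pipeline) that looks for an exact match to the Nextera read-through sequence CTGTCTCTTATACACATCT in Example Sequence 1 and splits the read at that point. The function name `split_at_adapter` is hypothetical, and exact matching is a simplifying assumption; dedicated trimmers such as cutadapt also tolerate mismatches and partial adapters.

```python
# Nextera read-through sequence, as it appears in Example Sequence 1.
NEXTERA_READTHROUGH = "CTGTCTCTTATACACATCT"

# Example Sequence 1, reassembled from the labelled segments above.
READ_1 = (
    "GATTATGTCGCACTGTACCCGGAAAAATTAGCGGATATTAAG"  # presumed genomic DNA
    "T"                                            # the ambiguous T
    "CTGTCTCTTATACACATCTCCGA"                      # Nextera adapter
    "GCCCACGAGACGTGTCGGGGCTGGCTTA"                 # custom index sequence
    "TTAAACGGACCTAGA"                              # barcode
    "CTATGCGGCATCAGAGCAGATTGTACTCGCTATTACGCCAGC"   # flow-cell adapter
)


def split_at_adapter(read, adapter=NEXTERA_READTHROUGH):
    """Return (insert, read_through); read_through is '' if there is no exact match."""
    pos = read.find(adapter)
    if pos == -1:
        return read, ""
    return read[:pos], read[pos:]


insert, read_through = split_at_adapter(READ_1)
print(len(READ_1), len(insert), len(read_through))
```

For Example Sequence 1 this leaves a 43 nt insert (the 42 nt genomic stretch plus the ambiguous T), with the remaining 108 nt being adapter read-through.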

Example Sequence 2: (unknown sequence in parentheses at the end of the flow cell adapter)

partial Nextera Adapter (first nt should be a C) 5' - ATTATACACATCTCCGA-

custom index sequence - GCCCACGAGAGTGTCGGGCTGGCTTA-

barcode - TAGGGTCGCGGCCAG-

flow-cell adapter CTATGCGGCATCAGAGCAGATTGTACTCGCTATTACGCCAGCTGATCTCGTATGCCGTCTTCTGCTTG(ACCAAACATACTCTTTTCCTCTTCC) -3'

For the flow-cell adapter, the nts leading up to the portion in parentheses are complementary to the Illumina P7 adapter sequence. If sequencing was carried out to the end of the fragment, wouldn't the read end with the last nts of the adapter, or are the nts in parentheses coming from a flow-cell oligo? Also, how would sequencing begin upstream of the genomic DNA? Or am I completely misunderstanding something? Thank you for any help!
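The P7 observation can be checked with a short Python sketch. It assumes the standard Illumina P7 sequence 5'-CAAGCAGAAGACGGCATACGAGAT-3', takes its reverse complement, finds it in Example Sequence 2, and reports whatever follows. The helper names `reverse_complement` and `tail_after` are hypothetical, and the sketch only isolates the unexplained bases; it does not say where they come from.

```python
# Illumina P7 sequence (5'->3'); an assumption of this sketch, not stated in the post.
P7 = "CAAGCAGAAGACGGCATACGAGAT"

# Example Sequence 2, reassembled from the labelled segments above.
READ_2 = (
    "ATTATACACATCTCCGA"            # partial Nextera adapter
    "GCCCACGAGAGTGTCGGGCTGGCTTA"   # custom index sequence
    "TAGGGTCGCGGCCAG"              # barcode
    "CTATGCGGCATCAGAGCAGATTGTACTCGCTATTACGCCAGCTGATCTCGTATGCCGTCTTCTGCTTG"  # flow-cell adapter
    "ACCAAACATACTCTTTTCCTCTTCC"    # unknown sequence (in parentheses above)
)


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def tail_after(read, motif):
    """Return the bases following the first exact match of motif, or None if absent."""
    pos = read.find(motif)
    return None if pos == -1 else read[pos + len(motif):]


p7_rc = reverse_complement(P7)
tail = tail_after(READ_2, p7_rc)
print(p7_rc)
print(tail)
```

If the second printed line matches the sequence in parentheses, the read does continue past the end of the P7-complementary stretch, which is the part that needs explaining.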

sequencing • 912 views

ADD COMMENT • link 3.7 years ago by jer364 • 0

If this is 10x data, then Read 1 consists of the cell barcode and UMI only. That read does not contain any usable genomic sequence information; Read 2 contains the actual sequence. See this link for more.

ADD REPLY • link 3.7 years ago by GenoMax 141k

Hi @genomax,

These are unpaired reads. The only files are the R1 and Index files. So Read 1 contains what should be the genomic info, and Read 2 has the 15 bp barcode sequence.

ADD REPLY • link 3.7 years ago by jer364 • 0

Can you post an example SRA accession number for the dataset you are referring to?

ADD REPLY • link 3.7 years ago by GenoMax 141k

Right now I'm working with a single sample, SRR5202186, to get a pipeline established.

ADD REPLY • link 3.7 years ago by jer364 • 0

Sorry, I forgot that the index file wasn't included in the SRA upload. The files for that sample can be found on their GitHub, https://github.com/AbateLab/SiC-seq, linked in an issue posted by jessieren.

ADD REPLY • link 3.7 years ago by jer364 • 0