4 months ago
Ana
Hi all, I want to obtain polyA tail lengths for single cell RNA-seq data. I have paired-end reads, and I planned on counting the Ts on R1 and then finding the corresponding barcodes in R2 so I could integrate that info into my Seurat object. This is the code I am using to obtain a table with barcodes and polyA tail lengths:
gzip -dc sp_Hyun_S1_L001_R1_001.fastq.gz | awk 'NR%4==2' > r1_seqs.txt
awk '{ print gsub(/T/, "", $0) }' r1_seqs.txt > polyA_lengths.txt
paste whitelist_test.txt polyA_lengths.txt > barcode_polyA.tsv
But the file "barcode_polyA.tsv" doesn't have the structure I expected: not every polyA tail length has a barcode associated with it. I'd appreciate any insights into what could be happening. Thanks!
I don't think that is possible, unless the inserts are short enough that R2 reads across the poly-A tail and into the barcode at the other end. Here is the library structure of 10x libraries (in general), which should make it clear: https://cdn.10xgenomics.com/image/upload/v1660261286/support-documents/CG000108_AssayConfiguration_SC3v2.pdf
If there is no valid barcode associated with that R1 read, then that read pair may need to be discarded since that library fragment may not be a valid 10x library construct.
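For what it's worth, here is a minimal sketch of pairing reads record by record and dropping pairs without a whitelisted barcode, rather than pasting the whitelist next to the lengths (the whitelist is just a list of allowed barcodes, not one line per read). It assumes a 10x 3' style layout where the first 16 bases of R1 are the cell barcode and the cDNA (and any poly-A stretch) is in R2. The function name `barcode_polya`, the 16-base offset, and the "longest run of As" measure are my assumptions, not something from your data. The number it reports is only the A stretch visible within the read, so treat it as a rough lower bound and not a true tail length.

```shell
# Usage: barcode_polya R1.fastq.gz R2.fastq.gz whitelist.txt
# Prints "<barcode><TAB><longest run of As in R2>" for each read pair whose
# R1 barcode is in the whitelist. Hypothetical helper; the 16-base barcode
# position is an assumption about the library layout.
barcode_polya() (
  tmpdir=$(mktemp -d) || exit 1
  # Line 2 of every 4-line FASTQ record is the sequence.
  gzip -dc "$1" | awk 'NR % 4 == 2' > "$tmpdir/r1.txt"
  gzip -dc "$2" | awk 'NR % 4 == 2' > "$tmpdir/r2.txt"
  # R1 and R2 list the same fragments in the same order, so paste pairs them.
  paste "$tmpdir/r1.txt" "$tmpdir/r2.txt" |
  awk -v OFS='\t' '
    NR == FNR { valid[$1] = 1; next }   # first input: the barcode whitelist
    {
      bc = substr($1, 1, 16)            # assumed barcode position in R1
      if (!(bc in valid)) next          # discard pairs without a valid barcode
      rest = $2; best = 0
      while (match(rest, /A+/)) {       # visit every run of As, keep the longest
        if (RLENGTH > best) best = RLENGTH
        rest = substr(rest, RSTART + RLENGTH)
      }
      print bc, best
    }' "$3" -
  rm -rf "$tmpdir"
)
```

Something like `barcode_polya sp_Hyun_S1_L001_R1_001.fastq.gz sp_Hyun_S1_L001_R2_001.fastq.gz whitelist_test.txt > barcode_polyA.tsv` would then give one row per retained read pair (the R2 file name here is a guess based on your R1 name). You would still need to collapse rows per barcode before adding anything to the Seurat metadata.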