How to append the cell barcode and UMI information to the fastq header in paired-end single-cell RNA-seq data?
2
1
Entering edit mode
3.2 years ago
sc243 ▴ 10

Hello everyone,

I have a question regarding my single-cell RNA-seq data. I have the following pair-end data in fastq.gz format.

Read1 (contains 6bp UMI, followed by 6bp cell barcode info and the rest is a polyT stretch):

@J00182:79:HV2WWBBXX:6:1101:11160:38873 1:N:0:ACAGTG
GAGAAGACAGTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTT
+
AAAFFJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ--A-----7AJJFAF-AJJJJJJJJJJJ<A<----A


Read2 is the normal read that I am using for mapping to the reference and the corresponding pair-end mate of the above read looks like this-

@J00182:79:HV2WWBBXX:6:1101:11160:38873 2:N:0:ACAGTG
GCATACTTATTTCCAAACTTTTGGAAAAAGCATAATTTGACAAAAAAGAATACAATTTTTTGCTGTTTCAACCAC
+
A<<AFJFJJJJJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ


Now I would like to append the cell barcode and UMI info from the read1 sequence in front of the header of my read2 in the following format- @6bpCellbarcode_6bpUMI#Read2header (with an underscore in between Cellbarcode and UMI and a hash between UMI and the rest of the header).

Example output-

@ACAGTG_GAGAAG#J00182:79:HV2WWBBXX:6:1101:11160:38873 2:N:0:ACAGTG
GCATACTTATTTCCAAACTTTTGGAAAAAGCATAATTTGACAAAAAAGAATACAATTTTTTGCTGTTTCAACCAC
+
A<<AFJFJJJJJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ


ACAGTG is the cell barcode and GAGAAG is the UMI. Note that the order is flipped here in the output as Read1 first contains UMI and later the cell barcode while the output I need is vice versa.

Can someone please tell me how to do that?

as usual, thank you so much!

rna-seq fastq sequence single-cell • 2.9k views
0
Entering edit mode

fastq.gz - is that two fastq files or one interleaved ?

1
Entering edit mode

Interleaved fastq files are an atrocity against nature.

0
Entering edit mode

Hi both of you! UMI tools worked perfectly! Thanks for the parse code too. It saved me some time to write my own! Thanks again!

4
Entering edit mode
3.2 years ago

Do yourself a favour and simply use umi_tools extract. The headers won't be formatted how you want, but will instead be formatted in a way more compatible with other tools.

4
Entering edit mode
3.2 years ago

Obviously as the UMI-tools author, I'm going to agree with @Devon Ryan! However, if you really want to do it this way I can think of two options:

Method 1: Extract the the UMI and CB using UMI tools and then reformat the headers. So, using the Biopython SeqIO fastq parser (I'm sure better parsers are available)

from Biopython import SeqIO
outfile = open("output.fastq", "w")
new_name = "%(cb)s_%(umi)s#%(id)s" % locals()

outfile.close()


There are definaly more efficient ways to write this, but I think this is the clearest. I also don't remember how SeqIO deals with the @, so you'll need to look into that.

Method 2: Write your own parsing code.

Take a look at extract.py from umi_tools, its not particulalry complex, and it should give you an idea for how to do the processing yourself.