Question

How to append the cell barcode and UMI information to the fastq header in paired-end single-cell RNA-seq data?

1

Entering edit mode

5.6 years ago

sc243 ▴ 10

Hello everyone,

I have a question regarding my single-cell RNA-seq data. I have the following pair-end data in fastq.gz format.

Read1 (contains 6bp UMI, followed by 6bp cell barcode info and the rest is a polyT stretch):

@J00182:79:HV2WWBBXX:6:1101:11160:38873 1:N:0:ACAGTG
GAGAAGACAGTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTT
+
AAAFFJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ--A-----7AJJFAF-AJJJJJJJJJJJ<A<----A

Read2 is the normal read that I am using for mapping to the reference and the corresponding pair-end mate of the above read looks like this-

@J00182:79:HV2WWBBXX:6:1101:11160:38873 2:N:0:ACAGTG
GCATACTTATTTCCAAACTTTTGGAAAAAGCATAATTTGACAAAAAAGAATACAATTTTTTGCTGTTTCAACCAC
+
A<<AFJFJJJJJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

Now I would like to append the cell barcode and UMI info from the read1 sequence in front of the header of my read2 in the following format- @6bpCellbarcode_6bpUMI#Read2header (with an underscore in between Cellbarcode and UMI and a hash between UMI and the rest of the header).

Example output-

@ACAGTG_GAGAAG#J00182:79:HV2WWBBXX:6:1101:11160:38873 2:N:0:ACAGTG
GCATACTTATTTCCAAACTTTTGGAAAAAGCATAATTTGACAAAAAAGAATACAATTTTTTGCTGTTTCAACCAC
+
A<<AFJFJJJJJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

ACAGTG is the cell barcode and GAGAAG is the UMI. Note that the order is flipped here in the output as Read1 first contains UMI and later the cell barcode while the output I need is vice versa.

Can someone please tell me how to do that?

as usual, thank you so much!

rna-seq fastq sequence single-cell • 5.1k views

ADD COMMENT • link 5.6 years ago by sc243 ▴ 10

0

Entering edit mode

fastq.gz - is that two fastq files or one interleaved ?

ADD REPLY • link 5.6 years ago by jomo018 ▴ 720

1

Entering edit mode

Interleaved fastq files are an atrocity against nature.

ADD REPLY • link 5.6 years ago by Devon Ryan 104k

0

Entering edit mode

Hi both of you! UMI tools worked perfectly! Thanks for the parse code too. It saved me some time to write my own! Thanks again!

ADD REPLY • link 5.6 years ago by sc243 ▴ 10

score 6 · Accepted Answer · 2018-09-11

Obviously as the UMI-tools author, I'm going to agree with @Devon Ryan! However, if you really want to do it this way I can think of two options:

Method 1: Extract the the UMI and CB using UMI tools and then reformat the headers. So, using the Biopython SeqIO fastq parser (I'm sure better parsers are available)

from Biopython import SeqIO
outfile = open("output.fastq", "w")
for read in SeqIO.parse("my_extracted_read.fastq", "fastq"):
    header = read.id
    read_name = header.split()[0]
    id, cb, umi = read_name.split("_")
    new_name = "%(cb)s_%(umi)s#%(id)s" % locals() 
    read.id = new_name + " " + " ".join(read_name[1:])
    SeqIO.write(read, outfile, "fastq")

outfile.close()

There are definaly more efficient ways to write this, but I think this is the clearest. I also don't remember how SeqIO deals with the @, so you'll need to look into that.

Method 2: Write your own parsing code.

Take a look at extract.py from umi_tools, its not particulalry complex, and it should give you an idea for how to do the processing yourself.

score 4 · Accepted Answer · 2018-09-11

4

Entering edit mode

5.6 years ago

Devon Ryan 104k

Do yourself a favour and simply use umi_tools extract. The headers won't be formatted how you want, but will instead be formatted in a way more compatible with other tools.

ADD COMMENT • link 5.6 years ago by Devon Ryan 104k