Tutorial: Integrating VDJ sequencing data with Seurat
gravatar for atakanekiz
15 months ago by
atakanekiz250 wrote:

Jared.andrews07 wrote a previous tutorial for integrating TCR/VDJ sequencing data with Seurat object. I wanted to expand on this vignette to automate some data cleanup especially for Seurat objects created by combining more than one sequencing run. Unlike Jared's script, the method below will also retain information other than clonotype data (such as V, D, J genes sequenced, the sequence of CDR regions etc).

The code below was written for 10X genomics single-cell sequencing data. In my case, I needed to load the TCR clonotype data after I created a combined Seurat object by aggregating 4 separate runs. This script can be edited to work with a Seurat object created from a single run; or alternatively for loading VDJ data early on before aggregating the samples.

Prepare combined Seurat objects

Since I have 4 separate runs and cell barcodes are duplicated between preps, I appended sample information before the barcode for as see below. This makes the barcodes across experiments unique.

    # An example showing column renaming
    # n1 is the name of the sample (the others are n2, n3, n4)

    n1_data <- Read10X("data/EXPIDX1/")
    n2_data <- Read10X("data/EXPIDX2/")
    n3_data <- Read10X("data/EXPIDX3/")
    n4_data <- Read10X("data/EXPIDX4/")

colnames(n1_data) <- paste("n1", colnames(n1_data), sep = "_")
colnames(n2_data) <- paste("n2", colnames(n2_data), sep = "_")
colnames(n3_data) <- paste("n3", colnames(n3_data), sep = "_")
colnames(n4_data) <- paste("n4", colnames(n4_data), sep = "_")

    # Follow Seurat vignettes to aggregate 4 runs to generate "combined" Seurat object 

 # add.cell.ids argument can be used when you perform the merge step instead of this approach of prefixing 
 # Thanks to Jared for the suggestion

Correct VDJ-seq data barcodes

This is needed because in the previous chunk we manually changed the barcodes from something like this ATTAGGATTAGGTC to something like this n4_ATTAGGATTAGGTC. We need to make sure the VDJ data frame has matching cell barcodes. Also, the VDJ sequencing data have -1 at the end of the barcode name, which needs to removed for correct barcode matching.

# Load VDJ data (one csv per run)

n1_clone <- read.csv("data/EXPIDX1_filtered_contig_annotations.csv")
n2_clone <- read.csv("data/EXPIDX2_filtered_contig_annotations.csv")
n3_clone <- read.csv("data/EXPIDX3_filtered_contig_annotations.csv")
n4_clone <- read.csv("data/EXPIDX4_filtered_contig_annotations.csv")

# Create a function to trim unwanted "-1" and append sample information before barcodes

barcoder <- function(df, prefix,  trim="\\-1"){

  df$barcode <- gsub(trim, "", df$barcode)
  df$barcode <- paste0(prefix, df$barcode)


# Override barcode data with the suitable replacement

n1_clone <- barcoder(n1_clone, prefix = "n1_")
n2_clone <- barcoder(n1_clone, prefix = "n2_")
n3_clone <- barcoder(n1_clone, prefix = "n3_")
n4_clone <- barcoder(n1_clone, prefix = "n4_")

Combine VDJ-seq data from separate runs and organize data

We will now bring together VDJ data from separate runs to be able to load it into the combined Seurat object. In the VDJ data I have, cell barcodes are duplicated in rows because the sequencing of TRA and TRB genes creates multiple data points for the same cell. Jared's tutorial removes duplicates and just utilizes the clonotype info, but some researchers may want to have a more granular detail level in the meta.data slot of the Seurat object (such as the actual V/D/J genes sequenced).

So, what we want to do now is concatenating all the unique data we'd like to keep (which result in the duplication of the cell barcodes in the data frame) and collapse the duplicate rows. I'm sure there are other (and maybe better) ways of doing this, but I ended up using the code below:

# Combine VDJ data from 4 separate experiments

all_cln <- rbind(n1_clone,

# Generate a function that will concatenate unique data entries and collapse duplicate rows
# To do this, I first factorize the data and then get factor levels as unique data points
# Then I paste these  data points together separated with "__" to access later on if needed

data_concater <- function(x){

  x<- levels(factor(x))

  paste(x, collapse = "__")


# Use data.table package for efficient calculations. Dplyr takes a long time in my experience.


all_cln <- as.data.table(all_cln)

# Prepare a progress bar to monitor progress (helpful for large aggregations)

grpn = uniqueN(all_cln$barcode)
pb <- txtProgressBar(min = 0, max = grpn, style = 3)

# This code applies data_concater function per  barcodes to create a 
# concatenated string with  the information we want to keep

all_cln_collapsed <- all_cln[, {setTxtProgressBar(pb,.GRP); lapply(.SD, data_concater)} , by=barcode]

# Or you can use the simple code below if you don't care about the progress bar
# all_cln_collapsed <- all_cln[, lapply(.SD, data_concater), by=barcode]

# Assign row names for merging into combined Seurat object

rownames(all_cln_collapsed) <- all_cln_collapsed$barcode

Assign VDJ data to Seurat object

Now we will assign all the VDJ-sequencing data to combined Seurat object. Individual columns of VDJ-seq data will be assigned as columns of combined@meta.data data frame.

combined <- AddMetaData(combined, metadata = all_cln_collapsed)

Hopefully, this helps to integrate VDJ-TCR seq data with genome-wide gene expression data in your experiments. Let me know if you have any suggestions.

Best, Atakan

ADD COMMENTlink modified 10 months ago • written 15 months ago by atakanekiz250

Hey, this is great! One note would be that you don't need to manually add the sample prefix to the barcode in your first step. You can use add.cell.ids when you perform the merge step.

ADD REPLYlink written 15 months ago by jared.andrews077.2k

Thanks for the suggestion. I was recycling that code snipped since early times of Seurat. I gotta keep up with times!

ADD REPLYlink written 15 months ago by atakanekiz250

Also, it may be worthwhile to remove/ignore columns from your VDJ data that are FALSE or None in the "productive" column, as you'll end up with a lot of extra noise otherwise. Actually, it might be best to just get the clonotype id for each cell, then get the gene information from the consensus_annotations.csv file.

ADD REPLYlink written 15 months ago by jared.andrews077.2k

Hi Jared,

How exactly would one remove/ignore columns from VDJ data with FALSE or None? Is there a command to ignore "FALSE" or "None" values?

Thank you!

ADD REPLYlink written 5 months ago by divya.samuel0

You could use base R functionality or data wrangling packages such as dplyr for subsetting the VDJ data prior to Seurat.

ADD REPLYlink modified 5 months ago • written 5 months ago by atakanekiz250

Honestly, I tend to use cellaRepertorium or scRepetoire for this now, as this both do this in a more elegant way.

ADD REPLYlink written 5 months ago by jared.andrews077.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 840 users visited in the last hour