Question

Getting reference sequence for set of reads

0

Entering edit mode

3.1 years ago

martin.grasshoff • 0

I have a the reads from a single cell RNA seq experiment and I want to get the relevant reference sequence for them.

I subset these reads using samtools to only include the regions that interest me.

samtools view -@ 7 -q 30 -L $regions possorted_genome_bam.bam > output_sam.sam

I then also remove all reads, that do not have the CB field.

awk -F '\t' '{print $22}' output_sam.sam | sort | uniq -c | awk '{print $2}' | grep "CB" > $relevant_cells
grep -f $relevant_cells output_sam.sam > output_sam_cells.sam

From my reference fasta file, I get a region that overlaps with all my regions of interest. This reference fasta is the same file, I used for cellranger count.

bedtools getfasta -fi $REFERENCE_FASTA -bed $MY_REGION -fo region.fasta

I then use R to load the output_sam_cells.sam file and the region.fasta file. From the position of each read, I then subtract the beginning position of my larger region. That should give me the position for the string.

input$V4 <- input$V4 - region.start + 1  # + 1 because it is 1-based

But when I then compare the 2 sequences, they are completely different. For example read 1. The reads have a length of 91, so I take the sub string from the start (V4) to start +90.

 read <- input[1,]
 ref.subset <- substr(LARGER_REGION, start = read$V4, stop = read$V4 + 90)

Output:

read       "GCAGTGGTATCAACGCAGAGTACATGGGTAGGGTTTACGACCTCGATGTTGGATCAGGACATCCCAATGGTGTAGAAGCTATTAATGGTTC"
ref.subset "AGGGTTTACGACCTCGATGTTGGATCAGGACATCCTAATGGTGTAGCCGCTATTAAGGGTTCATTTATTCAATGATTAAAGTCCTATGTGA"

If i then take a sequence a bit upstream, I at least get a partial match.

 ref.subset <- substr(LARGER_REGION, start = read$V4 -29, stop = read$V4 -29 + 90)

Output:

 read      "GCAGTGGTATCAACGCAGAGTACATGGGTAGGGTTTACGACCTCGATGTTGGATCAGGACATCCCAATGGTGTAGAAGCTATTAATGGTTC"
ref.subset "GTGCAATCCTATTCTATCATATCAATAATAGGGTTTACGACCTCGATGTTGGATCAGGACATCCTAATGGTGTAGCCGCTATTAAGGGTTC"

But for the second read, I would have to use a window 31 positions upstream.

 read <- input[2,]
 ref.subset <- substr(LARGER_REGION, start = read$V4-31, stop = read$V4-31 + 90)

Output:

 read      "AAGCAGTGGTATCAACGCAGAGTACATGGGAGGGTTTACGACCTCGATGTTGGATCAGGACATCCCAATGGTGTAGAAGCTATTAATGGTT"
ref.subset "AGTGCAATCCTATTCTATCATATCAATAATAGGGTTTACGACCTCGATGTTGGATCAGGACATCCTAATGGTGTAGCCGCTATTAAGGGTT"

I do not expect a perfect match, but it seems that I am missing something very basic here.

RNA-Seq alignment samtools bedtools • 516 views

ADD COMMENT • link updated 3.1 years ago by Biostar 20 • written 3.1 years ago by martin.grasshoff • 0

1

Entering edit mode

awk -F '\t' '{print $22}' output_sam.sam | sort | uniq -c | awk '{print $2}' | grep "CB" > $relevant_cells
grep -f $relevant_cells output_sam.sam > output_sam_cells.sam

very dangerous if $22 isn't always the column for this tag , why uniq -c ? grep -f how about partial matching ? what if the tag appears elsewhere ?

I do not expect a perfect match, but it seems that I am missing something very basic here.

SOFT clip ?

ADD REPLY • link 3.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you for your answer.

very dangerous if $22 isn't always the column for this tag ,

Yes, not all lines have the same number of fields. But field 22 should be the cell barcode, if the read has been properly mapped. I will check this again.

why uniq -c ?

Thanks for noticing. I copied that part from a different script. There is no particular reason for counting.

grep -f how about partial matching ? what if the tag appears elsewhere ?

With this part, I just want to remove all the reads that are not correctly mapped. If I take a partial match, reads that are not correctly mapped could also be included. If the tag appears more than once, that would be okay. then it would just be a different read for the same cell.

SOFT clip ?

What do you mean by this? I assume that I should remove the first few nucleotides and just retain the matching part. But that would not solve the issue, that the reference sequence is completely wrong for the start position of the read.

ADD REPLY • link 3.1 years ago by martin.grasshoff • 0