Question: add UMI sequences to fastq read name
User6891 (Europe) wrote, 19 months ago:

Dear all,

I have paired-end fastq data generated with Illumina bcl2fastq v2.19 and sequenced on a NovaSeq. The i5 index is 7 bp long and the i7 index is 8 bp long.

R1.fastq.gz contains R1 101bp reads:

@A00154:125:HGKTMDMXX:1:1101:10420:1000 1:N:0:AACTGAGG+ATGCGTC
CTGGCCGTCTCAGCCGAGAAGCCGAGGATTGAATGGGCATGGAGACTGAACTACCCCTCTCACCTTTAGAGGTGGCTCCTCCAAGTCGGGGTTGACGCCCG
+
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

R2.fastq.gz contains the 6bp UMI sequence:

@A00154:125:HGKTMDMXX:1:1101:10420:1000 2:N:0:AACTGAGG+ATGCGTC   
GCGCGT
+
FFFFFF

R3.fastq.gz contains R2 101bp reads:

@A00154:125:HGKTMDMXX:1:1101:10420:1000 3:N:0:AACTGAGG+ATGCGTC
CTTCATAGGCCACAAAAAGCCCATATATCAGTGTCATCCACTAAGCCTCAGACACTGCAGCACGGGCAGCGGCAGTGCCAGCTTCGCCCACACTGCCCCTC
+
FFFFFFFFFFFFFFFFFFFFFF:FF:FFF:FFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

In a downstream analysis I want to use UMI-tools for deduplication. However, for that I need the UMI to be part of the read name:

@Instrument:RunID:FlowCellID:Lane:Tile:X:Y:UMI ReadNum:FilterFlag:0:IndexSequence or SampleNumber

There are tools that add a UMI to the read name when the UMI is present in the read itself. But in my case, the UMI is in a separate fastq. How can this be achieved?


Looking at the bcl2fastq manual, I have no idea how they made the UMI its own fastq. But bcl2fastq will trim the UMI off of the beginning of the read and put it in the read name if

Read1UMILength,6

TrimUMI,1

is in the sample sheet under "settings"

modified 19 months ago • written 19 months ago by swbarnes28.1k

That's what we tried first. However, according to Illumina tech support, we couldn't do this because we were sequencing with dual indexes and the UMI was only in the i7. The option you describe only works when you're sequencing with a single index.

written 19 months ago by User6891270

I'm also curious: which bases mask did you use for the demultiplexing to get these three fastqs?

written 12 months ago by anamaria30
finswimmer (Germany) wrote, 19 months ago:

An awk solution:

$ awk -v FS="\t" -v OFS="\t" 'NR==FNR {split($1, id, " "); umi[id[1]]=$2;  next;} {split($1, id, " "); $1=id[1]":"umi[id[1]]" "id[2]; print $0}'  <(zcat R2.fastq.gz|paste - - - -) <(zcat R1.fastq.gz|paste - - - -)|tr "\t" "\n"|bgzip -c > R1_umi.fastq.gz

$ awk -v FS="\t" -v OFS="\t" 'NR==FNR {split($1, id, " "); umi[id[1]]=$2;  next;} {split($1, id, " "); $1=id[1]":"umi[id[1]]" "id[2]; print $0}'  <(zcat R2.fastq.gz|paste - - - -) <(zcat R3.fastq.gz|paste - - - -)|tr "\t" "\n"|bgzip -c > R3_umi.fastq.gz

zcat decompresses the fastq.gz files, and paste joins the 4 lines belonging to each read into a single tab-delimited line.

While reading the first file (the UMI fastq), awk stores each UMI in an associative array, where the key is the header up to the first whitespace and the value is the UMI sequence.

While reading the second file, awk appends the UMI to the read ID and prints the line. tr then converts the tabs back to newlines, and bgzip compresses the output.

fin swimmer
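
The same idea can be sketched in Python without holding every UMI in memory. This is only a minimal sketch, and it assumes that the read file and the UMI file list the same reads in the same order (the usual case when both come from one demultiplexing run). The function names `read_records` and `add_umi` are made up for illustration. The sketch stops with an error if the read names disagree or the files have different lengths.

```python
from itertools import islice, zip_longest


def read_records(handle):
    """Yield (header, sequence, plus, quality) tuples from a FASTQ text handle."""
    while True:
        record = [line.rstrip() for line in islice(handle, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def add_umi(read_handle, umi_handle, out_handle, sep=":"):
    """Append each UMI to the matching read name, reading both files in lockstep.

    Assumption: both files contain the same reads in the same order.
    """
    pairs = zip_longest(read_records(read_handle), read_records(umi_handle))
    for read, umi in pairs:
        if read is None or umi is None:
            raise ValueError("files have different numbers of records")
        # The read name is the header up to the first space; the rest is the comment.
        name, _, comment = read[0].partition(" ")
        umi_name = umi[0].split(" ", 1)[0]
        if name != umi_name:
            raise ValueError(f"read name mismatch: {name} vs {umi_name}")
        header = f"{name}{sep}{umi[1]}"
        if comment:
            header += " " + comment
        out_handle.write("\n".join((header,) + read[1:]) + "\n")


# Hypothetical usage with the file names from the question:
# import gzip
# with gzip.open("R1.fastq.gz", "rt") as reads, \
#      gzip.open("R2.fastq.gz", "rt") as umis, \
#      gzip.open("R1_umi.fastq.gz", "wt") as out:
#     add_umi(reads, umis, out)
```

Run it once with R1.fastq.gz and once with R3.fastq.gz as the read file, as in the two awk commands above.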

atalbot wrote, 19 months ago:

If you align R1 and R3 to the genome of your choice, you can annotate the resulting BAM with the UMIs using the fgbio tool AnnotateBamWithUmis: https://fulcrumgenomics.github.io/fgbio/tools/latest/AnnotateBamWithUmis.html. This does require sufficient memory to hold the entire R2 (UMI) .fastq file.

i.sudbery (Sheffield, UK) wrote, 19 months ago:

Here is what I would do - use UMI-tools and do two passes, one to add the UMI to read1 and one to add the UMI to read2:

umi_tools extract --bc-pattern=NNNNNN --stdin=R2.fastq.gz --read2-in=R1.fastq.gz --stdout=R1.processed.fastq.gz --read2-stdout
umi_tools extract --bc-pattern=NNNNNN --stdin=R2.fastq.gz --read2-in=R3.fastq.gz --stdout=R3.processed.fastq.gz --read2-stdout
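
Note that umi_tools extract joins the UMI to the read name with an underscore rather than the colon used in the format shown in the question. As far as I know, umi_tools dedup by default takes the UMI to be whatever follows the last underscore in the read name, and its `--umi-separator` option changes that character. The sketch below only illustrates that convention. `umi_from_read_name` is a hypothetical helper, not part of UMI-tools.

```python
def umi_from_read_name(read_name, separator="_"):
    """Return the part of the read name after the last separator.

    Illustrative only: mimics the "UMI is the last field of the read name"
    convention, with "_" as the assumed default separator.
    """
    if separator not in read_name:
        raise ValueError(f"no {separator!r} found in read name {read_name!r}")
    return read_name.rsplit(separator, 1)[-1]


# Name as written by umi_tools extract (underscore separator).
print(umi_from_read_name("A00154:125:HGKTMDMXX:1:1101:10420:1000_GCGCGT"))

# Name in the colon-separated style from the question.
print(umi_from_read_name("A00154:125:HGKTMDMXX:1:1101:10420:1000:GCGCGT", separator=":"))
```

So if the UMI was added with a colon, for example by the awk approach, the deduplication step would need to be told to use ":" as the separator.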