Question: add UMI sequences to fastq read name
User6891 (Europe) wrote 8 months ago (0 votes):

Dear all,

I have paired-end fastq data generated with Illumina bcl2fastq v2.19 and sequenced on a NovaSeq. The i5 index is 7 bp long, the i7 index 8 bp long.

R1.fastq.gz contains R1 101bp reads:

@A00154:125:HGKTMDMXX:1:1101:10420:1000 1:N:0:AACTGAGG+ATGCGTC
CTGGCCGTCTCAGCCGAGAAGCCGAGGATTGAATGGGCATGGAGACTGAACTACCCCTCTCACCTTTAGAGGTGGCTCCTCCAAGTCGGGGTTGACGCCCG
+
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

R2.fastq.gz contains the 6 bp UMI sequence:

@A00154:125:HGKTMDMXX:1:1101:10420:1000 2:N:0:AACTGAGG+ATGCGTC   
GCGCGT
+
FFFFFF

R3.fastq.gz contains R2 101bp reads:

@A00154:125:HGKTMDMXX:1:1101:10420:1000 3:N:0:AACTGAGG+ATGCGTC
CTTCATAGGCCACAAAAAGCCCATATATCAGTGTCATCCACTAAGCCTCAGACACTGCAGCACGGGCAGCGGCAGTGCCAGCTTCGCCCACACTGCCCCTC
+
FFFFFFFFFFFFFFFFFFFFFF:FF:FFF:FFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

In a downstream analysis I want to use UMI-tools for deduplication. However, for that I need the UMI to be part of the read name:

@Instrument:RunID:FlowCellID:Lane:Tile:X:Y:UMI ReadNum:FilterFlag:0:IndexSequence or SampleNumber

There are tools to add a UMI to the read name when the UMI is present in the read itself. But in my case, the UMI is in a separate fastq. How could this be achieved?

Tags: fastq, umi, illumina

Looking at the bcl2fastq manual, I have no idea how they made the UMI its own fastq. But bcl2fastq will trim the UMI off of the beginning of the read and put it in the read name if

Read1UMILength,6

TrimUMI,1

is in the sample sheet under "settings"

modified 8 months ago • written 8 months ago by swbarnes2

That's what we tried at first. However, according to Illumina tech support, we couldn't do this because we were sequencing with dual indexes and the UMI was only in the i7. The option you describe only works for single-index sequencing.

written 8 months ago by User6891

I'm also curious: which bases mask did you use for the demultiplexing to get these three fastqs?

written 6 weeks ago by anamaria30

finswimmer (Germany) wrote 8 months ago (3 votes):

An awk solution:

$ awk -v FS="\t" -v OFS="\t" 'NR==FNR {split($1, id, " "); umi[id[1]]=$2;  next;} {split($1, id, " "); $1=id[1]":"umi[id[1]]" "id[2]; print $0}'  <(zcat R2.fastq.gz|paste - - - -) <(zcat R1.fastq.gz|paste - - - -)|tr "\t" "\n"|bgzip -c > R1_umi.fastq.gz

$ awk -v FS="\t" -v OFS="\t" 'NR==FNR {split($1, id, " "); umi[id[1]]=$2;  next;} {split($1, id, " "); $1=id[1]":"umi[id[1]]" "id[2]; print $0}'  <(zcat R2.fastq.gz|paste - - - -) <(zcat R3.fastq.gz|paste - - - -)|tr "\t" "\n"|bgzip -c > R3_umi.fastq.gz

zcat decompresses the fastq.gz files, and paste joins the 4 lines belonging to each read into one tab-delimited line.

While reading the first file (the UMI fastq), awk stores the UMIs in an associative array, where the key is the header up to the first whitespace and the value is the UMI sequence.

While reading the second file, awk appends the UMI to the id and prints the line. tr then converts the tabs back to newlines, and bgzip compresses the result.
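
The command above keeps every UMI in memory. If the fastq files list the reads in the same order (an assumption here, so the sketch checks it), the records can instead be paired line by line. The following is only a sketch on toy data: the file names toy_R1.fastq.gz, toy_R2.fastq.gz and toy_umi.tsv are made up, gzip stands in for bgzip, and a temporary file replaces process substitution so that it runs in plain sh.

```shell
#!/bin/sh
set -eu

# Toy inputs (hypothetical file names): one read in R1 and its UMI in R2.
printf '@r1 1:N:0:AAC+ATG\nACGT\n+\nFFFF\n' | gzip -c > toy_R1.fastq.gz
printf '@r1 2:N:0:AAC+ATG\nGCGCGT\n+\nFFFFFF\n' | gzip -c > toy_R2.fastq.gz

# Flatten each 4-line UMI record to one tab-separated line (temporary file).
gzip -dc toy_R2.fastq.gz | paste - - - - > toy_umi.tsv

# Flatten the reads the same way and put them beside the UMI records:
# fields 1-4 come from the UMI file, fields 5-8 from the read file.
gzip -dc toy_R1.fastq.gz | paste - - - - | paste toy_umi.tsv - |
awk -F '\t' '{
    split($1, u, " ")    # u[1]: UMI record name without the comment
    split($5, r, " ")    # r[1]: read name, r[2]: comment
    if (u[1] != r[1]) {  # files out of sync: report and stop awk
        print "name mismatch at record " NR > "/dev/stderr"
        exit 1
    }
    print r[1] ":" $2 " " r[2]    # header with the UMI appended
    print $6; print $7; print $8  # sequence, "+" line and qualities unchanged
}' | gzip -c > toy_R1_umi.fastq.gz

# Show the rewritten header.
gzip -dc toy_R1_umi.fastq.gz | sed -n '1p'
```

On the toy data the last line prints @r1:GCGCGT 1:N:0:AAC+ATG. For the real files, the same pipeline would be run once with R1.fastq.gz and once with R3.fastq.gz as the read file.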

fin swimmer

atalbot wrote 8 months ago (2 votes):

If you align R1 and R3 to the genome of your choice, you can annotate the resulting BAM with the UMIs using the fgbio tool AnnotateBamWithUmis: https://fulcrumgenomics.github.io/fgbio/tools/latest/AnnotateBamWithUmis.html. This does require enough memory to hold the entire R2 (UMI) fastq file.

i.sudbery (Sheffield, UK) wrote 8 months ago (0 votes):

Here is what I would do: use UMI-tools in two passes, one to add the UMI to read 1 and one to add it to read 2:

umi_tools extract --bc-pattern=NNNNNN --stdin=R2.fastq.gz --read2-in=R1.fastq.gz --stdout=R1.processed.fastq.gz --read2-stdout
umi_tools extract --bc-pattern=NNNNNN --stdin=R2.fastq.gz --read2-in=R3.fastq.gz --stdout=R3.processed.fastq.gz --read2-stdout
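
Before deduplicating, it can be worth checking that the UMI can be parsed back out of a processed read name. The sketch below assumes the UMI is the last separator-delimited field of the read ID and that the separator is an underscore, which is the default of the --umi-separator option of umi_tools dedup; for colon-joined names, as asked for in the question, sep would be ':' and dedup would need --umi-separator=":". The header is a made-up example built from the read shown in the question.

```shell
#!/bin/sh
set -eu

# Toy header (hypothetical), shaped like a read name with the UMI joined by "_".
header='@A00154:125:HGKTMDMXX:1:1101:10420:1000_GCGCGT 1:N:0:AACTGAGG+ATGCGTC'
sep='_'      # assumed separator; use ':' for colon-joined names
umi_len=6    # UMI length stated in the question

# The read ID is everything before the first space;
# the UMI is taken as its last sep-delimited field.
umi=$(printf '%s\n' "$header" | awk -v sep="$sep" '{ n = split($1, f, sep); print f[n] }')

if [ "${#umi}" -eq "$umi_len" ]; then
    echo "UMI ok: $umi"
else
    echo "unexpected UMI: $umi" >&2
    exit 1
fi
```

To test a real file, header could be filled from the first line of R1.processed.fastq.gz instead of the toy string.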