How to cut some part of the header of reads in FASTQ file?
3.8 years ago
naeem40thju ▴ 10

The read headers in my FASTQ file are long, and I want to use UMI-tools. When I extract the UMI, it is added to the header ('_' separated) after @ST-E00205:943:HCF3YCCX2:4:1101:11495:1678, i.e. before the first space. When I then group the UMIs, each UMI is treated as unique, probably because of the part of the header that follows the space. So I want to discard everything after the space (1:N:0:NCCACGCG+NGATCTCG). How can I do that? Thanks.

@ST-E00205:943:HCF3YCCX2:4:1101:11495:1678 1:N:0:NCCACGCG+NGATCTCG 
ACCGGATGGTAGACCTGGAGGAGGGGAAAGCCGAGGTGGTGACGGGAGCGGCTGGGGGGGGAGTCCGGGATGGTAGGCGGAGCGGGCAGAGCACAGCAGCTCGTGTAGAAATGG
+ 
7-<--7--7-7F-----77----7---7-------------------7----77-7-----7------7---------7-7------7--7----77----------77-7---
next-gen sequencing • 1.1k views
3.8 years ago

The reason that UMI-tools adds the UMI after the first part of the read name is that most read mappers discard anything after the first space when they write the BAM file.

If you use UMI-tools to extract, say, the first 12nt of this read, you will get:

@ST-E00205:943:HCF3YCCX2:4:1101:11495:1678_ACCGGATGGTAG 1:N:0:NCCACGCG+NGATCTCG 
ACCTGGAGGAGGGGAAAGCCGAGGTGGTGACGGGAGCGGCTGGGGGGGGAGTCCGGGATGGTAGGCGGAGCGGGCAGAGCACAGCAGCTCGTGTAGAAATGG
+ 
-----77----7---7-------------------7----77-7-----7------7---------7-7------7--7----77----------77-7---
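
As a rough illustration of the transformation shown above (this is not UMI-tools' own code, and the function name `move_umi_to_name` is made up for this sketch), moving the first 12 nt into the read name amounts to something like:

```python
def move_umi_to_name(header, seq, qual, umi_len=12):
    """Illustrative only: move the first umi_len bases into the read name.

    The UMI is appended to the part of the header before the first space,
    and the same number of characters is trimmed from sequence and quality.
    """
    name, sep, comment = header.partition(" ")
    umi = seq[:umi_len]
    new_header = f"{name}_{umi}{sep}{comment}"
    return new_header, seq[umi_len:], qual[umi_len:]


# Shortened version of the read from the question (first 18 bases only)
header = "@ST-E00205:943:HCF3YCCX2:4:1101:11495:1678 1:N:0:NCCACGCG+NGATCTCG"
new_header, new_seq, new_qual = move_umi_to_name(
    header, "ACCGGATGGTAGACCTGG", "7-<--7--7-7F-----7"
)
print(new_header)
print(new_seq)
print(new_qual)
```

The point is only that the UMI ends up directly after the read name, with the rest of the header still following the space.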

When you map this, the BAM file will look like:

ST-E00205:943:HCF3YCCX2:4:1101:11495:1678_ACCGGATGGTAG    163     1    11064   255     35S13M  =       630874  619858   ACCTGGAGGAGGGGAAAGCCGAGGTGGTGACGGGAGCGGCTGGGGGGGGAGTCCGGGATGGTAGGCGGAGCGGGCAGAGCACAGCAGCTCGTGTAGAAATGG    -----77----7---7-------------------7----77-7-----7------7---------7-7------7--7----77----------77-7---

(note, I've invented everything other than the read name, sequence and quality)
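
To make the "anything after the first space is discarded" step concrete, here is a small sketch of how the read name in the BAM record relates to the FASTQ header, and how the UMI can then be read back off the end of that name. Both helper names are hypothetical and this only mimics the behaviour described above; it is not taken from any mapper or from UMI-tools.

```python
def bam_read_name(fastq_header):
    """Mimic a typical mapper: drop the leading '@' and cut at the first whitespace."""
    header = fastq_header[1:] if fastq_header.startswith("@") else fastq_header
    return header.split(None, 1)[0]


def umi_from_read_name(read_name, sep="_"):
    """Take whatever follows the last separator in the read name as the UMI."""
    return read_name.rpartition(sep)[2]


header = "@ST-E00205:943:HCF3YCCX2:4:1101:11495:1678_ACCGGATGGTAG 1:N:0:NCCACGCG+NGATCTCG"
name = bam_read_name(header)
print(name)
print(umi_from_read_name(name))
```

If the part after the space survives into the BAM read name, the second function would return the UMI plus that trailing text, which is why the UMI has to sit before the space.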


Hello Ian, old question, but I just ran into the same problem mentioned above without being able to find a suitable answer.

Processing my dual indexed reads, the UMI is appended to the first part of the header before a space.

@M08672:8:000000000-KRLFC:1:1101:10036:1113_TCT 1:N:0:ACCTAAGC+TGGATTCG
GGTCGATCGCGCCTCGAGGGCGGCCAGTATTATGCCAGGGAAGATGAAGGACACGGGGGCGTTTGGATTAGCCTGCAGTGTGGGGATTATGTAGTGCTCCGATATGAACGAAAATAGCTGGCCCCACCAAGATCGGAAGAGCACA

However, this naming scheme (read:is filtered:control number:index barcodes) is retained in the BAM file:

M08672:8:000000000-KRLFC:1:2114:25930:12617_AGA 1:N:0:GTCCTACT+CAGGATGA 99      NC_009333.1

If I look into the 'deduplicated_per_umi.tsv' file, the UMIs are labeled as:

AAA 1:N:0:ACCTAAGC+TAGATTCG, AAA 1:N:0:ACCTAAGC+TGGATTCG, AAC 1:N:0:AACTAAGC+TGGATTCG, etc.

Am I just misinterpreting the output or is there something else I should be doing?


That's odd, I've never seen an aligner do that before. Which aligner are you using?


We use BBTools bbmap for the mapping, as it has performed most consistently with our data (gammaherpesvirus).

Upon further review of the data, I assume that the multiple entries with the same UMI in 'deduplicated_per_umi.tsv' are caused by differences in the dual indexes that arise from sequencing errors. My plan is to run demuxbyname.sh from the BBMap suite before using UMI-tools. It won't address the naming scheme (read:is filtered:control number:index barcodes) being appended to the UMI, but should hopefully limit the number of entries.
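
One possible way to deal with the naming scheme itself, when the aligner keeps the whole header, would be to drop everything after the first space from the FASTQ headers after UMI extraction and before mapping. The sketch below is untested on real data and assumes plain (uncompressed) FASTQ with strictly four lines per record; the function names and file paths are placeholders, not part of UMI-tools or BBTools.

```python
def strip_header_comments(lines):
    """Yield FASTQ lines, cutting each header line at its first space.

    Assumes exactly four lines per record, so every fourth line
    (starting with the first) is a header.
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield line.rstrip("\n").split(" ", 1)[0] + "\n"
        else:
            yield line


def strip_file(in_path, out_path):
    """Apply strip_header_comments to a file (paths are placeholders)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(strip_header_comments(src))


# In-memory example using a shortened record in the style shown above
record = [
    "@M08672:8:000000000-KRLFC:1:1101:10036:1113_TCT 1:N:0:ACCTAAGC+TGGATTCG\n",
    "GGTCGATCGC\n",
    "+\n",
    "IIIIIIIIII\n",
]
print("".join(strip_header_comments(record)), end="")
```

Note that this discards the index barcodes from the header entirely, so any demultiplexing that relies on them would need to happen first.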
