Hi biostars,
I am working on bacterial short-read data downloaded from NCBI. The description mentions:
Design: Illumina sequencing of library 6053957, constructed from sample accession ERS135592 for study accession ERP001405. This is part of an Illumina multiplexed sequencing run (8727_8). This submission includes reads tagged with the sequence GTACATCT.
The tag sequence also occurs in the reference genome.
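For anyone who wants to reproduce this kind of check, here is a minimal Python sketch that counts how often an 8-base tag and its reverse complement occur in a sequence, and how many hits pure chance would predict. The function names, the toy sequence, and the 5 Mb genome size are illustrative assumptions, not values from the submission. The chance estimate also assumes uniform, independent bases, which real genomes only approximate.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement; non-ACGT characters are left unchanged."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_overlapping(seq: str, pattern: str) -> int:
    """Count occurrences of pattern in seq, including overlapping ones."""
    count = 0
    start = seq.find(pattern)
    while start != -1:
        count += 1
        start = seq.find(pattern, start + 1)
    return count


def count_tag(genome: str, tag: str) -> tuple[int, int]:
    """Return (forward-strand hits, reverse-strand hits) for tag in genome."""
    genome = genome.upper()
    tag = tag.upper()
    rc = reverse_complement(tag)
    forward = count_overlapping(genome, tag)
    # A palindromic tag equals its own reverse complement; avoid double counting.
    reverse = 0 if rc == tag else count_overlapping(genome, rc)
    return forward, reverse


def expected_hits(genome_length: int, tag_length: int) -> float:
    """Expected chance hits on both strands, assuming uniform random bases."""
    positions = genome_length - tag_length + 1
    return 2 * positions / 4 ** tag_length


# Toy sequence (hypothetical) containing the tag once on each strand.
toy_genome = "AAGTACATCTTTAGATGTACGG"
hits = count_tag(toy_genome, "GTACATCT")

# Assumed 5 Mb genome; the real genome size is not given in the post.
chance = expected_hits(5_000_000, 8)
```

Under these assumptions, an 8-mer is expected on the order of a hundred times in a genome of a few megabases, so finding the tag sequence in the reference would not be surprising in itself.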
I am really confused by this. I thought tags work as identifiers to separate different isolates in one run, and that they should be unique to each isolate (which is not the case here either) and absent from the reference genome.
My understanding could be completely wrong. Can anyone explain why such tags are used, and in which situations? Should I remove them before alignment? Maybe not, since they occur in the reference genome?
Many thanks
I wonder if this is related to this blog post.
I will check the blog. Thanks WouterDeCoster!