I am looking to generate artificial single-end Illumina sequencing data from a FASTA file containing a number of short sequences representing unique molecular identifiers (UMIs). However, I've found limited information on this online and need some help. The start of the input FASTA resembles this (in practice there will be many more sequences):
>1 dna:chromosome
ACCTTAGCAGGT
TCGAACCTGTGA
CGAGTCAGTCTA
And the desired output file is a FASTQ file resembling this:
@HWI-ST745_0097:7:1101:1001:1000#0/1
ACCTTAGCAGGT
+HWI-ST745_0097:7:1101:1001:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1002:1000#0/1
TCGAACCTGTGA
+HWI-ST745_0097:7:1101:1002:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1003:1000#0/1
CGAGTCAGTCTA
+HWI-ST745_0097:7:1101:1003:1000#0/1
IIIIIIIIIIII
Importantly for me, each read needs to be 12 bases long, the UMIs need to be read in order (i.e. starting with the first UMI in the FASTA), and the reads must not overlap. In other words, every UMI is read exactly once and each read in the FASTQ mirrors one sequence in the FASTA. Basically, I need a version of the FASTA input file with Phred quality scores. I'm relatively new to bioinformatics and NGS, and so far I've been looking at two programs to achieve this: ArtificialFastqGenerator (artfastqgen) and ART.
My artfastqgen input looks like this:
java -jar ArtificialFastqGenerator.jar -O output.fastq -R input.fasta -S ">1 dna:chromosome" -RL 12 -TLM 12 -GCC False
The (partial) output is as below:
@HWI-ST745_0097:7:1101:1001:1000#0/1
ACCTTAGCAGGT
+HWI-ST745_0097:7:1101:1001:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1002:1000#0/1
CCTTAGCAGGTT
+HWI-ST745_0097:7:1101:1002:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1003:1000#0/1
CTTAGCAGGTTC
This is very close to what I actually need. The only problem is that each read starts one base after the previous one, whereas I need the start to advance by 12 bases (i.e. to the start of the next UMI in the input FASTA).
My ART input is this:
art_illumina -ss HS25 -i input.fasta -l 12 -f 1 -o output_dat
Example output varies with each run and is like this:
@1-ACC3
TAGCAGGTTCGA
+
CCCCCGGGGGGG
@1-ACC2
CAGGTTCGAACC
+
CCCBCGGGEGGG
@1-ACC1
CGTCACAGGTTC
+
BBBCCGBGGGGG
The issue here is that, unlike artfastqgen, which allows you to specify the start of the read, ART chooses a random point in the FASTA as the starting point. As with artfastqgen, there is unwanted overlap between the read sequences.
Can anyone here offer any suggestions on how to achieve this? Ideally using artfastqgen, but any other Linux-based program would also be fine. Thank you.
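For reference, the conversion I am after can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions, not a replacement for a proper read simulator: the function names (`chunk`, `fasta_to_umis`, `umis_to_fastq`, `convert`) are made up for this example, the read names simply reuse the header pattern from my desired output with the x coordinate incremented per read, and every base gets the same quality character `I` (Phred+33 score 40). The sequence lines of each FASTA record are joined and then cut into consecutive 12-base pieces, so the step between reads is 12 rather than 1.

```python
UMI_LEN = 12      # read length required in the question
QUAL_CHAR = "I"   # constant quality, as in the desired output (Phred+33 = 40)
# Header pattern copied from the desired output; incrementing the x
# coordinate per read is an assumption about how the names should vary.
HEADER = "HWI-ST745_0097:7:1101:{x}:1000#0/1"


def chunk(seq, size):
    """Yield consecutive, non-overlapping pieces of `size` bases.

    A trailing piece shorter than `size` is dropped.
    """
    for start in range(0, len(seq) - size + 1, size):
        yield seq[start:start + size]


def fasta_to_umis(lines, umi_len=UMI_LEN):
    """Yield UMIs in file order from an iterable of FASTA lines."""
    seq = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: emit the UMIs collected for the previous one.
            yield from chunk("".join(seq), umi_len)
            seq = []
        else:
            seq.append(line)
    yield from chunk("".join(seq), umi_len)


def umis_to_fastq(umis, first_x=1001):
    """Format each UMI as a four-line FASTQ record and return one string."""
    records = []
    for i, umi in enumerate(umis):
        name = HEADER.format(x=first_x + i)
        records.append(f"@{name}\n{umi}\n+{name}\n{QUAL_CHAR * len(umi)}\n")
    return "".join(records)


def convert(fasta_path, fastq_path):
    """Read a FASTA file and write the corresponding FASTQ file."""
    with open(fasta_path) as fin, open(fastq_path, "w") as fout:
        fout.write(umis_to_fastq(fasta_to_umis(fin)))
```

Calling `convert("input.fasta", "output.fastq")` would write one record per UMI in input order. Unlike ART, this does not model sequencing errors or realistic quality profiles, so it only covers the "FASTA plus fixed Phred scores" case.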