Question: Generating artificial NGS reads from a FASTA file specifying read length, coverage, and start sequence.
0
gravatar for Hamish
17 months ago by
Hamish10
Hamish10 wrote:

I am looking to generate artificial single-end Illumina sequencing data from a FASTA containing a number of short sequences representing unique molecular identifiers. However, I've found limited information on this online and need some help. The start of the input FASTA resembles this (in practice there will be many more sequences):

>1 dna:chromosome
ACCTTAGCAGGT
TCGAACCTGTGA
CGAGTCAGTCTA

And the desired output file is a FASTQ file resembling this:

@HWI-ST745_0097:7:1101:1001:1000#0/1
ACCTTAGCAGGT
+HWI-ST745_0097:7:1101:1001:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1002:1000#0/1
TCGAACCTGTGA
+HWI-ST745_0097:7:1101:1002:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1003:1000#0/1
CGAGTCAGTCTA
IIIIIIIIIIII

Importantly for me, each UMI sequence read needs to be 12 bases long, sequenced in order (i.e. starting with the first UMI in the FASTA) and without overlap, i.e. every UMI is read once and each read in the FASTQ is a mirror of the sequence in the FASTA. Basically, I need a version of the FASTA input file with Phred quality scores. I'm relatively new to bioinformatics and NGS and so far I've been looking at 2 programs to achieve this: Artificial fastq generator and ART.

My artfastgen input looks like this:

java -jar ArtificialFastqGenerator.jar -O output.fastq -R input.fasta -S ">1 dna:chromosome" -RL 12 -TLM 12 -GCC False

The (partial) output is as below:

@HWI-ST745_0097:7:1101:1001:1000#0/1
ACCTTAGCAGGT
+HWI-ST745_0097:7:1101:1001:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1002:1000#0/1
CCTTAGCAGGTT
+HWI-ST745_0097:7:1101:1002:1000#0/1
IIIIIIIIIIII
@HWI-ST745_0097:7:1101:1003:1000#0/1
CTTAGCAGGTTC

This is very close to what I actually need. The only problem is that each read shifts across one base at a time, whereas I need it to shift 12 (or to the start of the next UMI in the input FASTA).

My ART input is this:

art_illumina -ss HS25 -i input.fasta -l 12 -f 1 -o output_dat

Example output varies with each run and is like this:

@1-ACC3
TAGCAGGTTCGA
+
CCCCCGGGGGGG
@1-ACC2
CAGGTTCGAACC
+
CCCBCGGGEGGG
@1-ACC1
CGTCACAGGTTC
+
BBBCCGBGGGGG

The issue here is that, unlike artfastqgen which allows you to specify the start of the read, ART chooses a random point in the FASTA as the starting point. Like artfastqgen, there is unwanted overlap in the read sequences.

Can anyone hear offer any suggestions on how to achieve this? Ideally using artfastqgen, but any other Linux-based program would also be fine. Thank you.

sequencing next-gen • 1.0k views
ADD COMMENTlink modified 17 months ago by Gabriel R.2.7k • written 17 months ago by Hamish10
1

I don't believe that there are any fastq file simulation programs available that will generate UMI's along with the reads at this time.

ADD REPLYlink written 17 months ago by genomax87k
1

Though I generally second genomax opinion, I would like to add a very comprehensive Nature Genetics Review on the simulation topic:

Escalona et al, 2018: A comparison of tools for the simulation of genomic next-generation sequencing data

ADD REPLYlink modified 17 months ago • written 17 months ago by Carambakaracho2.2k

Since posting I've found a way to achieve this using ART, but this looks like a very relevant and helpful review. Thanks for the input.

ADD REPLYlink written 17 months ago by Hamish10
2
gravatar for Gabriel R.
17 months ago by
Gabriel R.2.7k
Danmarks Tekniske Universitet
Gabriel R.2.7k wrote:

I would use the approach we used for gargammel (https://grenaud.github.io/gargammel/). I would use the following options:

art_illumina  -amp -na   -c 1

so -amp says that you do an "amplicon sequencing simulation", in your case, your amplicons are your starting sequences. -na means "do not output ALN alignment file" because we do not need it and -c 1 to generate 1 reads / input sequence.

hope this helps

ADD COMMENTlink written 17 months ago by Gabriel R.2.7k
1

Thanks for the advice and apologies for my delayed reply. I modified my input command per your suggestion to art_illumina -i input.fasta -na -amp -l 12 -c 1 -o outputx_dat. After some additional troubleshooting, I realised there was a problem with the format of the fasta file I was using. It now has multiple headers dividing each sequence to be read, i.e.

>1
ACCTTAGCAGGT
>2
TCGAACCTGTGA
>3
CGAGTCAGTCTA

ART now processes each read once and in order. Thanks again for the input.

ADD REPLYlink written 17 months ago by Hamish10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1017 users visited in the last hour