Question

constructing de novo assembly of plant short reads

0

4 months ago

analyst ▴ 70

Dear all,

I have Illumina paired-end short-read data from a plant for which no reference genome is available. I have to build the assembly from short reads only, as I do not have long reads. The plant has tetraploid and diploid varieties. Using KmerGenie, I identified the best k-mer sizes as 37 and 27, respectively. I ran it without the --diploid parameter because it was omitting some information. Please note that read length is 80-127 bp, sequencing depth is 14.9X, and total sequences are 4-5 million. There are 40 paired-end samples in total: 20 are diploid varieties and 20 are tetraploid varieties. Do I need to construct two genomes, one per ploidy level?

Please tell me whether I have used the correct approach. Also, which short-read assemblers should I use for plant data?

I tried ABySS on multiple samples together, using a command such as:

abyss-pe np=60 k=37 name=FA2 B=1G in='FA2_18_1.fastq.gz FA2_18_2.fastq.gz FA2_19_1.fastq.gz FA2_19_2.fastq.gz FA2_20_1.fastq.gz FA2_20_2.fastq.gz FA2_21_1.fastq.gz FA2_21_2.fastq.gz FA2_22_1.fastq.gz FA2_22_2.fastq.gz'

Can you please confirm whether I can assemble multiple samples together?

Best regards,

Bushra

short reads assembly plants • 944 views

0

The ABySS output file scaffolds.fa contains only short, k-mer-sized scaffolds. Here are the first few lines of scaffolds.fa:

>0 37 255 read:bn,LH00330:195:227VG7LT4:6:1103:0:451604/1
CCAGAGCATCTACTAGCAACGGAGAGCATGCAAGATC
>2 37 255 read:bn,LH00330:195:227VG7LT4:6:1103:0:451604/1
AGAGCATCTACTAGCAACGGAGAGCATGCAAGATCAC
>4 37 255 read:bn,LH00330:195:227VG7LT4:6:1103:0:451604/1
AGCATCTACTAGCAACGGAGAGCATGCAAGATCACAA
>8 37 255 read:bn,LH00330:195:227VG7LT4:6:1103:0:451604/1
TCTACTAGCAACGGAGAGCATGCAAGATCACAAATAA
>9 37 255 read:bn,LH00330:195:227VG7LT4:6:1103:0:451604/1
CTACTAGCAACGGAGAGCATGCAAGATCACAAATAAC
>11 37 255 read:bn,LH00330:195:227VG7LT4:6:1103:0:451604/1
ACTAGCAACGGAGAGCATGCAAGATCACAAATAACAT
>17 37 255 read:bn,LH00330:195:227VG7LT4:6:1103:0:451604/1
AACGGAGAGCATGCAAGATCACAAATAACATATGATA
>21 37 255 read:bn,LH00330:195:227VG7LT4:6:1103:0:451604/1
GAGAGCATGCAAGATCACAAATAACATATGATAAATA
>27 37 255 read:bn,LH00330:195:227VG7LT4:6:1103:0:451604/1
ATGCAAGATCACAAATAACATATGATAAATAAATAAT
>31 37 255 read:bn,LH00330:195:227VG7LT4:6:1103:0:451604/1
AAGATCACAAATAACATATGATAAATAAATAATTGAT
>32 37 255 read:bn,LH00330:195:227VG7LT4:6:1103:0:451604/1
AGATCACAAATAACATATGATAAATAAATAATTGATC
>36 37 255 read:bn,LH00330:195:227VG7LT4:6:1117:0:5202810/1
CCATCGAGGTATCCCCTACGACCAACTCCAAATATAG

My concern is why the scaffolds are so short, exactly the k-mer size of 37. Doesn't ABySS combine k-mers into contigs and then into scaffolds?

ADD REPLY • link 4 months ago by analyst ▴ 70

score 2 · Answer 1 · 2025-01-21

2

4 months ago

Dave Carlson ★ 2.1k

You could try SPAdes or MaSuRCA, but between tetraploidy and the general tendency of (most) plants to have highly repetitive genomes, a short-read-only assembly is very likely to be extremely fragmented and not as useful as you'll want it to be.

ADD COMMENT • link 4 months ago by Dave Carlson ★ 2.1k

0

Thank you so much, Dave. Isn't SPAdes designed particularly for bacterial genomes?

What do you think about ABySS, Velvet, etc.?

Thanks a lot

ADD REPLY • link 4 months ago by analyst ▴ 70

1

I think SPAdes was originally designed for bacterial genomes, but it can be used for eukaryotes as well, as far as I recall. That said, I've only ever used it to assemble bacterial genomes.

ABySS is another good option. I don't think Velvet has been updated in quite some time, so I would not be surprised if its performance is not as good on modern hardware compared to other tools, but that's just speculation.

Overall, though, whichever tool you use, I fear that your genome assembly will be less than satisfactory due to the reasons mentioned previously.

ADD REPLY • link 4 months ago by Dave Carlson ★ 2.1k

0

Thanks for your valuable input!

ADD REPLY • link 4 months ago by analyst ▴ 70

0

I am going to give both suggested tools a try. Thank you, Dave.

Can I assemble paired-end reads from multiple samples into one assembly using these tools, or will I have to generate an assembly for each sample separately?

Thank you for your help!

ADD REPLY • link 4 months ago by analyst ▴ 70

1

If you have multiple samples from the same species, you can just combine the fastq files and generate a single assembly.

ADD REPLY • link 4 months ago by Dave Carlson ★ 2.1k

0

Do you mean merging the fastq files and running the assembly on the merged fastq files?

For example, for MaSuRCA I found the following command, which seems to be for one sample only. How can I adjust it for paired-end reads from multiple samples to get a single assembly?

/path_to_MaSuRCA/bin/masurca -t 32 -i /path_to/pe_R1.fa,/path_to/pe_R2.fa

ADD REPLY • link 4 months ago by analyst ▴ 70

1

You can just concatenate the various R1 and R2 fastq files before running the assembly. Let's say you have two samples (sample1, sample2), then you could do the following:

cat sample1_R1.fastq sample2_R1.fastq > combined_R1.fastq
cat sample1_R2.fastq sample2_R2.fastq > combined_R2.fastq

And then run the assembly using the combined_R1.fastq and combined_R2.fastq files.
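
For gzipped files like the ones in the question, the same idea can be sketched as follows. Gzip files can be concatenated directly with cat and the result is still a valid gzip file, so there is no need to decompress first. The function name combine_pairs and the <sample>_1.fastq.gz / <sample>_2.fastq.gz naming pattern are assumptions for illustration only. The key point is that R1 and R2 are concatenated in the same sample order so that mates stay in sync.

```shell
#!/usr/bin/env bash
# Sketch: combine paired-end files from several samples into one R1 and one R2 file.
# Assumed (hypothetical) naming: <sample>_1.fastq.gz and <sample>_2.fastq.gz.

combine_pairs() {
    local out_prefix=$1
    shift
    local sample

    # Check every mate file first, so no partial output is written on error.
    for sample in "$@"; do
        if [ ! -f "${sample}_1.fastq.gz" ] || [ ! -f "${sample}_2.fastq.gz" ]; then
            echo "missing mate file for ${sample}" >&2
            return 1
        fi
    done

    # Start from empty outputs so a rerun does not append to old results.
    : > "${out_prefix}_1.fastq.gz"
    : > "${out_prefix}_2.fastq.gz"

    # Append R1 and R2 in the same sample order to keep read pairs aligned.
    for sample in "$@"; do
        cat "${sample}_1.fastq.gz" >> "${out_prefix}_1.fastq.gz"
        cat "${sample}_2.fastq.gz" >> "${out_prefix}_2.fastq.gz"
    done
}

# Example call (sample names taken from the question):
# combine_pairs combined FA2_18 FA2_19 FA2_20
```

After combining, it is worth confirming that both output files contain the same number of reads before passing them to the assembler.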