Question

How many reads for WGS Sequencing?

0

4 months ago

RT • 0

Hello,

I am new to bioinformatics and have a microbiology background. I am trying to reproduce the data from the paper "Whole-genome sequencing for analysis of an outbreak of meticillin-resistant Staphylococcus aureus: a descriptive study".

I downloaded a few reads (2 x 131 bp):

ERR072246_1
ERR072246_2

I am working on these files. I don't understand whether these reads cover the complete genome, or whether there are other reads that belong to this particular sample.

My understanding is that _1 and _2 together = the entire genome of that particular bacterium from a single sample.

How do I go about my assembly now? I think I am missing reads, since the sequence length is only 500 kbp whereas S. aureus should be 2.7 Mbp.

Thanks.

WGS Bacterial-Genomics • 1.3k views

ADD COMMENT • link 4 months ago by RT • 0

0

OK. Problem solved. The problem was on my end. Thank you both for helping!

ADD REPLY • link 4 months ago by RT • 0

2

4 months ago

Mensur Dlakic ★ 28k

It takes 29 seconds to assemble this genome (20 CPUs) with the following statistics:

135 contigs, total 2821177 bp, min 200 bp, max 404505 bp, avg 20897 bp, N50 109762 bp

After removing contigs < 2000 bp, it ends up with 58 contigs and 2788979 bp. That seems to be exactly as expected, so I think something in your procedure wasn't done right.
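
In case it helps to see where numbers like these come from, here is a minimal sketch of the summary statistics. The function name and the toy lengths are made up for illustration; this is not MEGAHIT's own code. The average is floored, which is consistent with 2821177 / 135 being reported as 20897.

```python
def assembly_stats(lengths, min_len=0):
    """Summarize contig lengths, ignoring contigs shorter than min_len."""
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    if not kept:
        raise ValueError("no contigs left after filtering")
    total = sum(kept)
    # N50: walk from the longest contig down until at least half of the
    # total assembly length is covered; the contig reached is the N50.
    running = 0
    n50 = 0
    for n in kept:
        running += n
        if running * 2 >= total:
            n50 = n
            break
    return {
        "contigs": len(kept),
        "total": total,
        "min": kept[-1],
        "max": kept[0],
        "avg": total // len(kept),  # floored average
        "n50": n50,
    }


# Toy lengths, invented for illustration (not the ERR072246 contigs).
toy = [100, 200, 300, 400, 500]
print(assembly_stats(toy))
print(assembly_stats(toy, min_len=300))
```

Passing min_len=2000 with the real contig lengths would correspond to the "after removing contigs < 2000 bp" step.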

If you want to reproduce what I did, go to this website:

https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=ERR072246&display=metadata

and make sure to download the dataset properly as clipped FASTQ from the far right tab. Then this command will do the trick:

megahit --12 ERR072246.fastq.gz -o ERR072246 --out-prefix ERR072246 -t 20

To download MEGAHIT:

https://github.com/voutcn/megahit

ADD COMMENT • link 4 months ago by Mensur Dlakic ★ 28k

0

What does clipped FASTQ mean? Are both forward and reverse reads in the same file?
Why would you remove contigs shorter than 2000 bp? I feel they would give more information, so why discard them? I'll try to use the exact same thing and see how it goes.

ADD REPLY • link 4 months ago by RT • 0

1

I think you might be getting stuck on less relevant parts of my exercise. The most important point was that nothing is wrong with the data.

Clipped fastq means that the adapters have been removed. Yes, both forward and reverse reads will be interleaved in the same file if you download them the way I suggested.

It is common to remove really small contigs, though you may want to lower the threshold to 1000 bp since this is a single-genome assembly. There isn't going to be much information in the smallest contigs (200 bp) because those contigs can't have even a single complete gene.
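
A length cutoff like this can be applied with a few lines of Python. This is only a sketch: the function names are hypothetical, and the 1000 bp default is just the threshold mentioned above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines, min_len=1000):
    """Keep only contigs whose sequence is at least min_len bases long."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_len]


# Toy input, invented for illustration: one 4 bp and one 1200 bp contig.
toy = [">short", "ACGT", ">long", "ACGT" * 300]
kept = filter_contigs(toy, min_len=1000)
print([header for header, _ in kept])  # ['long']
```

For a real assembly you would pass an open file handle instead of the toy list. With the MEGAHIT command above the contigs should be at ERR072246/ERR072246.contigs.fa, but treat that path as an assumption and check your output folder.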

ADD REPLY • link 4 months ago by Mensur Dlakic ★ 28k

0

Got it! Even if I'm doing a paired-end assembly with both files, do I have to set the threshold to 1000 bp?

ADD REPLY • link 4 months ago by RT • 0

0

What does clipped Fastq mean?

It probably means that NCBI has already scanned for and trimmed adapter sequences.

both forward and reverse reads in the same file?

No, that is not what "clipped" means. A file that holds both forward and reverse reads is called "interleaved".
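
To make the interleaved layout concrete: the records alternate forward, reverse, forward, reverse. Here is a minimal sketch that splits such a file back into two sets. It assumes plain 4-line FASTQ records with no line wrapping, and the function names are made up.

```python
from itertools import islice


def fastq_records(lines):
    """Yield FASTQ records as 4-tuples (header, sequence, plus, quality)."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def deinterleave(lines):
    """Split interleaved records into (forward, reverse) lists."""
    forward, reverse = [], []
    for i, record in enumerate(fastq_records(lines)):
        # Even-numbered records are forward reads, odd-numbered are reverse.
        (forward if i % 2 == 0 else reverse).append(record)
    if len(forward) != len(reverse):
        raise ValueError("odd number of records; not cleanly interleaved")
    return forward, reverse


# Toy interleaved input with two read pairs (invented for illustration).
toy = [
    "@r1/1", "ACGT", "+", "IIII",
    "@r1/2", "TTGA", "+", "IIII",
    "@r2/1", "GGCC", "+", "IIII",
    "@r2/2", "CCAA", "+", "IIII",
]
fwd, rev = deinterleave(toy)
print([r[0] for r in fwd])
print([r[0] for r in rev])
```

Assemblers that accept interleaved input (MEGAHIT's --12, for example) do this pairing internally, so splitting is only needed for tools that insist on two files.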

ADD REPLY • link 4 months ago by GenoMax 145k

0

Also, do you think I can reproduce at least most of the data from the paper on just my laptop? It has 4 logical processors (Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz 2.90 GHz) and 8 GB RAM.

I was asked to work on an MRSA bioinformatics project.

ADD REPLY • link 4 months ago by RT • 0

1

The only way to find out is to try. It may work, but if it is not going to, you will find out quickly (the process would likely crash for lack of memory, since 8 GB may not be enough).

ADD REPLY • link 4 months ago by GenoMax 145k

1

If you have Linux the assembly should work on your system, but 8 GB is generally not enough for assembling larger genomes.

ADD REPLY • link 4 months ago by Mensur Dlakic ★ 28k

score 2 · Accepted Answer · 2024-04-27

2

4 months ago

GenoMax 145k

Did you download the complete dataset available from ENA/NCBI SRA? This is an older dataset (from 2012) with a total of 1146212 reads and 150153772 bases. It is a paired-end dataset, meaning the library fragments were sequenced from both ends. These reads should still give about 55x coverage of the 2.7 Mbp Staph genome.
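
A quick way to check whether a download is complete is to count reads and bases yourself and compare with the totals above. The sketch below assumes gzipped 4-line FASTQ files; the function names are hypothetical. Coverage is simply total bases divided by genome size, and taking the quoted figures at face value, 150153772 / 1146212 also works out to exactly 131 bases per read.

```python
import gzip


def fastq_totals(path):
    """Return (reads, bases) for a gzipped 4-line-per-record FASTQ file."""
    reads = bases = 0
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence is the second line of each record
                reads += 1
                bases += len(line.strip())
    return reads, bases


def coverage(total_bases, genome_size):
    """Average depth: sequenced bases divided by genome length."""
    return total_bases / genome_size


# Totals quoted above; the 2.7 Mbp genome size is the approximate
# S. aureus figure used in this thread.
print(round(coverage(150153772, 2_700_000), 1))  # 55.6
```

For paired files, add up the totals from fastq_totals("ERR072246_1.fastq.gz") and fastq_totals("ERR072246_2.fastq.gz") before comparing. If the sum falls far short of the expected totals, the download was probably truncated.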

You can use SPAdes (LINK) for assembly.
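
As a rough sketch of what a SPAdes run on the two files could look like, here is a small Python helper that builds the command line. The output folder name, the --isolate mode, and the thread and memory values are my assumptions (picked for a 4-thread, 8 GB machine), so adjust them to your setup.

```python
import shlex


def spades_command(r1, r2, outdir, threads=4, memory_gb=8):
    """Build a SPAdes command for one paired-end isolate library."""
    return [
        "spades.py",
        "--isolate",            # isolate mode (assumed suitable here)
        "-1", r1,               # forward reads
        "-2", r2,               # reverse reads
        "-o", outdir,           # output directory (hypothetical name below)
        "-t", str(threads),     # number of threads
        "-m", str(memory_gb),   # memory limit in GB
    ]


cmd = spades_command(
    "ERR072246_1.fastq.gz", "ERR072246_2.fastq.gz", "ERR072246_spades"
)
print(shlex.join(cmd))
# To actually run it (requires SPAdes on your PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```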

ADD COMMENT • link 4 months ago by GenoMax 145k

0

Well, I downloaded both read files from LINK. That's all I did. That's the only thing I need to do, right? I'll try SPAdes.

ADD REPLY • link 4 months ago by RT • 0

1

I downloaded the data in two files from the link you provided. With this command:

megahit -1 ERR072246_1.fastq.gz -2 ERR072246_2.fastq.gz -o ERR072246 --out-prefix ERR072246 -t 20

it produces the same result as what I did before:

135 contigs, total 2821177 bp, min 200 bp, max 404505 bp, avg 20897 bp, N50 109762 bp

So this is the same dataset except that you downloaded individual reads while in my earlier suggestion they were interleaved. That shouldn't affect the assembly except to give a slightly different command, and indeed it doesn't.

I don't think you need to worry about removing adapters.

ADD REPLY • link 4 months ago by Mensur Dlakic ★ 28k

0

I had downloaded the two files in a similar manner actually.

ADD REPLY • link 4 months ago by RT • 0

0

Also, there is no mention of an adapter sequence, so which adapter sequence should I use if I need to remove adapters from other reads?

ADD REPLY • link 4 months ago by RT • 0

0

You can use the default adapters.fa file included in the resources folder of the BBMap suite (the program to use is bbduk.sh). Alternatively, a program like fastp can identify adapters automatically and trim them.
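
For reference, both options can be sketched as command lines built from Python. The output file names and the path to adapters.fa are placeholders, and the bbduk.sh settings (ktrim=r k=23 mink=11 hdist=1 tpe tbo) are the commonly suggested adapter-trimming values rather than anything specific to this dataset.

```python
import shlex


def fastp_command(r1, r2, out1, out2):
    """fastp with automatic adapter detection for paired-end reads."""
    return [
        "fastp",
        "-i", r1, "-I", r2,          # input forward / reverse reads
        "-o", out1, "-O", out2,      # trimmed output files
        "--detect_adapter_for_pe",   # detect adapters for paired-end data
    ]


def bbduk_command(r1, r2, out1, out2, adapters="adapters.fa"):
    """bbduk.sh adapter trimming against a reference adapter file."""
    return [
        "bbduk.sh",
        f"in1={r1}", f"in2={r2}",
        f"out1={out1}", f"out2={out2}",
        f"ref={adapters}",           # placeholder path to adapters.fa
        "ktrim=r", "k=23", "mink=11", "hdist=1", "tpe", "tbo",
    ]


reads = ("ERR072246_1.fastq.gz", "ERR072246_2.fastq.gz")
trimmed = ("trimmed_1.fastq.gz", "trimmed_2.fastq.gz")  # hypothetical names
print(shlex.join(fastp_command(*reads, *trimmed)))
print(shlex.join(bbduk_command(*reads, *trimmed)))
```

Either printed command can be pasted into a terminal once the corresponding tool is installed.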

ADD REPLY • link 4 months ago by GenoMax 145k

0

OK, I'll try fastp. I have only used Trimmomatic and Cutadapt so far, and they don't identify adapters on their own. The only hint I have is the FastQC graph, which says the adapter is Nextera, and I don't know if I should trust that.

ADD REPLY • link 4 months ago by RT • 0