Question: Sequence reads and complete assembly
jeetsahu0 wrote:

Could someone point me to FASTQ short sequence reads and the corresponding assembled FASTA file, for learning assembly from sequencing reads? The data can be anything from human to insect or plant. I am specifically not looking for huge data. Thanks

sequencing sequence assembly • 220 views
ADD COMMENTlink modified 24 days ago by oigl60 • written 25 days ago by jeetsahu0

I am specifically not looking for huge data.

Then you are probably looking for bacterial genomes.

ADD REPLYlink written 25 days ago by WouterDeCoster34k

Could you please provide me with a link to such a data set?

ADD REPLYlink written 25 days ago by jeetsahu0

The next question is whether you are looking for short Illumina reads or long PacBio/Nanopore reads...

ADD REPLYlink written 25 days ago by WouterDeCoster34k

I have gone through this course https://genomics.sschmeier.com/index.html

I used their data to implement the workflow. Now I want some other data sets to start with the assembly.

I am looking for short reads, around 150 bp long, and their corresponding FASTA file, so that I can compare the two FASTA files (the one provided and the one I assemble myself from the sequencing reads).

ADD REPLYlink written 25 days ago by jeetsahu0

Yes, I am looking for short Illumina reads.

ADD REPLYlink written 25 days ago by jeetsahu0

If you download (or plan to use) a certain piece of software, there are usually some test datasets provided with it, to test and try out the software.

ADD REPLYlink written 25 days ago by lieven.sterck3.1k

I have gone through this course https://genomics.sschmeier.com/index.html

I used their data to implement the workflow. Now I want some other data sets to start with the assembly.

ADD REPLYlink written 25 days ago by jeetsahu0

OK, so does that mean you're going for SPAdes?

If you want to get other data, go and have a look at SRA (NCBI) or ENA (EBI); they have usable interfaces to query for the data you want.

EDIT: ah, you want the end result as well. Then you had better first query for a genome assembly submission and then link through to the actual data associated with it.
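
For readers who prefer a script over the web interface, here is a minimal sketch of how a query for the FASTQ locations of one run could be built. It assumes ENA's Portal API "filereport" endpoint lives at the address below and accepts these parameter names; the function name is made up for illustration, so check the ENA documentation before relying on it.

```python
from urllib.parse import urlencode

# Assumed address of the ENA Portal API "filereport" endpoint.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_fastq_report_url(run_accession: str) -> str:
    """Build a URL asking ENA where the FASTQ files of one run are stored.

    Hypothetical helper: the parameter names are assumptions about the API.
    """
    params = {
        "accession": run_accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"


if __name__ == "__main__":
    # Example run accession; replace it with the run you are interested in.
    print(ena_fastq_report_url("SRR3528286"))
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return a small tab-separated table with the FASTQ file locations, if the endpoint behaves as assumed.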

ADD REPLYlink modified 25 days ago • written 25 days ago by lieven.sterck3.1k

Or alternatively, look for a publication about "de novo assembly of the...", which should contain links or accession ids for raw data and the assembled sequence

ADD REPLYlink written 25 days ago by WouterDeCoster34k

Guys, I am completely new to this field. I would really appreciate it if you could point me to some bacterial genome sequencing data and its corresponding assembly. Thanks!

ADD REPLYlink written 24 days ago by jeetsahu0
piet1.6k wrote:

Only a few people submit their reads as well as their assemblies. SAMN04994921 is a nice example where both a set of Illumina reads and a set of 25 contigs are available.

https://www.ncbi.nlm.nih.gov/biosample/SAMN04994921

https://www.ncbi.nlm.nih.gov/sra/SRR3528286

https://www.ncbi.nlm.nih.gov/Traces/wgs/?val=LXWH01#contigs

ADD COMMENTlink written 25 days ago by piet1.6k

Thanks, I will look into it.

ADD REPLYlink written 23 days ago by jeetsahu0

There are 41 contigs in this file. ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/580/525/GCF_003580525.1_ASM358052v1/GCF_003580525.1_ASM358052v1_genomic.fna.gz

What does it mean? As I understand it, we create one FASTA file containing the full genome from paired-end reads. But the above file has 41 contigs. Correct me if I am wrong.

ADD REPLYlink written 22 days ago by jeetsahu0

yes, so?

This simply means they were able to assemble the genome into 41 contigs. What kind of result/data had you hoped for?

ADD REPLYlink written 22 days ago by lieven.sterck3.1k

I was expecting to get a scaffold from the sequencing reads. I am using SPAdes for assembly. I fetched the sequencing reads SRR3528286_1.fastq and SRR3528286_2.fastq and ran SPAdes on these two read files. This gave me one scaffolds.fasta file, which is just a single string of bases intermittently containing N's. Now I want to compare this FASTA file with the one assembled in the project SAMN04994921. But since that FASTA file has 41 contigs, I cannot compare them.
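
A run like the one described above might be launched roughly as follows. This is only a sketch: it assumes spades.py is on the PATH, the output directory name is made up, and the helper functions are hypothetical. As far as I know, SPAdes writes both contigs.fasta and scaffolds.fasta into its output directory, so the contigs should be available without any extra option.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple


def spades_command(reads_1: str, reads_2: str, out_dir: str) -> List[str]:
    """Build a paired-end SPAdes command line (-1/-2 for the mates, -o for output)."""
    return ["spades.py", "-1", reads_1, "-2", reads_2, "-o", out_dir]


def run_spades(reads_1: str, reads_2: str, out_dir: str) -> Tuple[Path, Path]:
    """Run SPAdes and return the expected contig and scaffold FASTA paths."""
    subprocess.run(spades_command(reads_1, reads_2, out_dir), check=True)
    out = Path(out_dir)
    # File names as SPAdes normally writes them (verify in your own output directory).
    return out / "contigs.fasta", out / "scaffolds.fasta"
```

For example, run_spades("SRR3528286_1.fastq", "SRR3528286_2.fastq", "spades_out") would start the assembly and hand back the two paths to inspect afterwards.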

ADD REPLYlink written 22 days ago by jeetsahu0

Ah, ok I see.

Yes, that can happen: you're not required to submit your scaffolds; the contigs are the minimum requirement. Is there any option to run SPAdes only up to the contig stage (omitting the scaffolding step)? I also can hardly imagine that SPAdes would output a single scaffold for this assembly.

Is there really only a single sequence in the SPAdes output (= weird), or is it just a single FASTA file (= to be expected)?

ADD REPLYlink written 22 days ago by lieven.sterck3.1k

I will explore whether SPAdes has a contig option. By default SPAdes gives just a single FASTA file as output. Do you know of any project that has submitted both a FASTA file and sequencing reads to NCBI?

ADD REPLYlink written 22 days ago by jeetsahu0

I assume the one piet mentioned is one like that?

How many sequences are there in the SPAdes output file? ( grep -c '>' <fasta-file> )
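
The same count can be obtained in Python, which also handles gzipped files such as the .fna.gz download from NCBI. A minimal sketch (the function name is just for illustration); unlike the grep above, it only counts lines that start with '>':

```python
import gzip


def count_fasta_records(path: str) -> int:
    """Count FASTA records by counting header lines (lines starting with '>')."""
    # Pick a gzip-aware opener when the file name ends in .gz.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Calling count_fasta_records("scaffolds.fasta") should give the same number as the grep command, provided no sequence or description line contains a stray '>'.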

ADD REPLYlink modified 22 days ago • written 22 days ago by lieven.sterck3.1k

I am using the same one mentioned by piet.

There are 257 sequences in the output file.

ADD REPLYlink written 22 days ago by jeetsahu0

The given fasta file with 41 contigs is 29,305bp shorter than the one obtained by running SPAdes.
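
A length comparison like this one can be reproduced with a few lines of Python. The sketch below reports the number of sequences, the total length, and N50 for a FASTA file. N50 is not mentioned in this thread; it is added here only as a commonly reported contiguity statistic, and the function names are made up for illustration.

```python
import gzip
from typing import Dict, Iterator, List


def fasta_lengths(path: str) -> Iterator[int]:
    """Yield the length of every sequence in a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    length = None  # None until the first header line has been seen
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line.strip())
    if length is not None:
        yield length


def n50(lengths: List[int]) -> int:
    """Length of the sequence at which the running total, longest first, reaches half."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def summarize(path: str) -> Dict[str, int]:
    """Collect simple size statistics for one assembly file."""
    lengths = list(fasta_lengths(path))
    return {"sequences": len(lengths), "total_bp": sum(lengths), "n50": n50(lengths)}
```

Running summarize() on both scaffolds.fasta and the downloaded GCF_003580525.1_ASM358052v1_genomic.fna.gz would put the two assemblies side by side. These numbers only describe size and contiguity; they say nothing about whether the sequence content agrees.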

ADD REPLYlink written 22 days ago by jeetsahu0

That is very well possible. If you use a different assembler you will get a different result. Many other things might also be in play: they might have filtered out some contigs, used different parameters, filtered the input data, ... To more or less mimic what they have done, you should read up on their methods and apply those as well (except for the assembler software, that is).

Personally I would also not simply compare the assemblies on length but rather on 'content' (== compare the actual sequence itself) . There is software around that can do that.

ADD REPLYlink written 22 days ago by lieven.sterck3.1k

How can one make sure that the assembly produced by an assembler is the correct one? Two different assemblers can produce different assemblies. Are there any criteria for deciding whether two different assemblies are similar?

ADD REPLYlink written 22 days ago by jeetsahu0

nice topic for a new thread

and if you figure out the answer, let us know as this is probably the million dollar question in the assembly field ;)

ADD REPLYlink written 21 days ago by lieven.sterck3.1k

Hahaha... It's still an open question then.

You mentioned software to compare assemblies. Which software is that?

ADD REPLYlink written 21 days ago by jeetsahu0

QUAST. You could also use BUSCO.
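
As a rough illustration, a QUAST comparison of your own assembly against the submitted one could be started like this. It is a sketch that assumes quast.py is on the PATH; the file names, output directory, and helper function are placeholders, and quast.py --help is the place to confirm the options.

```python
import subprocess
from typing import List


def quast_command(assembly: str, reference: str, out_dir: str) -> List[str]:
    """Build a QUAST command: the assembly to evaluate, -r reference, -o output dir."""
    return ["quast.py", assembly, "-r", reference, "-o", out_dir]


def run_quast(assembly: str, reference: str, out_dir: str) -> None:
    """Run QUAST and raise an error if it exits with a non-zero status."""
    subprocess.run(quast_command(assembly, reference, out_dir), check=True)
```

For example, run_quast("scaffolds.fasta", "reference.fna", "quast_out") would treat the submitted assembly as the reference and report how the new assembly aligns to it.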

ADD REPLYlink written 21 days ago by genomax58k

There are 47 tools for assembly evaluation in OMICtools.
Also, an interesting one that compares the LTRs reconstructed by different assemblers: Assessing genome assembly quality using the LTR Assembly Index (LAI)

ADD REPLYlink modified 21 days ago • written 21 days ago by Allen Kao50
oigl60 wrote:

For studying purposes you can try these UGENE NGS tutorials: https://goo.gl/4Kspho or https://goo.gl/cxHCAU.

ADD COMMENTlink written 24 days ago by oigl60