Question

What to do with mapped reads after assembly?

0

Entering edit mode

4.4 years ago

DNAngel ▴ 250

Hello everyone,

Sorry for posting a simple question but although I have worked with some assembly before, I never worked with data at the whole genome scale and I would like to ask for some advice.

I followed this tutorial on biostairs: Reference Assembly - Mapping Reads To A Reference Genome but I ended up just converting my mapped to fastq (so I did not obtain mpileup files as I wasn't doing any variant calling - although this will likely change once I get my first basic pipeline working).

My question now is...what can I do with my mapped reads? Most tutorials like the one I linked sort of just end off after gathering bam files or converting them into fastq files of mapped reads and then it sort of just stops there.

Looking at other biostars forums looks like the next thing I need to do is to annotate my mapped reads - but how do I do that? I have an annotation.gff file but I really don't understand how do I link my bam files to the annotation.gff file. The coordinates in the bam file should be identical to the .gff file right? Am I supposed to annotate my bam files and append more information into it? And if so, is this something I can do with samtools or something similar?

My end goal right now is to visualize my assembly and get information on how many genes mapped to the ref genome, which genes mapped (i.e. annotations!), and just overall quality of the mapping. But I just don't understand how to get the information from the annotation.gff file onto my mapped reads in my bam file (which btw are all still separate sequences in a read1.fastq and read2.fastq file). Any tips/advice is much appreciated. Thank you for reading!

Assembly annotation • 1.3k views

ADD COMMENT • link 4.4 years ago by DNAngel ▴ 250

0

Entering edit mode

My end goal right now is to visualize my assembly and get information on how many genes mapped to the ref genome, which genes mapped (i.e. annotations!), and just overall quality of the mapping. But I just don't understand how to get the information from the annotation.gff file onto my mapped reads in my bam file

It looks like you just aligned your reads to a reference assembly. You did not actually do an assembly. These are two different workflows.

Assuming you did align your data to a reference you can easily visualize the alignments using IGV (link for user guide).
You don't map genes to reference genome. You map reads (which you have done) and then using annotation file (which defines the gene models/boundaries) you find out which reads fall within those gene boundaries. You use your alignment + annotation in IGV to do this. Since your experiment does not seem to deal with expression you will not able to get any counts/expression. You will be able to see if all of the genome is represented in your data.

That said what was the starting point of this entire exercise. What question are you asking of this data?

ADD REPLY • link 4.4 years ago by GenoMax 152k

0

Entering edit mode

Doing an actual assembly would mean de novo right? So without a reference genome? I also do have a a transcript.fasta.gz file, along with a annotation.gff.gz, a RepeatMasked.gff.gz file and a Repeat.fasta.gz (genome file). The data was given to me to work on although I will specifically be working on unmapped reads later on (trying to figure out what other species were sequenced). Right now, I just want to know what I can do with my mapped reads and the annotation file. I've never worked with transcript data before, so I am assuming since I have this file I should be able to get some expression analysis done but this would literally be my first time even seeing this type of data.

From my understanding, I can load the annotation.gff file and my mapped reads, along with my genome.fasta file into IGV and see the data but just looking at it in IGV is not meaningful. I need to gather stats - like what genes are mapped, and possibly how well they mapped to the genome? I also am trying to understand what else people can do with the annotation file and mapped reads.

I will probably also be asked later on to work with the transcript.fasta.gz file for expression analysis (but I have never worked with this type of data ever so this will be entirely new).

ADD REPLY • link 4.4 years ago by DNAngel ▴ 250