What do you do after obtaining sorted bam files?
1
0
Entering edit mode
3.3 years ago
DNAngel ▴ 240

I am stuck at the stage where after aligning my single-end illumina reads using bwa mem to a reference file which consists of a single chromosome (I am aligning each time to a single chromosome due to time limits on my server), I cannot find complete tutorials that explain how to extract contigs/scaffolds from bam files in order to annotate. I've read similar posts here but they all seem to implement a de novo strategy (like velvetg which produces a nice file of contigs) or use annotation programs meant for mammalian genomes or bacteria - the species I work with are fish.

My code thus far (after running FastQC and trimmomatic to clean my fastq reads):

bwa index ref_chr1.fa
bwa mem -B 2 -M -t 40 ref_chr1.fa sp1_trimmed.fastq > sp1_trimmed.sam
samtools view -Sb -@ 40 sp1_trimmed.sam > sp1_trimmed.bam
samtools sort -@ 40 sp1_trimmed.bam > sp1_trimmed_sorted.bam
samtools index sp1_trimmed_sorted.bam
samtools mpileup -s -a -f ref_chr1.fa sp1_trimmed_sorted.bam -o sp1_trimmed_sorted.mpileup


HERE is where I am stuck. Do I even need the mpileup step? Can I just use my sorted and indexed bam files to somehow create scaffolds (no idea how to do that).

I have my species reference file in .gff format as well, as I just want to obtain all the CDS from my species so I can run various tests on each gene which I need separate so I can create MSAs for each (after extracting the gene for about 30 species).

Please any help is much appreciated!

bwa bam • 1.5k views
0
Entering edit mode

how to extract contigs/scaffolds from bam files in order to annotate

what does it mean ? what do you want to annotate ?

0
Entering edit mode

I guess this is the part I am just not sure what to do - I have my sorted, indexed bam files and from an old tutorial where I used velvet for de novo, we had to end up with contig files that can be annotated. I don't know how to get to that step using my bam files after using bwa mem to assemble my reads. From what I've read the bam files just have assembled reads (contigs?) which I should assemble into scaffolds so I can then annotate and extract all protein-coding genes (I have a reference file with CDS sequences as well as the reference genome in .gff format).

I tried producing mpileup files to obtain consensus sequences but that failed - there were a huge amount of gaps in the data and I think it is because my data is exome sequencing and I was mapping to genomic so a lot of the intergenic regions are empty.

1
Entering edit mode
3.3 years ago

I am aligning each time to a single chromosome due to time limits on my server

Don't do this. If you must split your job into chunks, split the fastq, not the reference.

I'm not totally sure that aligning to the reference is what you want to do if you are trying to figure out de novo what the transcripts are.

0
Entering edit mode

This did not work on my server because of time limits so I am trying a different approach where I am using velvet for de novo assembly, and then I was going to align the contigs back to the reference genome using bwa but on a different server (which has way less space but no time limits at least). However if that part still does work at least I have the contigs which I can work with.

So far it is running but I did not realize it is still difficult to understand what program to use to aseemble the contigs and obtain protein-coding genes for vertebrate genomes...any suggestions?

0
Entering edit mode

And you really can't get even a small fastq aligned to the whole genome?

You are trying to assemble an entire vertebrate genome with velvet, but you can't align to a whole genome?

Are you sure that aligning to the genome helps the annotation process?

0
Entering edit mode

I think some of my variables in my script were slowing down the bwa alignment because it was running samtools simultaenously afterwards and indexing files - but it does work now if I break it up and just have it stop after bwa alignment. I thought it was a size issue as well but turns out it was just running too many steps that I could have split a part easily - I only just realized this after reading your comment lol.

With that said, this is great news for me! I can do de novo with velvet and reference assembly and compare both. I do not know at all if aligning to the genome helps or not I have never done this before and all the tutorials I've read that use small example files all say to do both methods to see if it does help for individual cases. And with velvet it just produces a file with contigs, but those contigs are not assembled or aligned in any way for me to extract genes. I would still have to assemble those contigs into scaffolds which I don't know how to do.