My question is, how can I use cuffdiff when my reference genome is made up of three different sequences just like say three different chromosomes in the fasta file? As cuffdiff gives me an error saying the BAM file is not sorted and lexicographically by chromosome name and by position I am not able to use samtools sort.
Secondly, I think I would need to merge the GTF files? for three different sequences, i.e. E.Coli Genome, GFP and the Plasmid sequence. Can I use cuff merge for this?
The details are explained below.
The sample sent for sequencing had the Ecoli genome, a GFP that was integrated into the Ecoli Genome and there was an additional plasmid sequence. Hence I had made a reference genome fasta sequence file as follows:
Genome_of_Ecoli GFP_Insert Plasmid_sequence
The control samples (that had replicates B29, G29 and R29) were aligned to the reference genome that had the following sequences. The control sample in this case had no plasmid sequence (though other reference samples I am trying to compare differential expression had the third plasmid sequence) Genome_of_Ecoli GFP_Insert
I had aligned my reads against this Fasta file using BWA ( as Ecoli has no alternate splicing). After alignment was done I had piped the SAM output to samtools BAM and then sorted it using samtools as per the command below:
bwa mem -t 13 genome.fa Forward_Read.fq Reverse_Read.fq | samtools view -Shu - | samtools sort - -O ‘bam' -o $out_dir\/$output.bam
I had three biological replicates say B6, G6 and R6 and I wanted to compare the change in the expression with the control samples which had biological replicates say B29, G29 and R29
I gave the following cufflinks command:
cuffdiff -p10 -o diff_expr_test/ ../../genome/dh10b/Escherichia_coli_str_k_12_substr_dh10b.ASM1942v1.31.gtf B6_GGACTCCT-AAGGAGTA.bam, G6_GCTACGCT-AAGGAGTA.bam, R6_TAAGGCGA-AAGGAGTA.bam /data/tellis/analysis/bwa_trim_galore/bwa_tg_dh10b_gfp/B29_CAGAGAGG-AAGGAGTA.bam, /data/tellis/analysis/bwa_trim_galore/bwa_tg_dh10b_gfp/G29_GTAGAGGA-AAGGAGTA.bam, /data/tellis/analysis/bwa_trim_galore/bwa_tg_dh10b_gfp/R29_TCCTGAGC-AAGGAGTA.bam
The Error is get is:
You are using Cufflinks v2.2.1, which is the most recent release. [16:02:15] Loading reference annotation. Warning: No conditions are replicated, switching to 'blind' dispersion method [16:02:15] Inspecting maps and determining fragment length distributions. Error: this SAM file doesn't appear to be correctly sorted! current hit is at Capacity_Monitor_on_pCRIM_lamb:0, last one was at Chromosome_dna:chromosome_chromosome:ASM1942v1:Chromosome:1:4686137:1:4686073 Cufflinks requires that if your file has SQ records in the SAM header that they appear in the same order as the chromosomes names in the alignments. If there are no SQ records in the header, or if the header is missing, the alignments must be sorted lexicographically by chromsome name and by position.
The header for the B6,G6,R6 showed coordinate sorted as shown below:
samtools view -H B6_GGACTCCT-AAGGAGTA.bam @HD VN:1.3 SO:coordinate @SQ SN:Chromosome_dna:chromosome_chromosome:ASM1942v1:Chromosome:1:4686137:1 LN:4686137 @SQ SN:Capacity_Monitor_on_pCRIM_lamb LN:3303 @SQ SN:pD864-LacZ LN:5368 @PG ID:bwa PN:bwa VN:0.7.12-r1039 CL:bwa mem -t 13 /data/tellis/analysis/genome/dh10b_gfp_pd864_lacz/dh10b_gfp_pd864_lacz.fa /data/tellis/analysis/trim_galore/tg_dh10b_gfp_pd864_lacz/B6_GGACTCCT-AAGGAGTA_R1_val_1.fq /data/tellis/analysis/trim_galore/tg_dh10b_gfp_pd864_lacz/B6_GGACTCCT-AAGGAGTA_R2_val_2.fq
The header from the control samples B29, G29, R29 is shown as below:
samtools view -H B29_CAGAGAGG-AAGGAGTA.bam @HD VN:1.3 SO:coordinate @SQ SN:Chromosome_dna:chromosome_chromosome:ASM1942v1:Chromosome:1:4686137:1 LN:4686137 @SQ SN:Capacity_Monitor_on_pCRIM_lamb LN:3303 @PG ID:bwa PN:bwa VN:0.7.12-r1039 CL:bwa mem -t 13 /data/tellis/analysis/genome/dh10b_gfp/dh10b_gfp.fa /data/tellis/analysis/trim_galore/tg_dh10b_gfp/B29_CAGAGAGG-AAGGAGTA_R1_val_1.fq /data/tellis/analysis/trim_galore/tg_dh10b_gfp/B29_CAGAGAGG-AAGGAGTA_R2_val_2.fq
As cufflinks had suggested I tried doing the lexicographically by chromosome name and by position. But since I had sorted BAM files using samtools that gave me errors I converted it to a SAM and sorted it using
samtools view -h file.bam | sort -k3,3 -k4,4n - > $file.sam
The header was lost using this. However, I copied the header from the original bam before converting to sam and then pasted it to sam after the conversation and in once set I did not paste the header after conversion to sam.
When I ran the cuffdiff command again it seemed to work on the sam files that had no header but failed on the sam files that I put the header and gave me the same error as before mentioned below:
cuffdiff -p1 -o diff_expr_test/ ../../genome/dh10b/Escherichia_coli_str_k_12_substr_dh10b.ASM1942v1.31.gtf B6_GGACTCCT-AAGGAGTA.sam, G6_GCTACGCT-AAGGAGTA.sam, R6_TAAGGCGA-AAGGAGTA.sam /data/tellis/analysis/bwa_trim_galore/bwa_tg_dh10b_gfp/B29_CAGAGAGG-AAGGAGTA.sam, /data/tellis/analysis/bwa_trim_galore/bwa_tg_dh10b_gfp/G29_GTAGAGGA-AAGGAGTA.sam, /data/tellis/analysis/bwa_trim_galore/bwa_tg_dh10b_gfp/R29_TCCTGAGC-AAGGAGTA.sam
However, even if cuffdiff ran with sam files that had no header it gives me output but a almost no values in genes.fpkm_tracking in length and coverage etc.