I have whole-genome sequencing reads from a papaya variety, and I downloaded the reference genome (Carica papaya, scaffold-level assembly) from NCBI. When I check the quality of my reads, everything looks fine except for the Sequence Length Distribution. When I run bowtie without trimming, I get the following warning:
Warning: skipping mate #1 read ..... because it was < 2 characters long.
The alignment results were:
62674658 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
16140697 + 0 mapped (25.75% : N/A)
62674658 + 0 paired in sequencing
31337329 + 0 read1
31337329 + 0 read2
14389072 + 0 properly paired (22.96% : N/A)
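To double-check the percentages, the rates can be recomputed from the raw counts. This is only a minimal sketch in Python, assuming the summary is in the samtools flagstat text layout shown above; `parse_flagstat` is a hypothetical helper name, not part of any tool.

```python
import re

# The alignment summary from this post, one counter per line.
FLAGSTAT = """\
62674658 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
16140697 + 0 mapped (25.75% : N/A)
62674658 + 0 paired in sequencing
31337329 + 0 read1
31337329 + 0 read2
14389072 + 0 properly paired (22.96% : N/A)
"""

def parse_flagstat(text):
    """Return {label: QC-passed count} for each summary line."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"(\d+) \+ (\d+) (.+)", line)
        if m:
            # Drop a trailing "(...)" annotation so labels are short keys.
            label = re.sub(r"\s*\(.*\)$", "", m.group(3))
            counts[label] = int(m.group(1))
    return counts

counts = parse_flagstat(FLAGSTAT)
mapped_pct = 100 * counts["mapped"] / counts["in total"]
proper_pct = 100 * counts["properly paired"] / counts["paired in sequencing"]
unmapped = counts["in total"] - counts["mapped"]
print(f"mapped: {mapped_pct:.2f}%  properly paired: {proper_pct:.2f}%  unmapped: {unmapped}")
```

So roughly three quarters of the reads (46533961 of 62674658) do not map at all, which is what I am trying to explain.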
When I run bowtie after trimming, I don't get any warnings, but the alignment percentage is the same. I expected a higher alignment percentage, since my sample is also a papaya variety.
Is there a problem with the reads? The Carica papaya reference genome is only a scaffold-level assembly, not a chromosome-level one. Will that matter for the alignment? I've pasted a few lines from my reference FASTA file:
>NW_019011177.1 Carica papaya cultivar SunUp chromosome LG1 unlocalized genomic scaffold, Papaya1.0, whole genome shotgun sequence
TAATAAGAACATAGAGTAATATAATTGTGTTGAAAATTTCAGTATGGAATGAAATGTTTGACAAGCTTTGATAGGGATGC..
gaaaaatattatcctATAAGTAATTATACAAATTGCCGCTTCATTTTAGTAATATATTTCTagttaatttacaaaattac
cATGGATTATGCtctcttaattttataatatatgctctcttttctcatattttgatattgtatttaatatatatgtataa
caagtccataattttattaaaaaaatcataatgaaTACTATAAGTAATTGGAGAGAAGTATGTGATAGTTGGAGAGAAGT
AGATTTGGACACGTTAGtagtagaaaaaataatttctaaacaaCAGTGCCATGTTAGCCGTTGAAATGGATGAGAAATGA
>NW_019011178.1 Carica papaya cultivar SunUp chromosome LG1 unlocalized genomic scaffold, Papaya1.0, whole genome shotgun sequence
CTAAATATCATGTTTGTCTATTTATATCTTTAACTTTGCAAATGTCTAAAGCACTCATGACAAATAGACTCTTAGAAGC
TGAAAGCGGCtttaaattaacattaatatCGAACTGCTTTGCACCTAAATCACACAACAAAACATCAAAATTTTGATGAT
TATTAAGCGGCAGATAGGCTTATCttgattgattatattttaGATACAAAAAAGCAGTTATTTGTGTCATAATTTTCATC
Is there a problem with my reference genome file? Or should I try different alignment software? Could you please guide me on this?

Base Statistics
File type: Conventional base calls
Encoding: Sanger / Illumina 1.9
Total Sequences: 31337329
Sequences flagged as poor quality: 0
Sequence length: 0-151
%GC: 36
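A sequence length range of 0-151 means some reads are empty, which would match the "< 2 characters long" warning. As a rough check, the read lengths in a FASTQ file can be tallied like this. It is a sketch only, assuming plain four-line FASTQ records; `read_lengths` is a hypothetical helper, and the toy reads below are made up for illustration, not taken from my data.

```python
from collections import Counter
import io

def read_lengths(handle):
    """Tally sequence lengths from a 4-line-per-record FASTQ stream."""
    lengths = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            lengths[len(line.rstrip("\n"))] += 1
    return lengths

# Toy FASTQ (made-up reads), including one empty read like those
# that trigger the bowtie warning.
toy = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\n\n+\n\n@r3\nAC\n+\nII\n")
lengths = read_lengths(toy)

# Reads shorter than 2 bases are the ones the aligner skips.
too_short = sum(n for length, n in lengths.items() if length < 2)
print(sorted(lengths.items()), "reads shorter than 2 bases:", too_short)
```

For a real gzipped file, the same function should work on a handle opened with `gzip.open(path, "rt")`.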
Reference Genome Statistics
Assembly level: Scaffold
Assembly: GCA_000150535.1 Papaya1.0
scaffolds: 17,766
contigs: 47,485
N50: 10,650
L50: 7,081
BioProjects: PRJNA264084, PRJNA20267
Whole Genome Shotgun (WGS): INSDC: ABIM00000000.1
Statistics:
total length (Mb): 370.419
protein count: 26103
GC%: 39.0069
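For anyone unfamiliar with the N50 and L50 figures above: as I understand it, N50 is the length of the sequence at which the running total, taken longest first, reaches half of the assembly size, and L50 is how many sequences that takes. A minimal sketch of that definition follows; `n50_l50` is a hypothetical helper and the lengths are toy values, not the papaya assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths.

    Sequences are taken longest first; N50 is the length of the sequence
    at which the running total first reaches half the total size, and
    L50 is the number of sequences needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no sequence lengths given")

toy_lengths = [100, 80, 60, 40, 20]  # total 300, so half is 150
n50, l50 = n50_l50(toy_lengths)
print("N50:", n50, "L50:", l50)
```

An N50 of 10,650 with an L50 of 7,081 suggests the reference is quite fragmented, which is why I am asking whether the assembly level affects the alignment.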