Hello everyone. I am working with SRA data for whole-exome sequencing analysis. I am having a problem with the SAM file that I created after alignment. Here are all the steps I ran.
fastq-dump --split-files SRR1178899.sra
fastqc *.fq
bwa mem -t 12 -Y -L 0 -M -R "@RG\tID:sample\tSM:sample\tPL:Illumina" /mnt/nas/reference_genome/BWA/mammals/hg38/genome.fa R1_step1.fq R2_step1.fq > aligned_reads.sam
After this, when I check the file:
samtools quickcheck aligned_reads.sam
aligned_reads.sam was not identified as sequence data.
samtools view -H aligned_reads.sam
[main_samview] fail to read the header from "aligned_reads.sam".
less aligned_reads.sam
@SQ SN:chrUn_KI270363v1 LN:1803
@SQ SN:chrUn_KI270364v1 LN:2855
@SQ SN:chrUn_KI270362v1 LN:3530
@SQ SN:chrUn_KI270366v1 LN:8320
@SQ SN:chrUn_KI270378v1 LN:1048
@SQ SN:chrUn_KI270379v1 LN:1045
@SQ SN:chrUn_KI270389v1 LN:1298
@SQ SN:chrUn_KI270390v1 LN:2387
@SQ SN:chrUn_KI270387v1 LN:1537
@SQ SN:chrUn_KI270395v1 LN:1143
@SQ SN:chrUn_KI270396v1 LN:1880
@SQ SN:chrUn_KI270388v1 LN:1216
@SQ SN:chrUn_KI270394v1 LN:970
@SQ SN:chrUn_KI270386v1 LN:1788
@SQ SN:chrUn_KI270391v1 LN:1484
@SQ SN:chrUn_KI270383v1 LN:1750
@SQ SN:chrUn_KI270393v1 LN:1308
@SQ SN:chrUn_KI270384v1 LN:1658
@SQ SN:chrUn_KI270392v1 LN:971
@SQ SN:chrUn_KI270381v1 LN:1930
@SQ SN:chrUn_KI270385v1 LN:990
@SQ SN:chrUn_KI270382v1 LN:4215
@SQ SN:chrUn_KI270376v1 LN:1136
@SQ SN:chrUn_KI270374v1 LN:2656
@SQ SN:chrUn_KI270372v1 LN:1650
@SQ SN:chrUn_KI270373v1 LN:1451
@SQ SN:chrUn_KI270375v1 LN:2378
@SQ SN:chrUn_KI270371v1 LN:2805
@SQ SN:chrUn_KI270448v1 LN:7992
@SQ SN:chrUn_KI270521v1 LN:7642
@SQ SN:chrUn_GL000195v1 LN:182896
@SQ SN:chrUn_GL000219v1 LN:179198
@SQ SN:chrUn_GL000220v1 LN:161802
@SQ SN:chrUn_GL000224v1 LN:179693
@SQ SN:chrUn_KI270741v1 LN:157432
@SQ SN:chrUn_GL000226v1 LN:15008
@SQ SN:chrUn_GL000213v1 LN:164239
@SQ SN:chrUn_KI270743v1 LN:210658
@SQ SN:chrUn_KI270744v1 LN:168472
@SQ SN:chrUn_KI270745v1 LN:41891
@SQ SN:chrUn_KI270746v1 LN:66486
@SQ SN:chrUn_KI270747v1 LN:198735
@SQ SN:chrUn_KI270748v1 LN:93321
@SQ SN:chrUn_KI270749v1 LN:158759
@SQ SN:chrUn_KI270750v1 LN:148850
@SQ SN:chrUn_KI270751v1 LN:150742
@SQ SN:chrUn_KI270752v1 LN:27745
@SQ SN:chrUn_KI270753v1 LN:62944
@SQ SN:chrUn_KI270754v1 LN:40191
@SQ SN:chrUn_KI270755v1 LN:36723
@SQ SN:chrUn_KI270756v1 LN:79590
@SQ SN:chrUn_KI270757v1 LN:71251
@SQ SN:chrUn_GL000214v1 LN:137718
@SQ SN:chrUn_KI270742v1 LN:186739
@SQ SN:chrUn_GL000216v2 LN:176608
@SQ SN:chrUn_GL000218v1 LN:161147
@SQ SN:chrEBV LN:171823
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 12 -M /mnt/nas/reference_genome/BWA/mammals/hg38/genome.fa R1_step1.fq R2_step1.fq
SRR1178899.1 77 * 0 0 * * 0 0 TNGTTCCAGCGACAGCCCATCCTATAGCACTCTCCAGGAGAGAAATCCAGCACACAAAAAAGATTCTACATCTATTAAGTAAGTGAGGTCTGAGTTGGAT =#14:ADD=D@FFGCGHIIAHHGGGDC@DHC?::BGFHG;?(?FH>4.8))7=8(5-=?AB####################################### AS:i:0 XS:i:0
SRR1178899.1 141 * 0 0 * * 0 0 ACATATATTGGAAACTACAACACTATGGGGAAGAGAACCAATTCAGAACTCAATAACTTAATAGAAGGAGAAGCTTTTTGATGTACTATATTTCTCTCCA #################################################################################################### AS:i:0 XS:i:0
SRR1178899.2 77 * 0 0 * * 0 0 TNTTTCCAGCGACAGCCCATCCTATAGCACTCTCCAGGAGAGAAATTTAGTACACAATAAGGAGACCCCTCGTCTTAAGTGCGGTCGGTAAGAGTCGGAT <#144ADDDDDDDIIIIIIIIIIIIIIIIIIIIIIIIIDIDIDIIIICIICEEICC############################################ AS:i:0 XS:i:0
SRR1178899.2 141 * 0 0 * * 0 0 AGGATTTAAATAGGCGCTCGGGGTCTGCAATAGCCCCCAGCTGCGTGGTAAATCTGCCTCACGGAGTGTCCTGTAGGATTGGCTACTACGGGGAACGCAG #################################################################################################### AS:i:0 XS:i:0
SRR1178899.3 83 chr12 94582927 60 45M55S = 94582829 -143 ATCCAACTCAGAACTCACTCACTTAATAGAAGGAGAATCTTTTTTGTGTACTAAATTTCTCTCCTGGAGAGTGCTATAGGATGGGCTGTCGCTGGATGNA ###DDDCCA@:;5@A:ACCC;:B>-C>33CCECC>C@EDFGIGGGIHHEDJIIGHJHGGD=FEGFGIFIIJHEDCGFJJIHGHFE@GHFCFAFFDDA1#@ NM:i:0 MD:Z:45 MC:Z:100M AS:i:45 XS:i:0
SRR1178899.3 163 chr12 94582829 60 100M = 94582927 143 ATGAGGTCACCAGTCAGTCCCGGTCTCCCAAAGTGCCCAGGTAACTGGAATGCCTGCCATGCCACATTCACTGGGAACTTCACCACTATGGGGAACGCAT @@BFFFFFFHHBCAFE@EICFFG@GFCHGHGID?BFGGIJIFDFGEHEFFGHEHIJJJIJJIIEHHHHHHHF>B?BEDDECECDDBCDDD>ABDBDBB@B NM:i:1 MD:Z:60A39 MC:Z:45M55S AS:i:95 XS:i:20
SRR1178899.4 83 chr1 121509428 60 46M54S = 121509376 -98 ATGCCCAACAATGACAGACTGAATAAAGAAATTGTGCTACATATATGTGTACTAAGTTTCTCTCCTGGAGAGTGCTATAGGAGGGCTGTCGCTGGAGCNC 9528?<CA:(55(A@;(>>;@@A;.;7););6..7A?==HDC;@=)..///8.=B8>EGEHHBFD?*FGFB9B<F>HAHGHFE;BG@AFFAA:DBA41#? NM:i:0 MD:Z:46 MC:Z:98M2S AS:i:46 XS:i:30
SRR1178899.4 163 chr1 121509376 60 98M2S = 121509428 98 AAATGTTTACTGCAACATTATTCATGATAGCAAAGATATGAAATCAACCTAAATGCCCAACAATGACAGACTGAATAAAGAAATTGTGCTACATATATGT @@@DFDDFFFHHHC@<:CC?IHCBHHIICH???DFCGGIIIGIIDAC??D:BAFH@?F?DDBEHGGF
So I created a header.txt file for the hg38 genome, which is my reference genome.
@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chr10 LN:133797422
@SQ SN:chr11 LN:135086622
@SQ SN:chr12 LN:133275309
@SQ SN:chr13 LN:114364328
@SQ SN:chr14 LN:107043718
@SQ SN:chr15 LN:101991189
@SQ SN:chr16 LN:90338345
@SQ SN:chr17 LN:83257441
@SQ SN:chr18 LN:80373285
@SQ SN:chr19 LN:58617616
@SQ SN:chr20 LN:64444167
@SQ SN:chr21 LN:46709983
@SQ SN:chr22 LN:50818468
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@SQ SN:chrM LN:16569
cat header.txt aligned_reads.sam > aligned_header.sam
samtools quickcheck aligned_header.sam
align_header.sam was not identified as sequence data.
Could you please help with this? It is very urgent. Thank you in advance.
I can provide more information if you need it.
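One thing that can be checked at this point: the aligned_reads.sam shown above already carries its own @SQ and @PG lines, so prepending header.txt with cat would leave two header blocks in aligned_header.sam, most likely with repeated sequence names such as chr1. The following is only a minimal sketch for listing repeated @SQ names; the function name dup_sq_names is hypothetical, and it assumes a tab-delimited SAM file.

```shell
# Hypothetical helper: print every @SQ sequence name that occurs more than once
# in the header of a tab-delimited SAM file.
dup_sq_names() {
    awk -F '\t' '
        !/^@/ { exit }                 # stop at the first alignment record
        $1 == "@SQ" {
            for (i = 2; i <= NF; i++)  # find the SN: field, wherever it sits
                if ($i ~ /^SN:/ && ++seen[substr($i, 4)] == 2)
                    print substr($i, 4)
        }
    ' "$1"
}

# Usage (file name taken from the post):
#   dup_sq_names aligned_header.sam
```

If this prints any names, the concatenated file has conflicting header entries, which would point toward fixing the original file rather than adding a second header on top of it.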
Where is the '@RG' line in the header, anyway? Are you sure you're looking at the correct file?
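The mismatch behind this question is visible in the pasted header: the @PG line records `bwa mem -t 12 -M ...`, without the -Y, -L 0 and -R options from the posted command, and no @RG line appears in the pasted header. A rough sketch for checking this on any SAM file is below; the function name check_sam_header is hypothetical, and it only reads the header lines at the top of the file.

```shell
# Hypothetical helper: show the @PG line(s) and count @RG lines in a SAM header.
check_sam_header() {
    awk '
        !/^@/  { exit }                 # header ends at the first alignment record
        /^@RG/ { rg++ }                 # count read-group lines
        /^@PG/ { print "PG: " $0 }      # show the recorded command line
        END    { print "RG lines: " (rg + 0) }
    ' "$1"
}

# Usage (file name taken from the post):
#   check_sam_header aligned_reads.sam
```

If the printed command line differs from the one that was supposedly run, the file was probably produced by an earlier or different bwa invocation.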
Thank you for your reply.
I am not very sure about this RG. Could you explain it a bit?
https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
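To connect the linked article to the command in the question: the string given to `bwa mem -R` is written into the output as an @RG header line, and each aligned read gets a matching RG:Z: tag. As a small illustration, the sketch below composes the header line that the posted -R value would correspond to; the function name make_rg_line is hypothetical, and the values are the ones from the question.

```shell
# Hypothetical helper: build an @RG header line from ID, SM (sample) and PL (platform).
make_rg_line() {
    printf '@RG\tID:%s\tSM:%s\tPL:%s\n' "$1" "$2" "$3"
}

# Values from the posted bwa mem command; prints a tab-separated @RG line.
make_rg_line sample sample Illumina
```

This shows how the line looks inside the SAM header, with real tab characters. On the bwa command line the tabs are written as the two-character escape `\t`, as in the original command.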