I've processed some scRNA data from raw reads to clustering and used Drop-seq tools for part of the processing (the Drop-seq pipeline relies partly on Picard tools as well). The processing went without trouble, but now that I want to go back, revise my processing and do some extra quality checks, my processing files seem to be corrupted. For example, running SingleCellRnaSeqMetricsCollector on my file star_gene_exon_tagged.bam produces the following error:
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; File /home/path/to/dropseq/output_drop-seq-1.12/star_gene_exon_tagged.bam; Line 1
Line: BAM#Pm@HD VN:1.5 SO:coordinate
The head of my BAM file is posted below. Instead of the # sign on the first row, the file actually contains a small symbol that looks like a binary number (ones and zeroes in a square), which unfortunately is lost when I copy the text.
BAM#Pm@HD VN:1.5 SO:coordinate
@SQ SN:chr1 LN:195471971 M5:c4ec915e7348d42648eefc1534b71c99 UR:file:/media/root/e9558307-3630-41a1-939b-11c494bfc9e2/tools/sccpipe/genomeFastaFiles/metadata_Hanna_06112018.fasta SP:mus_musculus
@SQ SN:chr2 LN:182113224 M5:fe020a692e23f8468b376e14e304a10f UR:file:/media/root/e9558307-3630-41a1-939b-11c494bfc9e2/tools/sccpipe/genomeFastaFiles/metadata_Hanna_06112018.fasta SP:mus_musculus
@SQ SN:chr3 LN:160039680 M5:50f9385167e70825931231ddf1181b80 UR:file:/media/root/e9558307-3630-41a1-939b-11c494bfc9e2/tools/sccpipe/genomeFastaFiles/metadata_Hanna_06112018.fasta SP:mus_musculus
I tried to reheader my file according to this post: "Error parsing SAM header. @RG line missing SM tag. Line: @RG ID:None", but received the following error message:
>> samtools view -H /home/path/to/file/star_gene_exon_tagged.bam | sed 's,^@RG.*,@RG\tID:None\tSM:None\tLB:None\tPL:Illumina,g' | samtools reheader - /home/path/to/file/star_gene_exon_tagged.bam > /home/path/to/new/file/star_gene_exon_tagged_reheader.bam
>>[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
I also tried Picard's ValidateSamFile, but once again got "SAMFormatException on record 01".
I have gzipped these files since the processing step, but in this example I used the original version rather than one extracted from the .gz archive. Could this be the problem? Any ideas on how I could fix this without re-processing all my data? Also, does it matter that the UR path to my metadata in the BAM file has changed since the metadata was created?
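A quick way to narrow this down is to look at the raw bytes of the file rather than its text rendering. A BAM file is normally a BGZF stream: it begins with the gzip bytes 1f 8b 08 04, ends with a fixed 28-byte empty block (the "EOF marker" that samtools warns about), and only after decompression starts with the magic string BAM followed by the byte 0x01. The sketch below is a minimal Python check under those assumptions. The function name classify_bam and the messages it returns are made up for illustration and are not part of samtools, Picard or Drop-seq tools.

```python
import gzip
import os
import sys

# Empty 28-byte BGZF block that terminates a complete BAM file (SAM/BAM spec).
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)
BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip header with the "extra field" flag set
BAM_MAGIC = b"BAM\x01"            # first bytes of the *decompressed* BAM stream


def classify_bam(path):
    """Hypothetical helper: return a short diagnosis string for `path`."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        head = fh.read(4)
        tail = b""
        if size >= len(BGZF_EOF):
            fh.seek(-len(BGZF_EOF), os.SEEK_END)
            tail = fh.read()

    if head == BAM_MAGIC:
        # BAM content is present, but the BGZF compression layer is not.
        return "uncompressed BAM data (no BGZF layer)"
    if head[:2] == b"\x1f\x8b" and head != BGZF_MAGIC:
        return "gzip-compressed, but not BGZF"
    if head != BGZF_MAGIC:
        return "neither BGZF nor raw BAM"
    if tail != BGZF_EOF:
        return "BGZF, but the EOF marker is absent (possibly truncated)"

    # BGZF is valid multi-member gzip, so the standard library can read it.
    with gzip.open(path, "rb") as gz:
        if gz.read(4) != BAM_MAGIC:
            return "BGZF, but the content is not BAM"
    return "looks like a complete BAM"


if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. python check_bam.py /home/path/to/file/star_gene_exon_tagged.bam
    print(classify_bam(sys.argv[1]))
```

If this reports uncompressed BAM data, the BAM magic bytes sit at the very start of the file, which would be consistent with the first line quoted in the Picard error above; that is an interpretation of the error text and not something confirmed here. From the command line, samtools quickcheck performs a comparable header and EOF check.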
Hello berghannaf,
could you please show us all the commands that were used to generate your bam file?
The first lines of the bam file must begin with a @, because this introduces the header information. A # is just as wrong as a "small binary number-looking symbol containing ones and zeroes in a square".
fin swimmer
Hi @finswimmer, thank you for your reply. I started the Drop-seq pipeline, but it hit an error just before the last step, so I restarted only that last step. That is, the first files in the pipeline are generated from this code:
and star_gene_exon_tagged.bam is generated from this: