Question: SO:coordinate Error parsing text SAM file. Line: BAM#Pm@HD VN:1.5 SO:coordinate
13 months ago, chilifan wrote:

I've processed some scRNA-seq data from raw reads to clustering and used Drop-seq tools for part of the processing (the Drop-seq pipeline also relies partly on Picard tools). The processing went without trouble, but now that I want to go back, revise my processing, and do some extra quality checks, my processing files seem corrupted. For example, running SingleCellRnaSeqMetricsCollector on my file star_gene_exon_tagged.bam produces the following error:

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; File /home/path/to/dropseq/output_drop-seq-1.12/star_gene_exon_tagged.bam; Line 1
Line: BAM#Pm@HD VN:1.5  SO:coordinate

The head of my BAM file is posted below. Where the # sign appears on the first row, the file actually contains a small symbol that looks like a binary number (ones and zeroes in a square), which unfortunately is lost when I copy the text.

BAM#Pm@HD   VN:1.5  SO:coordinate 
@SQ SN:chr1 LN:195471971    M5:c4ec915e7348d42648eefc1534b71c99 UR:file:/media/root/e9558307-3630-41a1-939b-11c494bfc9e2/tools/sccpipe/genomeFastaFiles/metadata_Hanna_06112018.fasta   SP:mus_musculus 
@SQ SN:chr2 LN:182113224    M5:fe020a692e23f8468b376e14e304a10f UR:file:/media/root/e9558307-3630-41a1-939b-11c494bfc9e2/tools/sccpipe/genomeFastaFiles/metadata_Hanna_06112018.fasta   SP:mus_musculus 
@SQ SN:chr3 LN:160039680    M5:50f9385167e70825931231ddf1181b80 UR:file:/media/root/e9558307-3630-41a1-939b-11c494bfc9e2/tools/sccpipe/genomeFastaFiles/metadata_Hanna_06112018.fasta   SP:mus_musculus
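
A quick way to see what kind of file this really is, is to look at its first bytes rather than at a text rendering of them. The sketch below is only an illustration: the function names and the returned labels are made up for this example. It relies on two facts from the SAM/BAM format: a normal .bam file is BGZF-compressed and therefore starts with the gzip magic bytes 1f 8b, while the decompressed BAM stream starts with the four bytes "BAM" followed by byte 0x01.

```python
def sniff_alignment_format(first_bytes: bytes) -> str:
    """Classify a file from its leading bytes (labels are hypothetical)."""
    if first_bytes[:2] == b"\x1f\x8b":
        # gzip magic; BGZF (the container used by normal .bam files) shares it
        return "gzip-or-bgzf"
    if first_bytes[:4] == b"BAM\x01":
        # BAM magic without any compression wrapper around it
        return "raw-bam"
    if first_bytes[:1] == b"@":
        # a plain-text SAM header line such as "@HD\tVN:1.5"
        return "sam-text"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first four bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_alignment_format(handle.read(4))
```

If `sniff_file` reported "raw-bam" for star_gene_exon_tagged.bam, that would be consistent with the "BAM" plus unprintable symbol seen at the start of the first line above, and with a tool falling back to its text SAM parser.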

I tried to reheader my file according to this post: "Error parsing SAM header. @RG line missing SM tag. Line: @RG ID:None", but received the following error message:

>> samtools view -H /home/path/to/file/star_gene_exon_tagged.bam | sed 's,^@RG.*,@RG\tID:None\tSM:None\tLB:None\tPL:Illumina,g' |  samtools reheader - /home/path/to/file/star_gene_exon_tagged.bam > /home/path/to/new/file/star_gene_exon_tagged_reheader.bam

>>[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
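
The warning above refers to the BGZF end-of-file marker: a properly written BGZF file ends with a fixed 28-byte empty block, whose bytes are given in the SAM/BAM specification. The following is a minimal sketch for checking whether a file ends with that block; `has_bgzf_eof` is a hypothetical helper name, not part of samtools or any library.

```python
import os

# The 28-byte empty BGZF block that terminates a complete BGZF/BAM file,
# as listed in the SAM/BAM specification.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    "1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(path: str) -> bool:
    """Return True if the file at `path` ends with the BGZF EOF block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        # Jump to the last 28 bytes and compare them with the marker
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

A missing marker does not by itself say why it is missing: a truncated write and a file that is simply not BGZF-compressed would both fail this check.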

I also tried the Picard program ValidateSamFile, but once again got "SAMFormatException on record 01".

After the processing step I gzipped these files, but in this example I used the original version instead of extracting from the .gzip. Could this be the problem? Any ideas on how I should fix this without re-processing all my data? Also, does it matter that the UR path to my metadata in the BAM file has changed since the metadata was created?

modified 13 months ago • written 13 months ago by chilifan

Hello berghannaf,

Could you please show us all the commands that were used to generate your BAM file?

The first lines of the BAM file must begin with an @, because this introduces the header information. A # is just as wrong as a "small binary number-looking symbol containing ones and zeroes in a square".
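
For context on where the @ lines live: in the binary BAM layout described in the SAM/BAM specification, the header text does not sit at byte 0. After BGZF decompression, the stream starts with the 4-byte magic "BAM\1", then a 4-byte little-endian length, and only then the @-prefixed header text. The sketch below parses that prefix from a bytes object; the function name is hypothetical, and it assumes the bytes are already uncompressed.

```python
import struct


def read_raw_bam_header_text(data: bytes) -> str:
    """Extract the SAM header text from an uncompressed BAM byte stream."""
    if data[:4] != b"BAM\x01":
        raise ValueError("data does not start with the BAM magic bytes")
    # l_text: length of the header text, little-endian 32-bit integer
    (l_text,) = struct.unpack_from("<I", data, 4)
    text = data[8:8 + l_text]
    # The text may be NUL-padded; strip that before decoding
    return text.rstrip(b"\x00").decode("ascii", errors="replace")


# Usage with a synthetic, made-up header (not taken from the real file):
header = b"@HD\tVN:1.5\tSO:coordinate\n"
stream = b"BAM\x01" + struct.pack("<I", len(header)) + header
print(read_raw_bam_header_text(stream))
```

Under this layout, a text viewer opening an uncompressed BAM stream would show "BAM", an unprintable character, and a few length bytes directly in front of "@HD", which may be what the first line of the posted file head looks like. Whether that is what happened to this file is an assumption, not something the posted output confirms.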

fin swimmer

written 13 months ago by finswimmer

Hi @finswimmer, thank you for your reply. I started the Drop-seq pipeline, but it failed with an error just before the last step, so I reran only that last step. That is, the first files in the pipeline were generated with this command:

>>/home/path/to/tools/Drop-seq-1.12/ -g /home/path/to/reference/genome_index_STAR -r /home//path/to//meta_data_and_reference/GRCm38.p6.genome.fa -d /home/path/to/tools/Drop-seq-1.12 -o /home/path/to/tools/Drop-seq-1.12/output -t /home/path/to/tools/Drop-seq-1.12/temp -s /home/path/to/tools/STAR/bin/Linux_x86_64_static/STAR /home/path/to/tools/Drop-seq-1.12/output/Sample_CP1-1B.bam

and star_gene_exon_tagged.bam is generated from this:

>>/home/path/to/tools/Drop-seq-1.12/TagReadWithGeneExon INPUT=/home/path/to/tools/Drop-seq-1.12/temp/merged.bam OUTPUT=/home/path/to/tools/Drop-seq-1.12/temp/star_gene_exon_tagged.bam ANNOTATIONS_FILE=/home/path/to/meta_data_and_reference/GRCm38.p6.genome.fa.refFlat TAG=GE CREATE_INDEX=true STRAND_TAG=GS FUNCTION_TAG=XF USE_STRAND_INFO=true ALLOW_MULTI_GENE_READS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=10000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
written 13 months ago by chilifan