Question: SO:coordinate Error parsing text SAM file. Line: BAM#Pm@HD VN:1.5 SO:coordinate
3 months ago, chilifan wrote:

I've processed some scRNA-seq data from raw reads to clustering, using Drop-seq tools for part of the processing (the Drop-seq pipeline also relies partly on Picard tools). The processing ran without trouble, but now that I want to go back, revise it, and do some extra quality checks, the files from that processing seem to be corrupted. For example, running SingleCellRnaSeqMetricsCollector on my file star_gene_exon_tagged.bam produces the following error:

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Not enough fields; File /home/path/to/dropseq/output_drop-seq-1.12/star_gene_exon_tagged.bam; Line 1
Line: BAM#Pm@HD VN:1.5  SO:coordinate

The head of my BAM file is posted below. In the actual file, the first row does not contain a # sign at that position but a small symbol that looks like a binary number (ones and zeroes in a square), which unfortunately is lost when I copy the text.

BAM#Pm@HD   VN:1.5  SO:coordinate 
@SQ SN:chr1 LN:195471971    M5:c4ec915e7348d42648eefc1534b71c99 UR:file:/media/root/e9558307-3630-41a1-939b-11c494bfc9e2/tools/sccpipe/genomeFastaFiles/metadata_Hanna_06112018.fasta   SP:mus_musculus 
@SQ SN:chr2 LN:182113224    M5:fe020a692e23f8468b376e14e304a10f UR:file:/media/root/e9558307-3630-41a1-939b-11c494bfc9e2/tools/sccpipe/genomeFastaFiles/metadata_Hanna_06112018.fasta   SP:mus_musculus 
@SQ SN:chr3 LN:160039680    M5:50f9385167e70825931231ddf1181b80 UR:file:/media/root/e9558307-3630-41a1-939b-11c494bfc9e2/tools/sccpipe/genomeFastaFiles/metadata_Hanna_06112018.fasta   SP:mus_musculus

I tried to reheader my file following this post: "Error parsing SAM header. @RG line missing SM tag. Line: @RG ID:None", but received the following error message:

>> samtools view -H /home/path/to/file/star_gene_exon_tagged.bam | sed 's,^@RG.*,@RG\tID:None\tSM:None\tLB:None\tPL:Illumina,g' |  samtools reheader - /home/path/to/file/star_gene_exon_tagged.bam > /home/path/to/new/file/star_gene_exon_tagged_reheader.bam

>>[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
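
The warning refers to the 28-byte empty BGZF block that the SAM/BAM specification places at the end of a complete BAM file. A minimal Python sketch for checking whether a file ends with that marker is shown here; the function name has_bgzf_eof is hypothetical, and the check says nothing about damage elsewhere in the file.

```python
# The 28-byte empty BGZF block that terminates a complete BAM file
# (byte values as listed in the SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff0600" "42430200"
    "1b000300" "00000000" "00000000"
)


def has_bgzf_eof(path):
    """Return True if the file at `path` ends with the BGZF EOF marker."""
    with open(path, "rb") as handle:
        handle.seek(0, 2)              # jump to the end to learn the size
        size = handle.tell()
        if size < len(BGZF_EOF):
            return False               # too short to contain the marker
        handle.seek(size - len(BGZF_EOF))
        return handle.read() == BGZF_EOF
```

If has_bgzf_eof returns False for star_gene_exon_tagged.bam, that would be consistent with the warning above: the file is either truncated or is not BGZF-compressed at all.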

I also tried Picard's ValidateSamFile, but once again got "SAMFormatException on record 01".

After the processing step I gzipped these files, but for this example I used the original file rather than one extracted from the .gz archive. Could this be the problem? Any ideas on how to fix this without re-processing all my data? Also, does it matter that the UR path to my metadata in the BAM file has changed since the metadata was created?


Hello berghannaf,

could you please show us all the commands that were used to generate your BAM file?

The first lines of the BAM file must begin with @, because this introduces the header information. A # is just as wrong as a "small binary number-looking symbol containing ones and zeroes in a square".
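
To see what the file actually starts with, its leading bytes can be inspected directly. The sketch below is an illustration based on the magic numbers in the SAM/BAM specification, not something taken from this thread: a BGZF-compressed BAM starts with the gzip bytes 1f 8b, the BAM stream inside it starts with "BAM" followed by the byte 0x01, and a SAM text header starts with "@". The function name and the return labels are made up for the example.

```python
def classify_alignment_file(path):
    """Guess the container format from the first four bytes of a file."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head[:2] == b"\x1f\x8b":
        # gzip magic: a regular BAM is BGZF-compressed and starts like this
        # (a plain .gz file starts the same way, so this is only a first hint)
        return "gzip-or-bgzf"
    if head == b"BAM\x01":
        # BAM magic with no compression layer around it
        return "uncompressed-bam"
    if head[:1] == b"@":
        # text header line such as @HD or @SQ
        return "sam-text-header"
    return "unknown"
```

A result of "uncompressed-bam" would match the "BAM" plus unprintable byte seen at the start of the line in the error message, and would suggest that the file has lost its BGZF compression layer (for instance by being run through gunzip). That is a guess to verify, not a diagnosis.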

fin swimmer

Reply written 3 months ago by finswimmer

Hi @finswimmer, thank you for your reply. I started the Drop-seq pipeline, but it failed with an error just before the last step, so I reran only that last step. That is, the first files in the pipeline were generated by this command:

>>/home/path/to/tools/Drop-seq-1.12/ -g /home/path/to/reference/genome_index_STAR -r /home//path/to//meta_data_and_reference/GRCm38.p6.genome.fa -d /home/path/to/tools/Drop-seq-1.12 -o /home/path/to/tools/Drop-seq-1.12/output -t /home/path/to/tools/Drop-seq-1.12/temp -s /home/path/to/tools/STAR/bin/Linux_x86_64_static/STAR /home/path/to/tools/Drop-seq-1.12/output/Sample_CP1-1B.bam

and star_gene_exon_tagged.bam was generated by this one:

>>/home/path/to/tools/Drop-seq-1.12/TagReadWithGeneExon INPUT=/home/path/to/tools/Drop-seq-1.12/temp/merged.bam OUTPUT=/home/path/to/tools/Drop-seq-1.12/temp/star_gene_exon_tagged.bam ANNOTATIONS_FILE=/home/path/to/meta_data_and_reference/GRCm38.p6.genome.fa.refFlat TAG=GE CREATE_INDEX=true STRAND_TAG=GS FUNCTION_TAG=XF USE_STRAND_INFO=true ALLOW_MULTI_GENE_READS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=10000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json

Reply written 3 months ago by chilifan