Question: HTseq code error repeatedly
0
gravatar for bnayer26
5 months ago by
bnayer260
bnayer260 wrote:

Hi, I am really new to bioinformatics so please help me figure out this error. I have tried looking at other threads with similar questions but couldn't resolve my problem.

I get the following error when I am using HTseq for counting:

Error occured when reading beginning of SAM/BAM file.
('SAM line does not contain at least 11 tab-delimited fields.',
'line 1 of file Sorted_KKUGCTM5_-VE_14-4_aligned.bam')
[Exception type: ValueError, raised in _HTSeq.pyx:1276]

The code I am using is as follows:

/HTSeq-0.6.1/scripts/htseq-count --stranded=no Sorted_KKUGCTM5_-VE_14-4_aligned.bam GRCh38/Homo_sapiens.GRCh38.90.gtf

I have tried using the above code, then I also tried converting the above bam file to sam file, and then used the sam file in the above-mentioned code, and I still get the following error:

Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
Warning: Read 700463F:369:CB93CANXX:5:1103:15415:8299 claims to have an aligned mate which could not be found in an adjacent line.
Error occured when processing SAM input (line 57 of file Sorted_KKUGCTM5_-VE_14-4_aligned1.sam):
  'pair_alignments' needs a sequence of paired-end alignments
 [Exception type: ValueError, raised in __init__.py:603]

Please help where am I going wrong?

rna-seq htseq • 338 views
ADD COMMENTlink modified 5 months ago by genomax70k • written 5 months ago by bnayer260
2
gravatar for michael.ante
5 months ago by
michael.ante3.3k
Austria/Vienna
michael.ante3.3k wrote:

Hi, Please use the coding button to format your text. This will help others to understand your problem more easily.

To my understanding, htseq is assuming that your bam file is sorted a certain way. This assumption is violated by your file.

You can use the parameter -r pos if your reads are positionally sorted -r name otherwise.

Cheers, Michael

ADD COMMENTlink written 5 months ago by michael.ante3.3k
1

Also note that the default input format for htseq-count is sam, which is why the software yelled at you for not providing the right format. Rather than converting your data to .sam, you should tell htseq-count to expect .bam format.

ADD REPLYlink written 5 months ago by swbarnes26.0k

What is the parameter I should introduce to specify that it is a bam file in the input? Shall I add -f bam in my HTseq code?

ADD REPLYlink modified 5 months ago • written 5 months ago by bnayer260

thanks it got figured out by updating python! and by specifying input as bam and specifying that it is sorted by name. thanks to everyone.

ADD REPLYlink written 5 months ago by bnayer260

Thanks for pointing out how I should format my questions, I have done that now. With regards to the sorting by position or name, I believe the conversion of sam file to bam file and its subsequent sorting using the samtools code leads to sorting by name by default, so I am assuming mine is sorted by name because I continued with default parameters itself. Then in the Htseq website said that the default in Htseq is also by name so I didn't add that argument in my code. I still tried it now but the same error is coming up :(

ADD REPLYlink written 5 months ago by bnayer260
3

subsequent sorting using the samtools code leads to sorting by name

No. Default is co-ordinate sort. Here is samtools sort help.

Sort alignments by leftmost coordinates, or by read name when -n is used. An appropriate @HD-SO sort order header tag will be added or an existing one updated if necessary.

Can you show what samtools view -H your.bam | head -5 looks like?

ADD REPLYlink modified 5 months ago • written 5 months ago by genomax70k
@HD VN:1.0  SO:queryname
@SQ SN:chr1 LN:248956422
@SQ SN:chr10    LN:133797422
@SQ SN:chr11    LN:135086622
@SQ SN:chr12    LN:133275309

this is what it looks like. is this the expected outcome? Thanks. I have added the <-n> in my code as well. The above output is what it looks like after I ran the code with <-n>

ADD REPLYlink modified 5 months ago • written 5 months ago by bnayer260
1

Looks like you file is sorted based on the sequence headers (names). So you will need to use -r name option as noted above by @michael.ante with htseq.

ADD REPLYlink written 5 months ago by genomax70k

Hi Genomax, could you kindly take a moment to explain this to me. When I run the command

samtools view -H mysample.bam | head

Then I get the following output:

@HD VN:1.0  SO:unsorted
@SQ SN:chr1 LN:248956422
@SQ SN:chr10    LN:133797422
@SQ SN:chr11    LN:135086622
@SQ SN:chr12    LN:133275309
@SQ SN:chr13    LN:114364328
@SQ SN:chr14    LN:107043718
@SQ SN:chr15    LN:101991189
@SQ SN:chr16    LN:90338345
@SQ SN:chr17    LN:83257441

First, is this how the expected output is? Second, does that mean that in my bam file I only have reads which have mapped to chromosomes 1, 10, 11, 12, 13, 14, 15, 16, 17? I don't understand what this means...

ADD REPLYlink written 4 months ago by bnayer260

That is what the header of a bam file should look like. Have you looked up what the 'head' command does?

ADD REPLYlink written 4 months ago by swbarnes26.0k

Thank you for asking me that question. Due to my lack of coding knowledge it didn't even strike me that head is a command. I looked it up and found that "head" only returns the first 10 lines. Also, I learnt about "tail, less and more" commands. And upon running the less and more commands I found all the chromosomes in the output. When I run

samtools view -H mysample.bam | more

Then the output looks like this

 @HD    VN:1.0  SO:coordinate
    @SQ SN:chr1 LN:248956422
    @SQ SN:chr10    LN:133797422
    @SQ SN:chr11    LN:135086622
    @SQ SN:chr12    LN:133275309
    @SQ SN:chr13    LN:114364328
    @SQ SN:chr14    LN:107043718
    @SQ SN:chr15    LN:101991189
    @SQ SN:chr16    LN:90338345
    @SQ SN:chr17    LN:83257441
    @SQ SN:chr18    LN:80373285
    @SQ SN:chr19    LN:58617616
    @SQ SN:chr2 LN:242193529
    @SQ SN:chr20    LN:64444167
    @SQ SN:chr21    LN:46709983
    @SQ SN:chr22    LN:50818468
    @SQ SN:chr3 LN:198295559
    @SQ SN:chr4 LN:190214555
    @SQ SN:chr5 LN:181538259
    @SQ SN:chr6 LN:170805979
    @SQ SN:chr7 LN:159345973
    @SQ SN:chr8 LN:145138636
    @SQ SN:chr9 LN:138394717
    @SQ SN:chrX LN:156040895
    @SQ SN:chrY LN:57227415
    @SQ SN:chrMT    LN:16569
    @PG ID:hisat2   PN:hisat2   VN:2.1.0    CL:"/extstor/softwares/hisat2-2.1.0/hisat2-align-s --wrapper basic-0 -q -p 8 --min-intronlen 50 --max-intronlen 250000--dta-cufflinks --new-summary --summary-file /extstor/su
    dhirlab/Reety/Bhavana/Set1_Tina_Trimmed/summary_CFPACG_minus_1.txt -x /extstor/sudhirlab/Reety/Bhavana/Indexed_Genomes/GRCh38/Index_Hisat_GRCh38 -S /extstor/sudhirlab/Reety/Bhavana/Set1_Tina_Trimmed/CFPACG_minus_1_aligned.sam -1 /
    tmp/256472.inpipe1 -2 /tmp/256472.inpipe2"

This looks fine to me now. Thanks again!

ADD REPLYlink written 4 months ago by bnayer260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1539 users visited in the last hour