Question: Sort bam file by coordinates using samtools
0
gravatar for Shahzad
6 months ago by
Shahzad10
Pakistan
Shahzad10 wrote:

I have used samtools to make bam files and it is perfectly showing the data in igv. somehow i am unable to get the counts when using htseq. after going through literature i think i needed to sort bam files according to coordinates. is there anyone can share a command example that how a sorted bam can be generated using samtools. thanks

ps. currently is gives an empty txt file when i run htseq on bam file.

previously i used

samtools sort -@ 8 -o file1.bam file1.sam

how to specify sort to coordinates?

ADD COMMENTlink modified 6 months ago by finswimmer11k • written 6 months ago by Shahzad10

Please edit the post type to Question instead of Tool as Tool is reserved for posts introducing new software tools and/or packages.

Take a look at samtools manual to find out how to sort and index a .sam or .bam file. http://www.htslib.org/doc/samtools.html

samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam   
samtools index aln.sorted.bam
ADD REPLYlink modified 6 months ago • written 6 months ago by Sej Modha4.1k

samtools sort by coordinates by default so your command should work but you need to add -b param to output in BAM

ADD REPLYlink modified 6 months ago • written 6 months ago by Nicolas Rosewick7.5k
1

Hello Nicolas Rosewick ,

the -b parameter isn't needed anymore when using -o to define the output filename. samtools recognize the filextension automaticly.

fin swimmer

EDIT: And samtools sort never had a -b option. Here it is called -O

-O FORMAT

    Write the final output as sam, bam, or cram.

    By default, samtools tries to select a format based on the -o filename extension; if output is to standard output or no format can be deduced, bam is selected.
ADD REPLYlink modified 6 months ago • written 6 months ago by finswimmer11k

My bad I had samtools view in mind :/

ADD REPLYlink written 6 months ago by Nicolas Rosewick7.5k

it gives an error when i run it with -b

 invalid option -- 'b'

if samtools by default is set to coordinated bam output then can you please suggest any reason why the htseq is generating empty text file when i used the bam file. while when i upload same bam file with indexed bai file in igv genome viewer it works perfectly.

i am using this command for htseq

htseq-count file1.bam ref.gtf -o file1.txt
ADD REPLYlink modified 6 months ago by finswimmer11k • written 6 months ago by Shahzad10
2

I don't know if it is your problem. But the manual say a htseq-count command looks like this.

$ htseq-count [options] <alignment_files> <gff_file>

Furthermore it looks like you have to specify that your input is bam and not sam. If your sort your data by position, you have to tell it to htseq-count as well.

$ htseq-count -f bam -r pos -o file1.sam file1.bam ref.gtf

All this I have found in the manual.

fin swimmer

ADD REPLYlink modified 6 months ago • written 6 months ago by finswimmer11k
2

It's likely the chromosome names don't match. IGV can handle (some) chromosome naming differences (e.g., it knows that 1 and chr1 are the same), htseq-count can't (featureCounts can).

ADD REPLYlink written 6 months ago by Devon Ryan89k

Can you please post a few lines from the .bam file to check that the input .bam file is in correct format. Which version of htseq package are you using?

samtools view file.bam|head

ADD REPLYlink modified 6 months ago • written 6 months ago by Sej Modha4.1k
SRR7042071.15979724     133     1       1687    0       *       =       1687    0       CAAGCATATACCCCCCCAGCATCCATCCTCAATCAGGCAACGTTCGGGCATGCAGCAGCCATTATCATATCCATT    AA---A<<JJ<AA<<-<<--7FJJ--<FJ<7--F--<AFJFJ--<<FJF7JJJJ<J-7A<JJJJ<AFA--77-AJ      YT:Z:UP
SRR7042071.15979724     89      1       1687    60      75M     =       1687    0       GCCATTATCATTGCCATTGCCAATGCGTCTGCTACACTACACACCAAAACCGTACCACTGATCCATATGCCTCTC    JJFJJJJJFJ<<JJJFFFAF7<A<FJA-JJJAJJJJJJJJJAJJJJJ-FJJJFJAJJJJJJJJJJJJJJJFFAAA      AS:i:-3 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:27C47      YT:Z:UP NH:i:1
SRR7042071.28816852     99      1       11150   60      75M     =       11335   260     CTGGATTTCTCAGTAGACAAAGATAGATAACAAAAATAGCTCTCTAGAGTATACACAATATATTAAAAAGTTGTT    AAFFFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJFJJJJJJFJJJJJJJJJJJJJJJJJJJJJ      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:75 YS:i:0  YT:Z:CP NH:i:1
SRR7042071.11274015     99      1       11158   60      75M     =       11328   245     CTCAGTAGACAAAGATAGATAACAAAAATAGCTCTCTAGAGTATACACAATATATTAAAAAGTTGTTAGAGAGTG    AAF<FJJJFJJJJJJJAJJJJFJJJ<FFFJJJJJJJJJJ<FJJJJJFJFFJ<JFJJF<FJFJFFJJFJFJJJJFF      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:75 YS:i:0  YT:Z:CP NH:i:1
SRR7042071.11274481     99      1       11158   60      75M     =       11328   245     CTCAGTAGACAAAGATAGATAACAAAAATAGCTCTCTAGAGTATACACAATATATTAAAAAGTTGTTAGAGAGTG    AAFFFJFJJJJFJJFJFJFJJJJJJJJJJJJJJJJJJFJ7FJJJJJFJF7FJJJJJJFJJJJJJJJJJJJJJJAF      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:75 YS:i:0  YT:Z:CP NH:i:1
SRR7042071.22370114     99      1       11177   60      8S67M   =       11311   217     TTTTTTTTTAACAAAAATAGCTCTCTAGAGTATACACAATATATTAAAAAGTTGTTAGAGAGTGAAAATATATAG    AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJAJA      AS:i:-11        XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:67 YS:i:0  YT:Z:CP NH:i:1
SRR7042071.22370112     99      1       11177   60      8S67M   =       11311   217     TTTTTTTTTAACAAAAATAGCTCTCTAGAGTATACACAATATATTAAAAAGTTGTTAGAGAGTGAAAATATATAG    AAAAAJJFAJFJJFFJAFJJFFAFAAFFJJ<JFJJJAA7AJFFFAJF-<7F7<7JAFF<JJF-<-FAF<JFFFJA      AS:i:-10        XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:67 YS:i:-9 YT:Z:CP NH:i:1
SRR7042071.25380000     99      1       11189   60      75M     =       11444   330     CTCTCTAGAGTATACACAATATATTAAAAAGTTGTTAGAGAGTGAAAATATATAGAAAACAATTTTATACAGATG    AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<JJF-AAJFJFJJJFA      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:75 YS:i:-3 YT:Z:CP NH:i:1
SRR7042071.12830270     99      1       11191   60      75M     =       11338   222     CTCTAGAGTATACACAATATATTAAAAAGTTGTTAGAGAGTGAAAATATATAGAAAACAATTTTATACAGATGAT    AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJFJJJFFJ<      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:75 YS:i:0  YT:Z:CP NH:i:1
SRR7042071.17500854     99      1       11191   60      75M     =       11349   233     CTCTAGAGTATACACAATATATTAAAAAGTTGTTAGAGAGTGAAAATATATAGAAAACAATTTTATACAGATGAT    A<A<FF<F-FF<FJJFAFJJJJJ<<-F<-<F<<FAA7AFJ-<FAAJJJJJFJ-FFJJAJJ--<FJFJAJFFJ<FA      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:75 YS:i:0  YT:Z:CP NH:i:1
ADD REPLYlink modified 6 months ago by Sej Modha4.1k • written 6 months ago by Shahzad10
> testdata> htseq-count -f bam SRR7042071.bam -r pos -o SRR7042071.txt Sor.gtf
> /home/miniconda2/lib/python2.7/site-packages/HTSeq/__init__.py:9:
> RuntimeWarning: numpy.dtype size changed, may indicate binary
> incompatibility. Expected 96, got 88   from _HTSeq import * usage:
> htseq-count [options] alignment_file gff_file
> **htseq-count: error: too few arguments**
> testdata> htseq-count -f bam SRR7042071.bam -r pos -o SRR7042071.bam SRR7042071.sam Sor.gtf
> /home/miniconda2/lib/python2.7/site-packages/HTSeq/__init__.py:9:
> RuntimeWarning: numpy.dtype size changed, may indicate binary
> incompatibility. Expected 96, got 88   from _HTSeq import * usage:
> htseq-count [options] alignment_file gff_file htseq-count: error: too
> few arguments testdata> htseq-count -f bam SRR7042071.bam -r pos -o
> Sor.gtf
> /home/miniconda2/lib/python2.7/site-packages/HTSeq/__init__.py:9:
> RuntimeWarning: numpy.dtype size changed, may indicate binary
> incompatibility. Expected 96, got 88   from _HTSeq import * usage:
> htseq-count [options] alignment_file gff_file 
> htseq-count: error: too few arguments
>htseq-count -f bam SRR7042071.bam -r pos -o Sorghum.gtf

i tried all the possible variations

ADD REPLYlink modified 6 months ago by WouterDeCoster38k • written 6 months ago by Shahzad10
1

Hi There,

It is important to know what command you are running and understand the parameters specified in the command. You will not get the desired output by randomly trying different combinations of the parameters.

htseq-count -f bam -r pos -o test_htseq-count_output.txt SRR7042071.bam Sorghum.gtf

If you are using a non-Ensembl GTF file then ensure that you are providing relevant tags for -t and -i parameters as described in the manual.

-t FEATURETYPE, --type=FEATURETYPE
                        feature type (3rd column in GFF file) to be used, all
                        features of other type are ignored (default, suitable
                        for Ensembl GTF files: exon)
-i IDATTR, --idattr=IDATTR
                        GFF attribute to be used as feature ID (default,
                        suitable for Ensembl GTF files: gene_id)
ADD REPLYlink modified 6 months ago • written 6 months ago by Sej Modha4.1k

i downloaded this reference and gtf file from ensemble reference genome database. still got the same error with the upper mentioned command too few arguments. I previously tried different parameters but they did not worked that time too. so i went with the default settings

ADD REPLYlink written 6 months ago by Shahzad10
1

Stop changing the ordering:

htseq-count -f bam -r pos -o SRR7042071.txt SRR7042071.bam  Sor.gtf

In many programs you can't intermingle options and positional arguments.

ADD REPLYlink written 6 months ago by Devon Ryan89k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 695 users visited in the last hour