Question: How to sort multiple sam file for HTseq read count
0
gravatar for Bioinfonext
2.5 years ago by
Bioinfonext230
Korea
Bioinfonext230 wrote:

Which tools should I use to sort sam file by read name for raw read count using HTseq and How I can sort multiple sam file.

Picard SortSam or samtools sort

rna-seq • 1.8k views
ADD COMMENTlink modified 2.5 years ago by Arup Ghosh2.6k • written 2.5 years ago by Bioinfonext230
2
for i in `ls -1 *.sam | sed 's/.sam//'`; do echo samtools sort -O sam -o $i\_sorted.sam $i.sam >> cmd_file

you may be able to convert to bam format while sorting

for i in `ls -1 *.sam | sed 's/.sam//'`; do echo samtools sort -O bam -o $i\_sorted.bam $i.sam >> cmd_file
ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax87k

Do we also need to put -n in samtools sort in above loop.

because HTseq needs sam file which is sorted by read name. I am using samtools version samtools/1.4

Thanks

ADD REPLYlink written 2.5 years ago by Bioinfonext230

If you need the file sorted by read headers then yes.

ADD REPLYlink written 2.5 years ago by genomax87k

Hi genomax:

I just found by running single command that I also need to specify -T in samtools to give names to tempory files:

samtools sort -O sam -n -T sample.sort -o sample.sort.sam sample.sam

Could you insert this command in the above loop to sort mutiple sam files.

And could you also explain the loop in breaks so that I can understand what exactly happening:

like:

what this command does: ls -1 *.sam and then what is the role of this:

sed 's/.sam//'

further loop structure I can understand, it going to every input file and generation the output files.

ADD REPLYlink written 2.5 years ago by Bioinfonext230
1
  1. ls command is listing *.sam files one per line and piping/sending that to sed command. sed command is removing (technically replacing .sam with nothing) the .sam extension so you can use the filename (which generally would have sample ID) for parts of the samtools command where you need to use different extensions or add a word to indicate the resulting file is sorted. While unix does not need this/nor does samtools, I find it useful to have that information in the file name so at a glance one can tell if a file is trimmed (e.g.file_R1_trim.fq.gz) or trimmed and sorted (e.g.file_trim_sort.bam).
  2. You can add -T $i.sort to the command loop above, if you need to name the temp files with the name of file.
ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax87k

Hi Genomax,

I want to run above loop on multiple amplicon sequencing pair-end read fastq files to generate cmd file:

I put bash script in same data folder and location for fastq-join is:

/mnt/scratch/users/3052/Amplicon_data/ITS_analysis/soil_ITS/ExpressionAnalysis-ea-utils-bd148d4/clipper/fastq-join

Pair end File name are:

Soil-7_S32_L001_R1_001.fastq.gz Soil-7_S32_L001_R2_001.fastq.gz

Soil-8_S32_L001_R1_001.fastq.gz Soil-8_S32_L001_R2_001.fastq.gz

Soil-9_S42_L001_R1_001.fastq.gz Soil-9_S42_L001_R2_001.fastq.gz

!/bin/bash

for i in ls -1 *.fastq | sed 's/.sam//'; do echo /mnt/scratch/users/3052/Amplicon_data/ITS_analysis/soil_ITS/ExpressionAnalysis-ea-utils-bd148d4/clipper/fastq-join -m 200 -p 1 -o $i_joined.fastq $i.fastq >> cmd_file

This bash script is showing error:

: command not found : command not found ./loop.sh: line 7: syntax error: unexpected end of file

ADD REPLYlink modified 19 months ago • written 19 months ago by Bioinfonext230
1

Or you could use featureCounts ( http://bioinf.wehi.edu.au/featureCounts/ ) and let it do the sorting as needed.

It is easier to convert sam to bam to do further analysis.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax87k
3
gravatar for Devon Ryan
2.5 years ago by
Devon Ryan96k
Freiburg, Germany
Devon Ryan96k wrote:

Recent versions of HTseq-count don't require files to be query sorted. If, for some reason, you really do want to bother, either tool will work fine. As noted in the comments, featureCounts is much faster. Note also that HTseq-count can use BAM files too, so you don't need to actually save SAM files.

ADD COMMENTlink written 2.5 years ago by Devon Ryan96k
1
gravatar for Arup Ghosh
2.5 years ago by
Arup Ghosh2.6k
India
Arup Ghosh2.6k wrote:

HTSeq-count of FeatureCounts does not require sorted files. You can directly generate a matrix file containing all the read counts form different aligned files in one go using featureCounts.

ADD COMMENTlink modified 11 months ago • written 2.5 years ago by Arup Ghosh2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1610 users visited in the last hour