Question: How to remove lines/headers from a SAM file after combined from different lanes
gravatar for mudithekanayake
8 months ago by
mudithekanayake10 wrote:


I'm trying to call peaks using MACS2. Now I have merged 4 SAM files from different lanes into one. But it contains extra lines because of the merge step. Is there any simple way to remove the unnecessary lines? First few lines of the data looks like this. I need to remove lines 24, 25, 26 (Last 3 @PG lines before the start of the sequence lines). Files are very large.

@HD     VN:1.0  SO:unsorted
@SQ     SN:10   LN:130694993
@SQ     SN:11   LN:122082543
@SQ     SN:12   LN:120129022
@SQ     SN:13   LN:120421639
@SQ     SN:14   LN:124902244
@SQ     SN:15   LN:104043685
@SQ     SN:16   LN:98207768
@SQ     SN:17   LN:94987271
@SQ     SN:18   LN:90702639
@SQ     SN:19   LN:61431566
@SQ     SN:1    LN:195471971
@SQ     SN:2    LN:182113224
@SQ     SN:3    LN:160039680
@SQ     SN:4    LN:156508116
@SQ     SN:5    LN:151834684
@SQ     SN:6    LN:149736546
@SQ     SN:7    LN:145441459
@SQ     SN:8    LN:129401213
@SQ     SN:9    LN:124595110
@SQ     SN:X    LN:171031299
@SQ     SN:Y    LN:91744698
@PG     ID:bowtie2      PN:bowtie2      VN:      CL:"/opt/rit/spack-app/linux-rhel7-x86_64/gcc-4.8.5/bowtie2- --wrapper basic-0 -x /work/LAS/geetu-lab/hhvu/index_files/mm10/bowtie2-ensembl/mm10 -S /work/LAS/geetu-lab/hhvu/project1/chip-seq/1_bowtie2/I10_S64_L005.bam -U I10_S64_L005_R1_001.fastq"
@PG     ID:bowtie2-2F3DB176     PN:bowtie2      VN:      CL:"/opt/rit/spack-app/linux-rhel7-x86_64/gcc-4.8.5/bowtie2- --wrapper basic-0 -x /work/LAS/geetu-lab/hhvu/index_files/mm10/bowtie2-ensembl/mm10 -S /work/LAS/geetu-lab/hhvu/project1/chip-seq/1_bowtie2/I10_S64_L006.bam -U I10_S64_L006_R1_001.fastq"
@PG     ID:bowtie2-54BA705F     PN:bowtie2      VN:      CL:"/opt/rit/spack-app/linux-rhel7-x86_64/gcc-4.8.5/bowtie2- --wrapper basic-0 -x /work/LAS/geetu-lab/hhvu/index_files/mm10/bowtie2-ensembl/mm10 -S /work/LAS/geetu-lab/hhvu/project1/chip-seq/1_bowtie2/I10_S64_L007.bam -U I10_S64_L007_R1_001.fastq"
@PG     ID:bowtie2-31E64087     PN:bowtie2      VN:      CL:"/opt/rit/spack-app/linux-rhel7-x86_64/gcc-4.8.5/bowtie2- --wrapper basic-0 -x /work/LAS/geetu-lab/hhvu/index_files/mm10/bowtie2-ensembl/mm10 -S /work/LAS/geetu-lab/hhvu/project1/chip-seq/1_bowtie2/I10_S64_L008.bam -U I10_S64_L008_R1_001.fastq"
J00102:65:HF3WFBBXY:6:1101:2777:1806    16      12      77384783        1       50M     *       0       0       TCCTTTCCCAATCTGTTGGTGGTCTTTTTGTCTTACTGACGGTGTCTTTT      FFJFF7JJJFAJJJFJJJFA-JJJJJJJJFFJJJJJJJJJJJJJJFAFAA      AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:50 YT:Z:UU
J00102:65:HF3WFBBXY:8:1101:2341:1965    0       13      73416403        40      50M     *       0       0       GAGAAGGGAAGGGAACAATTGGAAGGGGGATTTGTGAGATGACGATTGGG      AAF<AFJ<7FFJJJJFFJJJJJJ<JJJJJA<FFJF-<A7<-F-<7-<A-A      AS:i:-6 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:40C1A7     YT:Z:UU
rna-seq chip-seq R genome • 254 views
ADD COMMENTlink written 8 months ago by mudithekanayake10

Did you merge the SAM files with samtools merge? And do you really need to remove these @PG lines? Do they cause any problem downstream?

ADD REPLYlink written 8 months ago by h.mon31k

Actually I don't have much idea about that. I was told to do so. Yes, I used samtools merge.

ADD REPLYlink written 8 months ago by mudithekanayake10

There is absolutely no need to do that, macs does not care about any of these PF header lines. In fact it is good to keep them. That allows tracking back what was done with the single files prior to the merge.

ADD REPLYlink written 8 months ago by ATpoint40k

I think you are correct. I called peaks using that file and I did not get any errors. But when I tried to visualize it using UCSC Genome browser I got a error that it does not recognize the first line. So I thought may be that was the issue. My narrowPeak file looks like this. Do you have any suggestion what to do?

1   4491713 4491913 testONE_peak_1  37  .   2.86003 3.72580 1.66717 80
1   4492782 4493153 testONE_peak_2  27  .   2.22933 2.71107 0.82442 242
1   4571554 4572026 testONE_peak_3  72  .   4.16005 7.21000 4.85200 282
1   4739486 4739879 testONE_peak_4  64  .   3.63128 6.42948 4.14394 275
1   4784848 4785061 testONE_peak_5  31  .   2.24218 3.19160 1.21569 133
ADD REPLYlink written 8 months ago by mudithekanayake10

These are first few lines.

ADD REPLYlink written 8 months ago by mudithekanayake10

I guarantee you that these lines have no impact on anything. @PR header lines are always ignored.

ADD REPLYlink written 8 months ago by ATpoint40k
  Error File 'testONE_peaks.narrowPeak' - Unrecognized format line 1 of file: 1 4491713 4491913 testONE_peak_1  37  .   2.86003 3.72580 1.66717 80 (note: chrom names are case sensitive, e.g.: correct: 'chr1', incorrect: 'Chr1', incorrect: '1')

When I tried to visualize in UCSC Genome browser, I got above error. So I added chr to the chromosome column instead of 1. Then I'm getting following error.

 Error File 'testONE_peaks_NEW2.narrowPeak' - Error line 1 of custom track: thickStart out of range (chromStart to chromEnd, or 0 if no CDS)

Do you know what to do?

ADD REPLYlink modified 8 months ago • written 8 months ago by mudithekanayake10

UCSC browser is interpreting the file as being in the BED format, but Macs2 produces a file in narrowPeak format. The first 6 columns of both formats are the same, but the additional 4 columns that Macs2 produces are interpreted differently in a standard BED file. You can either tell UCSC that it is in narrowPeak format (I think I answered one of your questions a few days ago with instructions on how to do this), or just turn it into a 6-column BED file using the cut command in UNIX (cut -f1,2,3,4,5,6 testOne_peaks_NEW2.narrowPeak > testOne_peaks_NEW2.bed).

ADD REPLYlink modified 8 months ago • written 8 months ago by colin.kern930
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 967 users visited in the last hour