Question: Alignment with STAR produces unique bam files, but identical output from bedtools multicov
0
gravatar for dec986
3.0 years ago by
dec986180
United States
dec986180 wrote:

Hello,

I have a strange technical problem where many different (and unique) alignment files from STAR somehow get identical output from bedtools multicov. I used the same procedure with Tophat and it worked fine:

I use the following command in STAR:

STAR --outSAMattributes All --runThreadN 8 --genomeDir /home/con/GENE_DATA/mm10 --readFilesIn $fastq1 $fastq2

the aligned bam files are different, sorted and indexed by samtools:

con@ubuntu:~/Embryo$ ll Epiblast/Male/HFD/?/4/sorted.bam
-rw-rwxr-- 1 con david 384127141 May  4 04:53 Epiblast/Male/HFD/1/4/sorted.bam\
-rw-rwxr-- 1 con david 333413416 May  4 00:43 Epiblast/Male/HFD/2/4/sorted.bam
-rw-rwxr-- 1 con david 575693742 May  4 02:28 Epiblast/Male/HFD/3/4/sorted.bam
-rw-rwxr-- 1 con david 464273054 May  4 05:57 Epiblast/Male/HFD/4/4/sorted.bam

and processing the aligned bam files from bedtools:

con@ubuntu:~/Embryo$ bedtools multicov -split -D -bams Epiblast/Male/HFD/1/4/sorted.bam -bed ~/GENE_DATA/mm10/mm10.bed > tmp1
con@ubuntu:~/Embryo$ bedtools multicov -split -D -bams Epiblast/Male/HFD/2/4/sorted.bam -bed ~/GENE_DATA/mm10/mm10.bed > tmp2
con@ubuntu:~/Embryo$ bedtools multicov -split -D -bams Epiblast/Male/HFD/3/4/sorted.bam -bed ~/GENE_DATA/mm10/mm10.bed > tmp3
con@ubuntu:~/Embryo$ bedtools multicov -split -D -bams Epiblast/Male/HFD/4/4/sorted.bam -bed ~/GENE_DATA/mm10/mm10.bed > tmp4

gives four identical files:

con@ubuntu:~/Embryo$ ll tmp?
-rw-r--r-- 1 con david 5270039 May  5 08:18 tmp1
-rw-r--r-- 1 con david 5270039 May  5 08:18 tmp2
-rw-r--r-- 1 con david 5270039 May  5 08:18 tmp3
-rw-r--r-- 1 con david 5270039 May  5 08:18 tmp4
con@ubuntu:~/Embryo$ diff tmp1 tmp2
con@ubuntu:~/Embryo$ diff tmp1 tmp3

con@ubuntu:~/Embryo$ diff tmp1 tmp4

Why is the output identical? How can I fix this?

star samtools bedtools • 1.2k views
ADD COMMENTlink modified 2.9 years ago • written 3.0 years ago by dec986180

I guess they are identical in size because they originate from the same bed file, but the numbers/counts are different?

ADD REPLYlink written 3.0 years ago by WouterDeCoster38k

is it because of the format of the bedfile?

what is the correct format of mm10.bed?

ADD REPLYlink written 3.0 years ago by dec986180

Could you check that files are different in content, but equal in size?

You could just check using less or head or tail or use

diff file1 file2
ADD REPLYlink written 3.0 years ago by WouterDeCoster38k

hi WouterDeCoster,

I did that, the files are exactly equal:

con@ubuntu:~/Embryo$ diff tmp1 tmp2 con@ubuntu:~/Embryo$ diff tmp1 tmp3 con@ubuntu:~/Embryo$ diff tmp1 tmp4

ADD REPLYlink written 3.0 years ago by dec986180

Every single gene is showing a count = 0, if this helps.

Why would bedtools multicov show a count of 0 for everything?

ADD REPLYlink written 2.9 years ago by dec986180
1

Common problem is that your bed file has 'chr' notation, while your bam file (and genome index) has integer notation, or vice versa. Could you check that?

ADD REPLYlink written 2.9 years ago by WouterDeCoster38k

Hi WouterDeCoster,

my genome reference file looks like this:

con@ubuntu:/mnt/data/con/Embryo/Epiblast/Female/HFD/4/1$ head /home/con/GENE_DATA/mm10/mm10.bed
chr1    3214481 3671498 NM_001011874    0   -   3216021 3671348 0   3   2487,200,947,   0,207220,456070,
chr1    4290845 4409241 NM_001195662    0   -   4292980 4409187 0   4   2167,172,636,72,    0,61064,61356,118324,
chr1    4343506 4360314 NM_011283   0   -   4344599 4352825 0   4   6585,172,636,115,   0,8403,8695,16693,
chr1    4490927 4497354 NM_001289466    0   -   4491715 4495155 0   4   1741,92,63,1064,    0,2844,4208,5363,
chr1    4490927 4497354 NM_011441   0   -   4491715 4493406 0   5   1741,367,92,807,1064,   0,2172,2844,4208,5363,
chr1    4490927 4497354 NM_001289464    0   -   4491715 4493406 0   4   1741,391,92,1064,   0,2172,2844,5363,
chr1    4490927 4497354 NM_001289465    0   -   4491715 4495155 0   4   1741,92,807,1064,   0,2844,4208,5363,

and the header of the bam file sorted.bam is

con@ubuntu:/mnt/data/con/Embryo/Epiblast/Female/HFD/4/1$ samtools view -H sorted.bam
@HD VN:1.4  SO:coordinate
@SQ SN:1    LN:195471971
@SQ SN:10   LN:130694993
@SQ SN:11   LN:122082543
@SQ SN:12   LN:120129022
@SQ SN:13   LN:120421639
@SQ SN:14   LN:124902244
@SQ SN:15   LN:104043685
@SQ SN:16   LN:98207768
@SQ SN:17   LN:94987271
@SQ SN:18   LN:90702639
@SQ SN:19   LN:61431566
@SQ SN:2    LN:182113224
@SQ SN:3    LN:160039680
@SQ SN:4    LN:156508116
@SQ SN:5    LN:151834684
@SQ SN:6    LN:149736546
@SQ SN:7    LN:145441459
@SQ SN:8    LN:129401213
@SQ SN:9    LN:124595110
ADD REPLYlink modified 2.9 years ago by genomax65k • written 2.9 years ago by dec986180
1

That's something we can fix easily ;-) Run the following on your gtf file:

sed 's/^chr//' yourfile.gtf > newfile.gtf

This will change the "chr1" to "1" etc.

ADD REPLYlink written 2.9 years ago by WouterDeCoster38k

I tried this, I'm still getting the empty bedtools file :(

ADD REPLYlink written 2.9 years ago by dec986180

Hate to ask but did you use "newfile.gtf" for the last step? Are there any errors or just an empty file?

ADD REPLYlink written 2.9 years ago by genomax65k

Wonder if this is happening because you are not providing all your BAM's with a single multicov command.

bedtools multicov, reports the count of alignments from multiple position-sorted and indexed BAM files that overlap intervals in a BED file.

Multicov expects multiple BAM input.

Use genomecov/coverage if you need independent coverage values.

ADD REPLYlink written 2.9 years ago by genomax65k

Hi genomax2,

on the bedtools manual page for multicov, http://bedtools.readthedocs.io/en/latest/content/tools/multicov.html

multicov works for single bam files.

Also, as you suggested, I tried the coverage option, but this produces output like

1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  332 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  333 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  334 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  335 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  336 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  337 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  338 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  339 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  340 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  341 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  342 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  343 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  344 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  345 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  346 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  347 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  348 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  349 0

which I can't use because it doesn't have the transcript information contained in mm10.bed.

I just want counts from bam file, I don't know what's wrong :(

ADD REPLYlink modified 2.9 years ago by genomax65k • written 2.9 years ago by dec986180
3

You want counts for what (genes or the features in your BED file)? You could use featureCounts (use a GTF file for MM10 if you want gene level counts or modify your BED file to a minimal GTF/SAF format) to get the counts.

ADD REPLYlink written 2.9 years ago by genomax65k
1
gravatar for dec986
2.9 years ago by
dec986180
United States
dec986180 wrote:

The answer here is that for some reason bedtools cannot process the bam file, but featureCounts can, this is the solution given by genomax2.

ADD COMMENTlink written 2.9 years ago by dec986180
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 979 users visited in the last hour