Question: How To Handle 'Mate Mapped To Different Chr' In Rna-Seq Data Analysis ?
0
gravatar for geek_y
5.3 years ago by
geek_y9.4k
Barcelona/CRG/London/Imperial
geek_y9.4k wrote:

I have aligned my paired-end reads to Hg19 build from Ensemble using tophat2. Tophat 'align_stats.txt' shows that 96% overall alignment. The command I used is

tophat2 -o OutDir -p 25 -G Homo_sapiens.GRCh37.70.gtf HG_19_GRCh37 CCGTCC_R1_001.fastq CCGTCC_R2_001 .fastq

A part of my GTF file looks like:

1       protein_coding  CDS     35721   35736   .       -       0        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "1"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; protein_id "ENSP00000409362";

1       protein_coding  start_codon     35734   35736   .       -       0        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "1"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001";

1       protein_coding  exon    35277   35481   .       -       .        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "2"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; exon_id "ENSE00001669267";

1       protein_coding  CDS     35277   35481   .       -       2        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "2"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; protein_id "ENSP00000409362";

1       protein_coding  exon    34554   35174   .       -       .        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "3"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; exon_id "ENSE00001727627";

1       protein_coding  CDS     35141   35174   .       -       1        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "3"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001"; protein_id "ENSP00000409362";

1       protein_coding  stop_codon      35138   35140   .       -       0        gene_id "ENSG00000237613"; transcript_id "ENST00000417324"; exon_number "3"; gene_name "FAM138A"; gene_biotype "protein_coding"; transcript_name "FAM138A-001";

The output of samtools flagstat accepted_hits.bam file looks like:

62109999 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
62109999 + 0 mapped (100.00%:-nan%)
57657067 + 0 paired in sequencing
28804927 + 0 read1
28852140 + 0 read2
4916 + 0 properly paired (0.01%:-nan%)
55780192 + 0 with itself and mate mapped
1876875 + 0 singletons (3.26%:-nan%)
53443292 + 0 with mate mapped to a different chr
26218346 + 0 with mate mapped to a different chr (mapQ>=5)

I am afraid how to go for read counts. The flag stats output says 53443292 reads with mate mapped to different cur. Does the read counting tools like bedttols multicov will take care of it ? or should I change any parameters in tophat2 command ?

tophat2 rna-seq • 2.1k views
ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by geek_y9.4k
0
gravatar for Pavel Senin
5.3 years ago by
Pavel Senin1.9k
Los Alamos, NM
Pavel Senin1.9k wrote:

I think that you might run it not as they specify:

Usage: tophat [options]* <genome_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2] 
When running TopHat with paired reads it is critical that the *_1 files an the *_2 files appear in separate comma-delimited lists, and that the order of the files in the two lists is the same.

i.e. it didn't recognize that your reads are paired, so the maximum insert size constraint was ignored, and so on.

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by Pavel Senin1.9k

The actual command I used is

tophat2 -o OutDir -p 25 -G Homo_sapiens.GRCh37.70.gtf HG_19_GRCh37 CCGTCC_R1_001.fastq CCGTCC_R2_001 .fastq

space seperated R1 and R2

ADD REPLYlink written 5.3 years ago by geek_y9.4k

you can sort your sam/bam file and use Picard's CollectInsertSizeMetrics.jar to estimate the insert size and see how it differs from what you specified (the tophat's default values in this case). i think that when you call flagstat samtools automatically infers reads pairing by their names - which is a bit confusing.

ADD REPLYlink written 5.3 years ago by Pavel Senin1.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1817 users visited in the last hour