I am trying to generate gene counts using featureCounts v2.0.1. The syntax I used is a simple one, as shown below:
featureCounts -T 2 -a MtrunA17r5.0-ANR-EGN-r1.7.gtf -F GTF -o A17_T00_1_BASIC1_fC.txt A17_T00_1.bam
The summary file reads as follows:
cat A17_T00_1_BASIC1_fC.txt.summary
Status A17_T00_1.bam
Assigned 13377184
Unassigned_Unmapped 0
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 1625691
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 334969
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 2112828
When I peeked into the actual counts file, I noticed some weird entries in column 1 instead of gene_IDs, as shown below. Only 49 out of 52724 lines contain such strange entries.
gene:MtrunA17Chr5g0436531 0
63135360-95f3-44e9-8358-c7975fb56306 92
gene:MtrunA17Chr5g0436551 0
.
.
gene:MtrunA17Chr6g0458631 0
64238042-c1c7-40e1-b07d-e5050f9b567e 0
gene:MtrunA17Chr6g0458661 0
.
.
gene:MtrunA17Chr7g0236431 0
25191532-34dd-4669-9d7b-a20c59e60855 1
gene:MtrunA17Chr7g0236441 0
So I went back to the BAM input file and attempted it's validation with Picard's ValidateSamFile, on a SLURM-based HPCC, as shown below. As visible, it gave me 1 ERROR, and also 49 warnings. Only warning #1 and warning #49 are shown below, rest being similar.
srun --partition=high --time=1:00:00 --mem=24000 java -jar picard.jar ValidateSamFile I=A17_T00_1.bam MODE=VERBOSE
srun: job 22816732 queued and waiting for resources
srun: job 22816732 has been allocated resources
[Thu Jun 11 20:16:35 PDT 2020] picard.sam.ValidateSamFile INPUT=A17_T00_1.bam MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Jun 11 20:16:35 PDT 2020] Executing as aksrao@c8-62 on Linux 4.15.0-99-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26; Picard version: 2.7.1-SNAPSHOT
ERROR: Read groups is empty
WARNING: Read name HWI-ST797:117:D091UACXX:4:2203:11139:169433, A record is missing a read group
WARNING: Record 1, Read name HWI-ST797:117:D091UACXX:4:2203:11139:169433, NM tag (nucleotide differences) is missing
.
.
WARNING: Record 49, Read name HWI-ST797:117:D091UACXX:4:1107:17611:90721, NM tag (nucleotide differences) is missing
WARNING: Read name HWI-ST797:117:D091UACXX:4:2102:6277:14310, A record is missing a read group
Maximum output of [100] errors reached.
[Thu Jun 11 20:16:35 PDT 2020] picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1012924416
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
srun: error: c8-62: task 0: Exited with exit code 1
Here are questions that I have for forum members. Thanks in advance!
1. Is the srun: error: c8-62: task 0: Exited with exit code 1
message indicative that PicardTools' ValidateSamFile
execution is not proper? From this GATK discussion board post, and the corresponding explanation for return codes, it does not appear exit code 1
is for PicardTools execution per se, but I'm not sure?!
2. Regardless of any SLURM or installation problems, how do I fix my BAM file so error and warnings in ValidateSamFile
do not recur, and the weird entries in featureCounts
gene counts also do not recur. Looks like they are both related, i.e. those 49 warnings in ValidateSamFile
for my BAM input, probably each correspond to the 49 weird entries in my featureCounts
gene counts output ?
edit: I suppose it should suffice to follow instructions at this Broad GitHub Picard page?
3. If it is relevant, my BAM file was generated by STAR
mapping, and I know my FASTQ input files were all OK, based on FastQValidator
results. An additional detail is that I ran pass 1 to collect new Splice Junctions for 100+ libraries, concatenated them all, and used them again in pass 2 to obtain final BAM file to use with featureCounts
.
edit - Should I look at the STDERR files captured for 1st and 2nd pass of my STAR mapping runs, for anything specific causing header errors in the BAM output files?
4. I've looked into only one of these 100+ BAM files, and only after pass 2 (as reported in this post). I've not looked at any other BAM files from pass 2, or any at all from pass 1. This is because I'm not sure if my picard.jar ValidateSamFile
is even executing OK.
Please let me know if I should provide any more information to help you help me. Thanks again.
It looks like your GTF file may be malformed. I would start checking that first.
AGAT
is a good toolkit for that. Can you post a couple of lines of that file? featureCounts can count on various features and then summarize on others (e.g. count exons and summarize on gene_id or gene_name). Depending on what you have in your file you may need to be explicit about this in your featureCounts command line.Using Picard can lead down a rabbit hole of read groups and other things. I doubt it is your alignment files that are the problem here.
Thanks for your response @genomax.
Per your suggestion, I tried to install
AGAT
on my local MacOSX and also UBUNTU-based HPCC remote server - install from source OR using Conda - neither gave me a successful install...So I had to try another validation tool = validate_gtf.pl included in the Eval package from the Brent Lab @ WUSTL. I have to take it's results with a pinch of salt because of this older BioStars post. On the flip side, this Perl script may indeed report certain errors in the GTF file, per some others' experiences (link). I did receive error messages from my validation run, as shown below:
However, the author of this GTF file, in the annotation group says this GTF file works perfectly well for him, when used with his BAM files, to report gene counts, using
featureCounts
! As a control experiment, I've requested a BAM file that works for him, withfeatureCounts
, so I can use it as a positive control for my ownfeatureCounts
runs, using this GTF file. It can be downloaded from this public link. Would it be too much to request you to check if this file is valid? :)Regardless of doubts about GTF file validity, I am now even more suspicious whether the 49 weird lines in the
featureCounts
counts output and the exact same number of 49 warnings in my BAM file reported by PicardTools'ValidateSamFile
, already found in my 1st post of this thread, indicate that my 2 pass STAR mapping protocol, adapted from this BioStars post by user caggtaagtat , might have an error somewhere - especially where I concatenate all the *SJ.out.tab files from pass 1, to use for pass 2 genome indexing and genome mapping... I think I'll run PicardTools'ValidateSamFile
on individual BAM files created during pass 1, while awaiting suggestion from you and re-validation by the GTF file author. Thanks again.Since you are a BBTools user I would like to point out that bbmap.sh/featureCounts results are concordant with STAR/salmon so if the two-pass alignment is causing you problems consider using bbmap.
What problems did you have with conda install of AGAT?
Can you post an example line of the annotation for one of the genes above and one that does not show any problems?