Question: How to format the featureCounts output?
sunnykevin9750 wrote:

Hi,

I extracted gene count information using featureCounts. The output seems a little awkward to handle. How do I get it into a proper format? The "Chr" name is repeated in other columns. Any suggestions?

Geneid  Chr Start   End Strand  Length  13.bam                              
LOC117736   NC_046966.1 NC_046966.1 18662   18662   46417   46417   -   -   27756   910             
naf1    NC_046966.1 NC_046966.1 51440   51440   57561   57359   -   -   6122    558             
vegfc   NC_046966.1 82235   184669  +   102435  127                             
LOC1177 NC_046966.1 186959  189305  -   2347    4                               
tenm3   NC_046966.1 1017035 1114474 +   97440   471                             
dctd    NC_046966.1 NC_046966.1 NC_046966.1 1117869 1121679 1121679 1133921 1134718 1133908 -   -   -   16850   478
cep44   NC_046966.1 NC_046966.1 NC_046966.1 1136953 1136953 1136953 1154202 1154202 1154202 -   -   -   17250   94
fbxo8   NC_046966.1 NC_046966.1 NC_046966.1 NC_046966.1 NC_046966.1 NC_046966.1 1153392 1153444 1154267 1154282 1154407 1154957 1165631 1165631
hand2   NC_046966.1 NC_046966.1 1250592 1250592 1256478 1256478 +   +   5887    0
written 15 days ago by sunnykevin9750

This is the right format. It is a tab-separated file: Geneid in column 1, followed by a few annotation columns (generally 5; you seem to have 3, perhaps because you used a SAF-format annotation?), followed by one column of counts per sample.

Tip: You should supply all BAM files in the same featureCounts command to get the complete matrix you need.

modified 15 days ago • written 15 days ago by genomax87k
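
For anyone who wants to load such a table in a script, here is a minimal Python sketch. It assumes the file is unmodified featureCounts output, that is, a leading comment line starting with `#`, then a tab-separated header and one row per gene. The function name `read_counts` and the comment line text are illustrative and not taken from this thread; the two gene rows reuse values from the question.

```python
import csv
import io

def read_counts(handle):
    """Parse tab-separated featureCounts-style text into a list of dicts."""
    # Skip comment lines (the program/command line starts with '#').
    rows = (line for line in handle if not line.startswith("#"))
    # Split on tabs only, so semicolons inside a field are left untouched.
    return list(csv.DictReader(rows, delimiter="\t"))

# Hypothetical two-gene input built from values shown in the question.
example = io.StringIO(
    "# Program:featureCounts (illustrative comment line)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\t13.bam\n"
    "vegfc\tNC_046966.1\t82235\t184669\t+\t102435\t127\n"
    "tenm3\tNC_046966.1\t1017035\t1114474\t+\t97440\t471\n"
)
table = read_counts(example)
counts = {row["Geneid"]: int(row["13.bam"]) for row in table}
print(counts)
```

With real data you would pass an open file handle instead of the `io.StringIO` object.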

I totally agree; I use the same command for all the BAM files. Then why is the chr name "NC_046966.1" repeated in other columns more than once? The table seems to be unstructured.

written 15 days ago by sunnykevin9750

If you look at the file using `less -S`, you will see that there is a definite tab-separated structure to the file.

written 15 days ago by genomax87k
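
The same check can be scripted rather than eyeballed. The sketch below counts tab-separated fields per line; in a well-formed table every line should report the same number. The helper name `field_count_profile` is made up for illustration, and the sample rows are the single-location genes from the question, written with explicit tabs.

```python
from collections import Counter

def field_count_profile(lines):
    """Map 'number of tab-separated fields' to 'number of lines with that many'."""
    profile = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # ignore comment and blank lines
        profile[len(line.rstrip("\n").split("\t"))] += 1
    return profile

# Header plus three rows from the question, assuming tabs between fields.
sample = [
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\t13.bam\n",
    "vegfc\tNC_046966.1\t82235\t184669\t+\t102435\t127\n",
    "LOC1177\tNC_046966.1\t186959\t189305\t-\t2347\t4\n",
    "tenm3\tNC_046966.1\t1017035\t1114474\t+\t97440\t471\n",
]
profile = field_count_profile(sample)
print(profile)
```

If the profile contains more than one key, some lines have a different number of columns and the file has probably been altered after featureCounts wrote it.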

The output for the lines with repeating sequence names seems incorrect.

As I recall, the expected output of featureCounts is simple and straightforward: the chromosome name is listed only once, and it should not require the type of cleanup you seem to need.

Is it possible that this file was created in a different way?

modified 15 days ago • written 15 days ago by Istvan Albert ♦♦ 84k

I used the same standard command. Why is the chr name repeated in other columns? Any suggestions?

written 15 days ago by sunnykevin9750

Can you post the code you used and which SAF or GTF you used? Did you make the SAF yourself?

written 15 days ago by science_lizard0
genomax87k wrote:

If a gene has multiple exons, then the chr name and start/end positions of each exon are added to the annotation columns when counting at the exon level and then summarizing at the gene level. That is why there is one chr name entry for each start and end.

modified 15 days ago • written 15 days ago by genomax87k

Now it makes sense. Thanks.

written 15 days ago by sunnykevin9750

The manual states:

"When counting reads to meta-features (eg. genes) columns ‘Chr’, ‘Start’, ‘End’ and ‘Strand’ may each contain multiple values (separated by semi-colons), which correspond to individual features included in the same meta-feature."

Note that it says semicolons are used to separate the features!

I have never seen output like the one the original poster shows. The format shown in the original post, with a variable number of columns, makes processing it with column-oriented command-line tools nearly impossible. I find it very unlikely that featureCounts would work that way.

I still think the output is post-processed with some other method.

modified 15 days ago • written 15 days ago by Istvan Albert ♦♦ 84k
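
Given the semicolon convention quoted from the manual, the multi-valued annotation fields can be expanded into one record per feature. This is only a sketch: `split_features` is a hypothetical helper, and the `dctd` values are taken from the question and re-joined with semicolons on the assumption that this is how they appeared in the original file.

```python
def split_features(chr_field, start_field, end_field, strand_field):
    """Expand semicolon-joined annotation fields into one tuple per feature."""
    chroms = chr_field.split(";")
    starts = [int(s) for s in start_field.split(";")]
    ends = [int(e) for e in end_field.split(";")]
    strands = strand_field.split(";")
    # Each field should list exactly one value per feature.
    if not (len(chroms) == len(starts) == len(ends) == len(strands)):
        raise ValueError("annotation fields have differing numbers of values")
    return list(zip(chroms, starts, ends, strands))

# dctd from the question, assumed to have been semicolon-joined originally.
features = split_features(
    "NC_046966.1;NC_046966.1;NC_046966.1",
    "1117869;1121679;1121679",
    "1133921;1134718;1133908",
    "-;-;-",
)
for feature in features:
    print(feature)
```

Each tuple then describes a single feature (chromosome, start, end, strand) of the meta-feature.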

I concur that the output as posted is likely post-processed in some way. Perhaps the OP just split the data using both tabs and semicolons as delimiters.

written 15 days ago by genomax87k
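
A small Python sketch of that hypothesis: splitting one line on tabs alone versus on tabs and semicolons together. The `naf1` line is reconstructed from the question under the assumption that its repeated values were originally semicolon-joined; it is not copied from an actual featureCounts file.

```python
import re

# Reconstructed line: semicolons inside the Chr/Start/End/Strand fields (assumed).
line = "naf1\tNC_046966.1;NC_046966.1\t51440;51440\t57561;57359\t-;-\t6122\t558"

by_tab = line.split("\t")            # intended parsing: 7 columns
by_both = re.split(r"[\t;]", line)   # tabs AND semicolons treated as delimiters

print(len(by_tab), by_tab)
print(len(by_both), by_both)
```

Splitting on tabs alone gives the 7 columns named in the header, while splitting on both delimiters gives 11 fields in the same order as the `naf1` row in the question. A spreadsheet or CSV importer configured with both delimiters would produce the same effect.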

The problem was with my CSV file. I understand where it went wrong. Thanks for the suggestions.

written 14 days ago by sunnykevin9750