How to format the featureCounts output?
10 months ago
sunnykevin97 ▴ 420

Hi,

I extracted gene counts using featureCounts. The output seems a little awkward to handle. How do I convert it into a proper format? The "Chr" name is repeated in other columns. Any suggestions?

Geneid  Chr Start   End Strand  Length  13.bam
LOC117736   NC_046966.1 NC_046966.1 18662   18662   46417   46417   -   -   27756   910
naf1    NC_046966.1 NC_046966.1 51440   51440   57561   57359   -   -   6122    558
vegfc   NC_046966.1 82235   184669  +   102435  127
LOC1177 NC_046966.1 186959  189305  -   2347    4
tenm3   NC_046966.1 1017035 1114474 +   97440   471
dctd    NC_046966.1 NC_046966.1 NC_046966.1 1117869 1121679 1121679 1133921 1134718 1133908 -   -   -   16850   478
cep44   NC_046966.1 NC_046966.1 NC_046966.1 1136953 1136953 1136953 1154202 1154202 1154202 -   -   -   17250   94
fbxo8   NC_046966.1 NC_046966.1 NC_046966.1 NC_046966.1 NC_046966.1 NC_046966.1 1153392 1153444 1154267 1154282 1154407 1154957 1165631 1165631
hand2   NC_046966.1 NC_046966.1 1250592 1250592 1256478 1256478 +   +   5887    0

rna RNA-Seq • 767 views

This is the right format. It is a tab-separated file: GeneID in column 1, followed by a few columns of annotation (generally 5; you seem to have 3, since you used SAF format annotation?), followed by the samples in columns with counts.
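To illustrate the layout described above, here is a minimal Python sketch that reads a featureCounts-style table using only the standard library. The embedded table is a hypothetical miniature built from the single-exon rows in the question, the header comment line is illustrative, and the function name `read_counts` is made up for this example. It assumes the `Geneid, Chr, Start, End, Strand, Length` header shown in the question, so counts start at the seventh column.

```python
import csv
import io

# Hypothetical miniature table (tab-separated). The two data rows come from
# the question; the '#' comment line is only illustrative.
EXAMPLE = (
    "# Program:featureCounts (illustrative header comment)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\t13.bam\n"
    "vegfc\tNC_046966.1\t82235\t184669\t+\t102435\t127\n"
    "tenm3\tNC_046966.1\t1017035\t1114474\t+\t97440\t471\n"
)


def read_counts(handle):
    """Return (sample_names, {gene_id: [counts]}) from a featureCounts-style table."""
    # Skip comment lines that start with '#'.
    lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(lines, delimiter="\t")
    header = next(rows)
    # Assumption: six leading columns (Geneid, Chr, Start, End, Strand, Length).
    samples = header[6:]
    counts = {row[0]: [int(x) for x in row[6:]] for row in rows if row}
    return samples, counts


samples, counts = read_counts(io.StringIO(EXAMPLE))
print(samples)
print(counts)
```

With a real file you would pass an open file handle instead of the `io.StringIO` object.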

Tip: You should supply all BAM files in the same featureCounts command to get the complete matrix you need.


I totally agree; I used the same command for all the BAM files. But then why is the chr name "NC_046966.1" repeated in other columns more than once? The table seems to be unstructured.


If you look at the file using less -S, you will see that there is a definite tab-separated structure to the file.


The output for the lines with repeating sequence names seems incorrect.

As I recall, the expected output of featureCounts is simple and straightforward: the chromosome name is listed only once, and it should not require the type of cleanup you seem to need.

Is it possible that this file was created in a different way?


I used the same standard command. Why is the chr name repeated in other columns? Any suggestions?


Can you post the code you used and which SAF or GTF you used? Did you make the SAF yourself?

10 months ago
GenoMax 103k

If a gene has multiple exons, then the chr name and start/end positions of each exon are added to the annotation columns when counting at the exon level and then summarizing at the gene level. That is why there are multiple chr name entries, one for each start and end.
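If one value per gene is preferred, the per-exon entries can be collapsed after the fact. The sketch below is one possible approach, not something featureCounts does itself. It assumes the per-exon values within a field are joined by semicolons, as the featureCounts documentation describes. The helper name `collapse_annotation` is hypothetical, and the example values are reconstructed from the naf1 row in the question.

```python
def collapse_annotation(chrom, start, end, strand, sep=";"):
    """Collapse per-exon annotation fields into one gene-level summary.

    Returns (unique chromosomes, smallest start, largest end, unique strands).
    """
    chroms = sorted(set(chrom.split(sep)))
    strands = sorted(set(strand.split(sep)))
    starts = [int(s) for s in start.split(sep)]
    ends = [int(e) for e in end.split(sep)]
    return sep.join(chroms), min(starts), max(ends), sep.join(strands)


# Assumed gene-level record with two exons (values taken from the naf1 row).
result = collapse_annotation(
    "NC_046966.1;NC_046966.1", "51440;51440", "57561;57359", "-;-"
)
print(result)
```

Taking the minimum start and maximum end gives the overall span of the gene. Note that the Length column reported by featureCounts is a separate value and is left untouched here.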


Now it makes sense. Thanks.


The manual states:

"When counting reads to meta-features (eg. genes) columns ‘Chr’, ‘Start’, ‘End’ and ‘Strand’ may each contain multiple values (separated by semi-colons), which correspond to individual features included in the same meta-feature."

Note how it states that a semicolon is used to separate features!

I have never seen output like the one the original poster shows. The format shown in the original post, with a variable number of columns, makes processing it with column-oriented command-line tools nearly impossible. I find it very unlikely that featureCounts would work that way.

I still think the output is post-processed with some other method.


I concur that output as posted is likely post-processed in some way. Perhaps OP just split data using tabs/semi-colons as delimiters.
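A small Python sketch of that hypothesis: the line below is an assumed reconstruction of what the naf1 record may have looked like before post-processing (tabs between columns, semicolons within a column). Splitting on tabs alone keeps the seven expected columns, while splitting on both tabs and semicolons yields the ragged eleven-field row shown in the question.

```python
import re

# Assumed original record for naf1, reconstructed from the question.
line = "naf1\tNC_046966.1;NC_046966.1\t51440;51440\t57561;57359\t-;-\t6122\t558"

# Correct: tab is the only column delimiter.
by_tab = line.split("\t")

# Problematic: treating semicolons as column delimiters too.
by_tab_or_semicolon = re.split(r"[\t;]", line)

print(len(by_tab), by_tab)
print(len(by_tab_or_semicolon), by_tab_or_semicolon)
```

Importing the file into a spreadsheet or CSV tool with both delimiters enabled would have the same effect as the second split.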


The problem was with my CSV file. I understand where it went wrong. Thanks for the suggestions.