How to generate a combined read count txt file with header as file name
3.5 years ago
Bioinfonext ▴ 430

I have multiple txt files of RNA-seq read counts. Is it possible to combine them into a single txt file with each file name as a column header?

Each txt file holds read counts like this; the first column is the same in all files:

BGIOSGA000001   0
BGIOSGA000002   12
BGIOSGA000003   0
BGIOSGA000004   0
BGIOSGA000005   0
BGIOSGA000006   0
BGIOSGA000007   0
BGIOSGA000008   15


and the txt file names are like this:

Root_T3_S_R7_S56_L001.COUNT.txt
Leaf_T2_F_R5_S8_L001.COUNT.txt


so I want output like this:

                  Root_T3_S_R7_S56       Leaf_T2_F_R5_S8

BGIOSGA000001         0                           4
BGIOSGA000002        12                           0
BGIOSGA000003         0                           3
BGIOSGA000004         0                           2
BGIOSGA000005         0                           4


I will be thankful for your help.

Kind Regards, Bioinfonext

bash linux awk R • 3.8k views

You could have used featureCounts, which does this when you feed it multiple BAMs on the command line: featureCounts options BAM1 BAM2 BAM3. Provide them in the order you want them grouped so you don't need to rearrange columns afterwards.


Hi genomax,

I used HTSeq for read counting and I have about 60 read count txt files.

Thanks, Bioinfonext


Consider redoing the counts with featureCounts. You would be done with creating the count matrix in less time than it is going to take you to deal with 60 separate files :-)

echo -e '\tfile1\tfile2' && join -t $'\t' -1 1 -2 1 <(sort -t $'\t' -k1,1 file1.txt) <(sort -t $'\t' -k1,1 file2.txt)

Hi Pierre,

I have 60 read count txt files, so should I keep adding them all the way you have shown with two files?

Thanks, Bioinfonext

Works great

3.5 years ago

Something I wrote a while back (i.e., there is likely a better/more efficient approach ;) )

n=0
for i in *.txt
do
    echo $n
    # column name: the file name with the _L001 lane tag dropped
    name=$(echo "$i" | sed 's/_L001*//g')
    echo -e "ID\t$name" > "${i}_tmp"
    # drop the last line, keep the ID and count columns, sort by ID
    head -n -1 "$i" | cut -f 1,2 | sort -k1 >> "${i}_tmp"
    ((n++))
done
paste *_tmp > tmpOK
rm -f *_tmp
# keep the ID column once, plus every count column (2, 4, 6, ...)
c="-f1"
for j in $(seq $n)
do
    d=$(expr 2 \* $j)
    c=$c,$d
done
echo $c
cut $c tmpOK > final_file
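
For the question above about extending the two-file join to 60 files, one option is to apply join in a loop, adding one count column per file. This is only a minimal sketch: the function name merge_counts is made up for illustration, and it assumes every file is tab-separated, lists the same gene IDs, and has a name ending in _L001.COUNT.txt as in the question. The sample counts are illustrative.

```shell
# merge_counts FILE... : print a tab-separated count matrix to stdout.
# Assumes each FILE has two tab-separated columns (gene ID, count)
# and that all files list the same gene IDs.
merge_counts() {
    tab=$(printf '\t')
    work=$(mktemp -d)
    header="ID"
    # Seed the matrix with the sorted gene IDs of the first file.
    cut -f1 "$1" | LC_ALL=C sort > "$work/matrix"
    for f in "$@"; do
        # Column name: file name minus directory and the _L001.COUNT.txt suffix.
        header="$header$tab$(basename "$f" _L001.COUNT.txt)"
        LC_ALL=C sort -t "$tab" -k1,1 "$f" > "$work/sorted"
        # Append the count column of this file to the growing matrix.
        LC_ALL=C join -t "$tab" "$work/matrix" "$work/sorted" > "$work/next"
        mv "$work/next" "$work/matrix"
    done
    printf '%s\n' "$header"
    cat "$work/matrix"
    rm -rf "$work"
}

# Small demo with two made-up files named like the ones in the question.
demo_dir=$(mktemp -d)
printf 'BGIOSGA000001\t0\nBGIOSGA000002\t12\n' > "$demo_dir/Root_T3_S_R7_S56_L001.COUNT.txt"
printf 'BGIOSGA000001\t4\nBGIOSGA000002\t0\n'  > "$demo_dir/Leaf_T2_F_R5_S8_L001.COUNT.txt"
merge_counts "$demo_dir"/*.COUNT.txt
```

With the real data this would be called as merge_counts *.COUNT.txt > final_file. Because join only keeps IDs present in both inputs, a file with missing genes would silently shrink the matrix, so it is worth checking the row count of the result.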


thanks Lieven, your script works perfectly.

Thanks Again bioinfonext


After spending 4 hrs trying to combine the files with no luck, this finally worked. Thank you lieven.sterck.


I am getting "head: illegal line count -- -1", and the output contains only the column names (the file names). Passing a positive value to head -n gives only those n rows. Is there a workaround to get all the rows?


You could try tail (look up the syntax for it); tail -n+2 (off the top of my head).

Alternatively, you can also get there using sed (sed '1d').
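
A note on the two suggestions above: a negative count for head -n is a GNU extension, which is the likely reason a BSD/macOS head rejects it. Also, tail -n +2 and sed '1d' drop the first line, whereas head -n -1 in the script drops the last one. A portable way to drop the last line is sed '$d'. A small sketch on made-up data:

```shell
# Sample file: three count lines plus a trailing summary line (made-up data).
sample=$(mktemp)
printf 'geneA\t1\ngeneB\t2\ngeneC\t3\nsummary\t6\n' > "$sample"

# Drop the LAST line (what `head -n -1` does with GNU head).
without_last=$(sed '$d' "$sample")

# Drop the FIRST line (what `tail -n +2` and `sed '1d'` do).
without_first=$(tail -n +2 "$sample")

printf '%s\n' "$without_last"
```

In the script, this would mean replacing head -n -1 with sed '$d', assuming the intent is still to skip the final line of each count file.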


Thanks @lieven.sterck! But since my files have uneven numbers of rows, it all got messed up.


That should not happen, as it only makes sense to build a count matrix from mappings against the same reference (I can't think of a case where this would be otherwise).


Agreed, that should not happen. But the data I was looking at from GEO (Gene Expression Omnibus) has raw count files from an experiment, and surprisingly one of the replicates from their conditions has fewer rows (gene_id). Since it would be much easier not to redo the download and alignment, I was trying to assemble their raw counts into a combined matrix for analysis.


"I was trying to assemble their raw counts into a combined matrix for analysis."

That is exactly what you should do :)

"one of the replicates from their conditions has fewer rows (gene_id)"

If it's only one file, aren't you better off 'fixing' that one (adding a bogus gene_id line or such)?


Yes, it's only one file, and I agree with your solution. Actually, I was thinking of adding the missing gene_ids to the gene_id column and putting either 0 or a blank for the count. Which would you suggest as more reasonable?


I think you can do either one of them ... zero might work better at first sight though

However, I would personally not really trust that data :/ Is there any mention of why there are fewer lines in that file? Perhaps the file was truncated (when uploading or downloading it)?
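
If the zero-fill option is chosen, it does not have to be done by hand. The sketch below merges over the union of gene IDs and writes 0 wherever a file has no line for a gene. The function name merge_fill0 and the sample files are made up for illustration, and it assumes two-column, tab-separated inputs with a .txt extension; whether a filled-in 0 is appropriate for a possibly truncated file is a separate question.

```shell
# merge_fill0 FILE... : merge two-column, tab-separated count files over the
# union of their gene IDs, printing 0 where a file has no line for a gene.
merge_fill0() {
    awk -F '\t' -v OFS='\t' '
        FNR == 1 {
            # First line of a new input file: register a column for it.
            nfiles++
            label = FILENAME
            sub(/.*\//, "", label)      # drop the directory part
            sub(/\.txt$/, "", label)    # drop the .txt extension
            name[nfiles] = label
        }
        {
            # Remember genes in first-seen order and store this count.
            if (!($1 in seen)) { seen[$1] = 1; order[++ngenes] = $1 }
            count[$1, nfiles] = $2
        }
        END {
            line = "ID"
            for (j = 1; j <= nfiles; j++) line = line OFS name[j]
            print line
            for (i = 1; i <= ngenes; i++) {
                gene = order[i]
                line = gene
                for (j = 1; j <= nfiles; j++) {
                    value = ((gene, j) in count) ? count[gene, j] : 0
                    line = line OFS value
                }
                print line
            }
        }
    ' "$@"
}

# Demo: rep2.txt has no line for gene2 (made-up data).
fill_dir=$(mktemp -d)
printf 'gene1\t5\ngene2\t7\ngene3\t1\n' > "$fill_dir/rep1.txt"
printf 'gene1\t2\ngene3\t4\n'           > "$fill_dir/rep2.txt"
merge_fill0 "$fill_dir/rep1.txt" "$fill_dir/rep2.txt"
```

Rows come out in the order the gene IDs are first seen, so passing the most complete file first keeps the original gene order. An empty input file would not get a column in this sketch.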


Agreed, the file might have got messed up during upload, or something else happened that only they would know about. I could not find any explanation for the truncation or the missing rows in their write-up.