Question

Counting data with ht seq

0

Entering edit mode

3.3 years ago

aranyak111 • 0

I am trying to use a pipeline for RNA seq which has been used by one of my colleagues. I have got in total of 8 files. These files are 8 count files of RNA seq data. I am using less command

less wth1count.txt
less wth2count.txt
less wth3count.txt
less wth4count.txt
less 99h1count.txt
less 99h2count.txt
less 99h3count.txt
less 99h4count.txt

and to count the number of lines for each of the 8 files the command that I use is

wc -l wth1count.txt
wc -l wth2count.txt
wc -l wth3count.txt
wc -l wth4count.txt
wc -l 99h1count.txt
wc -l 99h2count.txt
wc -l 99h3count.txt
wc -l 99h4count.txt

My subsequent goal is to filter out reads that are lower than 5 and for that I am using python scripts

filtercounts.py

#!/usr/bin/python

#Load relevant modules

import sys

#Read in the HTseq gene count output files for each of the 8 samples

inputlist = [sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5], sys.argv[6], sys.argv[7], sys.argv[8]]  

##Determine the genes with sufficient coverage
#Set up a variable to hold genes with sufficient coverage in any of the 8 samples

genelist = []

#For each HTseq file

for i in range(len(inputlist)):
    #Open it
    inputfile = open(inputlist[i], 'r')

    #For each line
    for line in inputfile:
        #Determine the number of reads/transcripts for this gene
        splitline = line.strip().split('\t')
        #Does it pass the threshold?
        if int(splitline[1]) >= 5: 
            #If so, add it to list
            if not splitline[0] in genelist:
                genelist.append(splitline[0])
    #Close the file
    inputfile.close()

#Write out the list of sufficiently expressed genes
outputlist = open(sys.argv[9], 'w') 

for gene in genelist:
    outputlist.write(gene + '\n')

outputlist.close()

##Filter each of the HTseq files for only sufficiently expressed genes across our samples
#For each HTseq file
for i in range(len(inputlist)):
    #Open it
    inputfile = open(inputlist[i], 'r')
    #Open an output 'filtered' file
    outputfile = open(sys.argv[i+10], 'w')
    #For each line
    for line in inputfile:
        #Determine the gene
        splitline = line.strip().split('\t')
        #Is it in our list?
        if splitline[0] in genelist:
            #If so, write out the original line
            outputfile.write(line)
    #Close the files
    inputfile.close()
    outputfile.close()

Problem - When I am trying to execute the script I am getting the error "index list out of range"

I am also not sure where in the script I can feed the string of files as input?

Any help will be useful.

RNA-seq alignment • 1.1k views

ADD COMMENT • link updated 7 months ago by Ram 43k • written 3.3 years ago by aranyak111 • 0

0

Entering edit mode

Please invest some time into formatting your post properly. Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
code_formatting

Also, avoid blank lines between lines of code - that just makes code harder to read and adds to the character count unnecessaily. See:

for i in range(len(inputlist)):
    #Open it
    inputfile = open(inputlist[i], 'r')
    #Open an output 'filtered' file
    outputfile = open(sys.argv[i+10], 'w')

vs your double-spaced:

for i in range(len(inputlist)):

    #Open it

    inputfile = open(inputlist[i], 'r')

    #Open an output 'filtered' file

    outputfile = open(sys.argv[i+10], 'w')

ADD REPLY • link 3.3 years ago by Ram 43k

0

Entering edit mode

I am getting the error "index list out of range"

Please post the full error.

ADD REPLY • link 3.3 years ago by WouterDeCoster 47k

0

Entering edit mode

#!/usr/bin/python

#Load relevant modules
import sys

#Read in the HTseq gene count output files for each of the 8 samples
inputlist = [sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5], sys.argv[6], sys.argv[7], sys.argv[8]] 

#Determine the genes with sufficient coverage
#Set up a variable to hold genes with sufficient coverage in any of the 7 samples
genelist = []

#For each HTseq file
for i in range(len(inputlist)):
    #Open it
    inputfile = open(inputlist[i], 'r')
    #For each line
    for line in inputfile:
        #Determine the number of reads/transcripts for this gene
        splitline = line.strip().split('\t')
        #Does it pass the threshold?
        if int(splitline[1]) >= 5: 
            #If so, add it to list
            if not splitline[0] in genelist:
                genelist.append(splitline[0])
    #Close the file
    inputfile.close()

#Write out the list of sufficiently expressed genes
outputlist = open(sys.argv[8], 'w') 
for gene in genelist:
    outputlist.write(gene + '\n')
outputlist.close()

##Filter each of the HTseq files for only sufficiently expressed genes across our samples
#For each HTseq file
for i in range(len(inputlist)):
    #Open it
    inputfile = open(inputlist[i], 'r')
    #Open an output 'filtered' file
    outputfile = open(sys.argv[i+9], 'w')
    #For each line
    for line in inputfile:
        #Determine the gene
        splitline = line.strip().split('\t')
        #Is it in our list?
        if splitline[0] in genelist:
            #If so, write out the original line
            outputfile.write(line)
    #Close the files
    inputfile.close()
    outputfile.close()

The full traceback is

Traceback (most recent call last):
  File "/gpfs/ycga/project/nicoli/ag2646/99H50/filtercounts.py", line 9, in <module>
    inputlist = [sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5], sys.argv[6], sys.argv[7], sys.argv[8]]
IndexError: list index out of range

ADD REPLY • link updated 3.3 years ago by Ram 43k • written 3.3 years ago by aranyak111 • 0

0

Entering edit mode

I would highly recommend you learn to separate your code into functions. For example:

def htseq_remove_low_genes( file_in, file_out, threshold ):
  # output a new file with low genes filtered out
  with open(file_in,'r') as f, open(file_out,'w') as fout:
    for r in f:
      if int(r.strip.split('\t')[1]) > threshold):
        fout.write(r)
  return file_out

You can then just call this function whenever you need to.

ADD REPLY • link 3.3 years ago by jerry ▴ 130

0

Entering edit mode

Does this answer OP's question? If not, it should be a comment, not an answer. Please read the posts under the how-to tag to contribute to and use the forum better.

ADD REPLY • link 3.3 years ago by Ram 43k

0

Entering edit mode

Oh I see, understood - will use appropriately next time. Thanks for helping to moderate this forum!

ADD REPLY • link 3.3 years ago by jerry ▴ 130

0

Entering edit mode

No problem, I've moved you answer to a comment this time.

ADD REPLY • link 3.3 years ago by Ram 43k

score 1 · Answer 1 · 2020-12-15

If you want to use python, your life will be much easier if you use pandas to make data frames out of this.
You're going to use R in the next step anyway, so why not read it in (using DESeq2) with DESeqDatasetFromHTSeqCount and then:

.

KEEP = which(apply(counts(dds), 1, min) >= 5)
dds = dds[KEEP]

That is vastly simpler than what you're doing.