Question: (Closed) Script to extract
alim.hcu10 wrote:

Dear,

I have two files. One contains start and end positions. I want to use these start and end positions to extract information from another text file (where the position is in column 2).

File 1 format

CcLG11:226909-229893
CcLG11:243545-252363
CcLG11:468465-470135
CcLG11:599949-606369
CcLG11:702863-705567
CcLG11:732897-733190
CcLG11:777699-778472
CcLG11:836376-837089
CcLG11:863645-868932
CcLG11:885839-889335
CcLG11:894027-895799

File2 format

CcLG01  114 -   CG  0.000   2.00    0   2   0   0   0.000   0.658
CcLG01  136 -   CG  0.000   1.00    0   1   0   0   0.000   0.793
CcLG01  243 -   CG  0.000   1.00    0   1   0   0   0.000   0.793
CcLG01  1272    +   CG  0.000   1.00    0   1   1   1   0.000   0.793
CcLG01  1273    -   CG  1.000   1.00    1   1   1   1   0.207   1.000
CcLG01  1277    +   CG  1.000   1.00    1   1   1   1   0.207   1.000
CcLG01  1278    -   CG  1.000   1.00    1   1   1   1   0.207   1.000
CcLG01  1281    +   CG  1.000   1.00    1   1   1   1   0.207   1.000
CcLG01  1282    -   CG  1.000   1.00    1   1   1   1   0.207   1.000
CcLG01  1287    +   CG  1.000   1.00    1   1   1   1   0.207   1.000
CcLG01  1288    -   CG  1.000   1.00    1   1   1   1   0.207   1.000
CcLG01  1296    +   CG  1.000   1.00    1   1   0   0   0.207   1.000
CcLG01  1327    +   CG  1.000   3.00    3   3   10  10  0.438   1.000
CcLG01  1328    -   CG  1.000   12.00   12  12  3   3   0.757   1.000
CcLG01  1347    +   CG  0.792   7.58    6   8   36  38  0.438   0.949
CcLG01  1348    -   CG  1.000   38.00   38  38  8   8   0.908   1.000
CcLG01  1351    +   CG  0.891   6.74    6   8   32  38  0.513   0.984
CcLG01  1352    -   CG  1.000   28.50   32  38  6   8   0.881   1.000
CcLG01  1359    +   CG  1.000   8.00    8   8   38  38  0.676   1.000
script write • 1.1k views
ADD COMMENTlink modified 4.1 years ago by guardianpatch0 • written 4.2 years ago by alim.hcu10
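
The entries in File 1 follow a `chrom:start-end` layout. As a minimal sketch, assuming every line has exactly that shape, they could be split into fields in Python as shown here; the helper name `parse_region` is hypothetical and does not come from the thread.

```python
def parse_region(region):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    # rsplit on the last ':' in case a chromosome name itself contains one
    chrom, span = region.strip().rsplit(":", 1)
    start, end = span.split("-", 1)
    return chrom, int(start), int(end)

regions = [parse_region(r) for r in ["CcLG11:226909-229893", "CcLG11:243545-252363"]]
print(regions)
```

Converting the coordinates to integers up front makes later range comparisons numeric rather than string-based.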

Looks like people are too busy to spend time writing an appropriate title and making the description understandable.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by geek_y11k

Do you have any solution?

ADD REPLYlink written 4.2 years ago by alim.hcu10

Where is the "start and end position in column 2" in file 2?

ADD REPLYlink written 4.2 years ago by geek_y11k

In file 2, the start and end positions are the same; the ranges (start and end) come from file 1. For example, if file 1 contains

chr start end
CcLg01 100 800

then this range should be used to extract the information in file 2 (rows where column 2 lies in the range 100-800).

The output should be: CcLg01 100 800, the sum of column 7, and the sum of column 8.

Thank You

ADD REPLYlink written 4.2 years ago by alim.hcu10

I still don't understand where start and stop in file 2 is. Column 2 seems to only hold a single number (e.g. only start or only end), but not both.

ADD REPLYlink written 4.2 years ago by jonasmst330

Could you add the desired output based on this example? As of now, your question is unclear and you probably won't get a helpful answer. Spend some more time constructing your post, choose a more appropriate title, and you'll be more likely to get what you want.

Is there a link to a script for R or Perl?

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by WouterDeCoster45k

In file 2, column 2 is both the start and the end. For example, given the following in file 1:

File 1

Chr         start    end    
Cclg01      1        10

File 2

V1      V2  V3  V4     V5    V6  V7  V8  V9  V10    V11
CcLG01  1   CG  0.000  2.00  0   2   0   0   0.000  0.658
CcLG01  2   CG  0.000  1.00  0   1   0   0   0.000  0.793
CcLG01  3   CG  0.000  1.00  0   1   0   0   0.000  0.793
CcLG01  5   CG  0.000  1.00  0   1   1   1   0.000  0.793
CcLG01  8   CG  1.000  1.00  1   1   1   1   0.207  1.000
CcLG01  9   CG  1.000  1.00  1   1   1   1   0.207  1.000

My output

Chr     start  end  Sum Col(V6)  Sum Col(V7)
CcLg01  1      10   2            7
ADD REPLYlink modified 4.2 years ago by WouterDeCoster45k • written 4.2 years ago by alim.hcu10

In file 2, column 7 is the position of a specific nucleotide, and there is a range (start - end). I have to sum the values (col 7, col 8) of the nucleotides that fall in the range given in file 1.

So my output will be

Chr start end sum.col7 sum.col8
CcLG11 226909 229893 ? ?

So from 226909 to 229893 (range) in file 2.

ADD REPLYlink written 4.2 years ago by alim.hcu10

So in File 2, columns V1, V3, V4, V5, V8, V9, V10 and V11 are not required and are therefore just adding noise to this question? If so, remove those. We cannot guess what you want; don't make things harder than they already are.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by WouterDeCoster45k

File1

CcLG01    13883    0    4
CcLG01    13897    0    5
CcLG01    13912    0    6
CcLG01    13963    0    6
CcLG01    14025    0    10
CcLG01    14060    0    8
CcLG01    14110    0    2
CcLG01    14114    0    1
CcLG01    14346    0    1
CcLG01    14347    0    2
CcLG01    20277    0    7
CcLG01    20362    0    29
CcLG01    20365    0    31
CcLG01    20461    0    11
CcLG01    20468    0    14
CcLG01    20643    0    2
CcLG01    20644    0    2
CcLG01    21657    9    9
CcLG01    21730    82    82
CcLG01    21731    47    49
CcLG01    21876    96    97
CcLG01    21877    274    303
CcLG01    21938    71    86
CcLG01    21939    109    127
CcLG01    21957    66    76
CcLG01    21958    185    234
CcLG01    21965    34    78
CcLG01    21966    225    237
CcLG01    22072    17    20
CcLG01    22073    50    51
CcLG01    22157    8    9
CcLG01    22158    10    12
CcLG01    22320    19    29
CcLG01    22321    48    50
CcLG01    22331    115    122
CcLG01    22332    56    56
CcLG01    22345    134    135
CcLG01    22346    68    70
CcLG01    22350    128    136
CcLG01    22351    68    70
CcLG01    24084    1    3

File2

Chr       Start    End
CcLG01    13500    14800

Output

Chr       Start    End      Sum.Col.M    Sum.Col.X
Cclg01    13500    14800    ?            ?
ADD REPLYlink modified 4.2 years ago by WouterDeCoster45k • written 4.2 years ago by alim.hcu10

Take the range 13500 to 14800 from file 2. This range should be used on file 1 to sum column 3 (M) and column 4 (X) of file 1.

In file 1, column 2 is the position, so positions 13883 to 14347 fall within the range 13500 to 14800.

ADD REPLYlink written 4.2 years ago by alim.hcu10

So you want to add up the values in columns 3 and 4 in file 1 for rows where the value in column 2 lies within a range as specified by start and end in file 2?

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by jonasmst330

I think we are narrowing it down...

A custom python script could do the trick, but perhaps we can get away with a bedtools or GRanges solution, too.

ADD REPLYlink written 4.2 years ago by WouterDeCoster45k
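
As a rough illustration of the custom Python route, the sketch below sums columns 3 and 4 of the data file for every position that falls inside each interval. It assumes whitespace-separated columns, inclusive interval ends, and chromosome names compared case-insensitively (the thread mixes `CcLG01` and `Cclg01`). The function name `sum_in_intervals` and the inline sample rows are illustrative only; this is not the script discussed in the thread.

```python
from collections import defaultdict

def sum_in_intervals(data_lines, interval_lines):
    """Return (chrom, start, end, sum_col3, sum_col4) for each interval."""
    # Group (position, col3, col4) by lower-cased chromosome name.
    by_chrom = defaultdict(list)
    for line in data_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or short lines
        chrom, pos, m, x = fields[:4]
        by_chrom[chrom.lower()].append((int(pos), float(m), float(x)))

    results = []
    for line in interval_lines:
        fields = line.split()
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # skip blank lines and a header such as "Chr Start End"
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Assumption: both interval ends are inclusive.
        rows = [r for r in by_chrom[chrom.lower()] if start <= r[0] <= end]
        results.append((chrom, start, end,
                        sum(r[1] for r in rows), sum(r[2] for r in rows)))
    return results

# A few rows taken from the example data in this thread.
data = """CcLG01 13883 0 4
CcLG01 14347 0 2
CcLG01 21657 9 9""".splitlines()
intervals = ["Chr Start End", "CcLG01 13500 14800"]

for row in sum_in_intervals(data, intervals):
    print(*row, sep="\t")
```

This scans every data row for every interval, which is fine for small inputs; for genome-scale files, sorting positions and using binary search, or an interval tool such as bedtools, would likely scale better.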

Exactly. That is exactly what I want to do.

Thank You for your time and concern.

ADD REPLYlink written 4.2 years ago by alim.hcu10

Does the value on position 13500 belong to the interval 13500 to 14800 or not? What about the value on position 14800?

Essentially: inclusive or non-inclusive intervals? :-)

ADD REPLYlink written 4.2 years ago by WouterDeCoster45k
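
The choice only changes the result for positions that sit exactly on a boundary. The sketch below uses made-up values, with positions 13500 and 14800 added purely for illustration (they do not appear in the example data), and shows both conventions with the standard `bisect` module on a sorted position list. The function name `range_sum` is hypothetical.

```python
from bisect import bisect_left, bisect_right

positions = [13500, 13883, 14347, 14800, 20277]  # must be sorted ascending
values = [1, 4, 2, 8, 7]                         # one value per position

def range_sum(start, end, inclusive=True):
    """Sum values whose position lies in [start, end] or in [start, end)."""
    lo = bisect_left(positions, start)
    # bisect_right keeps a position equal to `end`; bisect_left drops it.
    hi = bisect_right(positions, end) if inclusive else bisect_left(positions, end)
    return sum(values[lo:hi])

print(range_sum(13500, 14800))                   # closed interval [13500, 14800]
print(range_sum(13500, 14800, inclusive=False))  # half-open interval [13500, 14800)
```

Note that BED-based tools use 0-based, half-open coordinates, so the convention should be settled before comparing results between a custom script and such tools.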

I am getting this error. The file I used is attached.

drkishor@Dr:~/Desktop$ python int.py sample.txt position.txt  > try.txt
Traceback (most recent call last):
  File "int.py", line 9, in <module>
    datadict = {line.split('\t')[1] : (line.split('\t')[2], line.strip().split('\t')[3]) for line in data if not line == ""} #Create a dictionary mapping the position to the values
  File "int.py", line 9, in <dictcomp>
    datadict = {line.split('\t')[1] : (line.split('\t')[2], line.strip().split('\t')[3]) for line in data if not line == ""} #Create a dictionary mapping the position to the values
IndexError: list index out of range
ADD REPLYlink modified 4.2 years ago by WouterDeCoster45k • written 4.2 years ago by alim.hcu10

Please add your comment to the relevant post and not just randomly, this makes things rather confusing.

Let's make sure you use the arguments correctly. Corresponding to your example data above, you should run the script like this:

python Script2Extract.py file1 file2 > output
ADD REPLYlink written 4.2 years ago by WouterDeCoster45k

Dear,

I am still getting the error "IndexError: list index out of range".

Thank You

ADD REPLYlink written 4.2 years ago by alim.hcu10

What is your column separator? Spaces or tabs?

I updated my code, please try again. It should identify the line causing this error.

ADD REPLYlink written 4.2 years ago by WouterDeCoster45k
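
An `IndexError` from `line.split('\t')[3]` is what one would expect when a line is separated by spaces rather than tabs, or has fewer than four columns. One way to sidestep the separator question, sketched here under the assumption that no field contains internal whitespace, is to call `split()` with no argument and report the offending line instead of crashing. `parse_data_line` is a hypothetical helper, not the updated script mentioned above.

```python
def parse_data_line(line, lineno):
    """Parse 'chrom pos col3 col4' separated by any whitespace."""
    fields = line.split()  # no argument: splits on runs of spaces or tabs
    if len(fields) < 4:
        raise ValueError(
            f"line {lineno}: expected 4 columns, got {len(fields)}: {line!r}"
        )
    return fields[0], int(fields[1]), float(fields[2]), float(fields[3])

print(parse_data_line("CcLG01\t13883\t0\t4", 1))        # tab-separated
print(parse_data_line("CcLG01    13883    0    4", 2))  # space-separated

try:
    parse_data_line("CcLG01 21694 22357", 3)  # a three-column interval line
except ValueError as err:
    print("INPUT ERROR:", err)
```

A three-column line reaching the data parser, as in the last call, would also be consistent with the two input files having been passed in the wrong order.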

drkishor@Dr:~/Desktop$ python extract.py pos_sam.txt Final_sam.txt
INPUT ERROR AT FOLLOWING LINE: CcLG01 21694 22357

Please find attached the file I used.

ADD REPLYlink written 4.2 years ago by alim.hcu10

There is no attached file. Switch the arguments.

ADD REPLYlink written 4.2 years ago by WouterDeCoster45k

Hello alim.hcu!

We believe that this post does not fit the main topic of this site.

Not a bioinformatics question. The question has several further issues, after a long back-and-forth discussion it remains totally unclear what the desired output is, the example data given is possibly inadequate etc.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Michael Dondrup48k

But I don't know who posted this. I think it may be spam.

Thank You

ADD REPLYlink written 4.1 years ago by alim.hcu10

Yes, there was a spam post here and I removed it.

But Michael's arguments have nothing to do with spam:

The question has several further issues, after a long back-and-forth discussion it remains totally unclear what the desired output is, the example data given is possibly inadequate etc.

ADD REPLYlink written 4.1 years ago by WouterDeCoster45k
WouterDeCoster45k wrote:

See also my question above:

Does the value on position 13500 belong to the interval 13500 to 14800 or not? What about the value on position 14800? Essentially: inclusive or non-inclusive intervals? :-)

I don't have access to your data, so you will have to do the testing. Please give feedback.

ADD COMMENTlink written 4.2 years ago by WouterDeCoster45k