I am trying to learn numerical analysis. I am following this article: http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html
My data looks like this:
date hr_of_day vals
2014-05-01 0 72
2014-05-01 1 127
2014-05-01 2 277
2014-05-01 3 411
2014-05-01 4 666
2014-05-01 5 912
2014-05-01 6 1164
2014-05-01 7 1119
2014-05-01 8 951
2014-05-01 9 929
2014-05-01 10 942
2014-05-01 11 968
2014-05-01 12 856
2014-05-01 13 835
2014-05-01 14 885
2014-05-01 15 945
2014-05-01 16 924
2014-05-01 17 914
2014-05-01 18 744
2014-05-01 19 377
2014-05-01 20 219
2014-05-01 21 106
2014-05-01 22 56
2014-05-01 23 43
2014-05-02 0 61
For a given date and hour, I want to predict vals and identify a pattern.
I have written this code:
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# read the data in
Train = pd.read_csv("data_scientist_assignment.tsv")
#print df.head()
x1=["date", "hr_of_day", "vals"]
#print x1
#print df[x1]
test=pd.read_csv("test.tsv")
model = LogisticRegression()
model.fit(Train[x1], Train["vals"])
print(model)
print model.score(Train[x1], Train["vals"])
print model.predict_proba(test[x1])
I am getting this error:
KeyError: "['date' 'hr_of_day' 'vals'] not in index"
What is the issue? Is there a better way to do this?
test file format:
date hr_of_day
2014-05-01 0
2014-05-01 1
2014-05-01 2
2014-05-01 3
2014-05-01 4
2014-05-01 5
2014-05-01 6
2014-05-01 7
Full error stack:
Traceback (most recent call last):
File "socratis.py", line 16, in <module>
model.fit(Train[x1], Train["vals"])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1986, in __getitem__
return self._getitem_array(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2030, in _getitem_array
indexer = self.ix._convert_to_indexer(key, axis=1)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1210, in _convert_to_indexer
raise KeyError('%s not in index' % objarr[mask])
KeyError: "['date' 'hr_of_day' 'vals'] not in index"
I suggest providing the sep='\t' parameter when reading the TSV:
Train = pd.read_csv("data_scientist_assignment.tsv", sep='\t') # use TAB as column separator
When you fix this, there is another problem in the queue: ValueError: could not convert string to float: '2014-09-13'
This is because the model wants numeric features, and the date column is a string.
You can introduce a new column timestamp by converting the date to a timestamp (seconds since epoch) and use it as a feature:
Train['timestamp'] = pd.to_datetime(Train['date']).apply(lambda a: a.timestamp())
x1=["timestamp", "hr_of_day", "vals"]
From an ML perspective, you shouldn't use your target value vals as an input feature. You should also consider representing the date as individual features: day, month, year, or day-of-week; it depends on what you want to model.
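Putting it all together, here is a minimal sketch of the corrected pipeline (switching to LinearRegression is my assumption: vals is a continuous quantity, and LogisticRegression would treat every distinct value as a class label; keep LogisticRegression only if you really mean classification):

import pandas as pd
from sklearn.linear_model import LinearRegression

# read both files with TAB as the column separator
Train = pd.read_csv("data_scientist_assignment.tsv", sep='\t')
test = pd.read_csv("test.tsv", sep='\t')

# convert the string date into seconds since epoch, for both frames
for df in (Train, test):
    df['timestamp'] = pd.to_datetime(df['date']).apply(lambda a: a.timestamp())

features = ["timestamp", "hr_of_day"]  # the target "vals" is deliberately excluded
model = LinearRegression()
model.fit(Train[features], Train["vals"])
print(model.score(Train[features], Train["vals"]))
print(model.predict(test[features]))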
I have the following minimal example data (in reality hundreds of groups) in range A1:P9 (same data in range A14:A22):
With Sample A1:AR9: [screenshot residue: the grid's cell values were flattened into a single column here; the original row/column layout is not recoverable]
Or Sample A14:AQ22: [screenshot residue: same flattening; layout not recoverable]
I need the output as shown in range Q1:AR3 or as in range Q14:AQ16.
Basically, for each group delimited by the in-between values in Column A, I would need:
The intermediary adjacent values in Column B to be transposed horizontally,
And the adjacent content of Columns C to P (14 columns, at least) to be "joined" together horizontally and sequentially "per group", including the content of the delimiter's row (in Column A).
As a bonus, it would be really nice to have the transposed data followed by a ":", and each sub-content of Columns C to P also separated by a "|" (as shown in screenshot Q1:AR3 or Q14:AR16).
(Or, if it's more feasible, alternatively, the simpler-to-read 2nd model as in A14:AQ22.)
I am having a really hard time putting together a formula to arrive at the expected result.
All I could think of was:
Transposing Column B's content by getting the rows of the adjacent Cells with values in column A,
Concatenating with the Column letter,
Duplicating it in a new column, and Filtering out the blank intermediary cells,
Then shifting the duplicated column 1 cell up,
Then concatenating within a TRANSPOSE formula to get the range of the groups,
Then finally transposing all the groups from Column B in a new Column
(very convoluted, but I couldn't find a better way).
To get to that input:
=TRANSPOSE(B1:B3)
=TRANSPOSE(B4:B5)
=TRANSPOSE(B7:B9)
That was already a very manual and error-prone process, and still I could not successfully work out how to do the remaining content joining of Columns C to P in a formula.
I tested the following approach, but it's not working, and it would be a very tedious process to fix and to implement on large datasets:
=TRANSPOSE(B1:B3)&": "&JOIN( " | " , FILTER(C1:P1, NOT(C2:P2 = "") ))&JOIN( " | " , FILTER(C2:P2, NOT(C2:P2 = "") ))&JOIN( " | " , FILTER(C3:P3, NOT(C3:P3 = "") ))
=TRANSPOSE(B4:B5)&": "&JOIN( " | " , FILTER(C4:P4, NOT(C4:P4 = "") ))&JOIN( " | " , FILTER(C5:P5, NOT(C5:P5 = "") ))
=TRANSPOSE(B6:B9)&": "&JOIN( " | " , FILTER(C6:P6, NOT(C6:P6 = "") ))&JOIN( " | " , FILTER(C7:P7, NOT(C7:P7 = "") ))&JOIN( " | " , FILTER(C8:P8, NOT(C8:P8 = "") ))&JOIN( " | " , FILTER(C9:P9, NOT(C9:P9 = "") ))
What better approach would lead to the expected result? Preferably with a formula or, if not possible, with a script.
Any help is greatly appreciated.
For Sample 1 try this out:
=LAMBDA(norm,MAP(UNIQUE(norm),LAMBDA(ζ,{TRANSPOSE(FILTER(B1:B9,norm=ζ)),":",SPLIT(BYROW(TRANSPOSE(FILTER(BYROW(C1:P9,LAMBDA(r,TEXTJOIN("ζ",1,r))),norm=ζ)),LAMBDA(rr,TEXTJOIN("γ|γ",1,rr))),"ζγ")})))(SORT(SCAN(,SORT(A1:A9,ROW(A1:A9),),LAMBDA(a,c,IF(c="",a,c))),ROW(A1:A9),))
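Roughly how this works (a best-effort reading, since the formula is dense): the inner SORT/SCAN/SORT pipeline forward-fills the blanks in Column A so that every row carries its group's delimiter value; MAP over UNIQUE then visits each group, TRANSPOSE(FILTER(B1:B9, ...)) lays the group's Column B values out horizontally ahead of the ":" separator, and the nested TEXTJOIN/SPLIT pair stitches the group's C:P contents together row by row, re-splitting them into cells with "|" inserted between consecutive rows.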
I am trying to play with rdflib and a (my) user defined vocabulary (name: ODE).
To do that I have generated a class namespace/_ODE.py derived from DefinedNamespace:
from rdflib.term import URIRef
from rdflib.namespace import DefinedNamespace, Namespace


class ODE(DefinedNamespace):
    """
    DESCRIPTION_EDIT_ME_!

    Generated from: SOURCE_RDF_FILE_EDIT_ME_!
    Date: 2022-05-02 08:38:55.619901
    """

    _fail = True

    Function: URIRef
    Equation: URIRef
    hasDerivative: URIRef
    Polynomial: URIRef
    Ode: URIRef

    _NS = Namespace("ode#")
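With this in place, attribute access such as ODE.Function should resolve to URIRef("ode#Function") (the attribute name appended to _NS), and, because _fail = True, any attribute not declared in the class raises an AttributeError rather than silently minting a new term.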
As all the new "classes" of the ODE vocabulary are specializations of the class "Seq", I have created the module rdflib/ode.py:
from rdflib import Seq, BNode
from rdflib.namespace import RDF, ODE, MATH

__all__ = ["Function", "Equation", "Polynomial", "Ode"]


class Ode(Seq):
    def __init__(self, graph, uri, seq=[], rtype="Ode"):
        """Creates a Container

        :param graph: a Graph instance
        :param uri: URI or Blank Node of the Container
        :param seq: the elements of the Container
        :param rtype: the type of Container, one of "Bag", "Seq" or "Alt"
        """

        self.graph = graph
        self.uri = uri or BNode()
        self._len = 0
        self._rtype = rtype  # rdf:Bag or rdf:Seq or rdf:Alt

        self.append_multiple(seq)

        # adding triple corresponding to container type
        self.graph.add((self.uri, RDF.type, ODE[self._rtype]))


class Function(Ode):
    def __init__(self, graph, uri, seq=[]):
        Ode.__init__(self, graph, uri, seq, "Function")


class Equation(Ode):
    def __init__(self, graph, uri, seq=[]):
        Ode.__init__(self, graph, uri, seq, "Equation")


class Polynomial(Ode):
    def __init__(self, graph, uri, seq=[]):
        Ode.__init__(self, graph, uri, seq, "Polynomial")
With these two classes I can generate an RDF file in a declarative way.
For example, we can create the Function c(t):
from rdflib import Graph, URIRef, RDF, BNode, RDFS, Literal, Seq, Bag, Function, Equation, Times, Minus, Polynomial, Ode
from rdflib.namespace import ODE, MATH

# the snippet needs a Graph instance to add triples to
graph = Graph()

# the time t
t = BNode("t")
graph.add((t, RDFS.label, Literal("t")))

c_of_t_label = BNode("c")
graph.add((c_of_t_label, RDFS.label, Literal("c")))
c_of_t_bn = BNode("c_of_t")

Function(graph, c_of_t_bn, [c_of_t_label, t])
And we obtain the following RDF:
_:c rdfs:label "c" .
_:t rdfs:label "t" .
_:c_of_t a ode:Function ;
rdf:_1 _:c ;
rdf:_2 _:t .
So far, so good. Now I want to execute a SPARQL query on this RDF to retrieve the function.
import rdflib

from rdflib import Graph, URIRef, RDF, BNode, RDFS, Literal, Seq, Bag, Function, Equation, Times, Minus, Polynomial, Ode
from rdflib.namespace import ODE, MATH


def main():
    g = rdflib.Graph()
    g.parse("ode_spe", format="ttl")

    function = ODE.Function

    query_test = "SELECT ?e WHERE {?e rdf:type ode:Function . }"
    qres = g.query(query_test)

    print(len(qres))


if __name__ == "__main__":
    main()
But I get no results.
I am probably not doing the right thing with ode:Function.
I have two questions:
Is this the right way to add a user-defined vocabulary?
And what can I do to retrieve the function with a SPARQL query?
Thanks for your help.
Olivier
My eye was drawn to SELECT ?e WHERE {?e rdf:type ode:Function . }. Check that the ode prefix is known by the graph: either add a PREFIX declaration in the SPARQL or pass an initNs keyword argument in the g.query invocation. And/or use g.namespace_manager to bind "ode" to ODE.
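A minimal sketch of the three options (assuming ODE is importable the way your generated namespace/_ODE.py exposes it):

from rdflib import Graph
from rdflib.namespace import RDF, ODE

g = Graph()
g.parse("ode_spe", format="ttl")

# Option 1: pass the namespaces for this query only
qres = g.query(
    "SELECT ?e WHERE { ?e rdf:type ode:Function . }",
    initNs={"rdf": RDF, "ode": ODE},
)

# Option 2: bind the prefix on the graph once, for all later queries
g.namespace_manager.bind("ode", ODE)
qres = g.query("SELECT ?e WHERE { ?e rdf:type ode:Function . }")

# Option 3: declare the prefix inside the SPARQL itself
qres = g.query("""
    PREFIX ode: <ode#>
    SELECT ?e WHERE { ?e rdf:type ode:Function . }
""")

print(len(qres))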
I'm querying data from different shards and used EXPLAIN to check how many series are fetched for a particular date range.
> SHOW SHARDS
.
.
658 mydb autogen 658 2019-07-22T00:00:00Z 2019-07-29T00:00:00Z 2020-07-27T00:00:00Z
676 mydb autogen 676 2019-07-29T00:00:00Z 2019-08-05T00:00:00Z 2020-08-03T00:00:00Z
.
.
Executing EXPLAIN for data from shard 658 gives the expected result in terms of the number of series: SensorId is the only tag key, and since the date range falls into a single shard, it reports NUMBER OF SERIES: 1.
> EXPLAIN select "kWh" from Reading where (SensorId =~ /^1186$/) AND time >= '2019-07-27 00:00:00' AND time <= '2019-07-28 00:00:00' limit 10;
QUERY PLAN
----------
EXPRESSION: <nil>
AUXILIARY FIELDS: "kWh"::float
NUMBER OF SHARDS: 1
NUMBER OF SERIES: 1
CACHED VALUES: 0
NUMBER OF FILES: 2
NUMBER OF BLOCKS: 4
SIZE OF BLOCKS: 32482
But when I run the same query on a date range that falls into shard 676, the number of series is 13140 instead of just one.
> EXPLAIN select "kWh" from Reading where (SensorId =~ /^1186$/) AND time >= '2019-07-29 00:00:00' AND time < '2019-07-30 00:00:00';
QUERY PLAN
----------
EXPRESSION: <nil>
AUXILIARY FIELDS: "kWh"::float
NUMBER OF SHARDS: 1
NUMBER OF SERIES: 13140
CACHED VALUES: 0
NUMBER OF FILES: 11426
NUMBER OF BLOCKS: 23561
SIZE OF BLOCKS: 108031642
Environment info:
System info: Linux 4.4.0-1087-aws x86_64
InfluxDB version: InfluxDB v1.7.6 (git: 1.7 01c8dd4)
Update - 1
On checking field cardinality, I observed a spike in RAM.
> SHOW FIELD KEY CARDINALITY
Update - 2
I've rebuilt the indexes, but the cardinality is still high.
Update - 3
I found out that the shard has "SensorId" as a tag as well as a field, which causes high cardinality when querying with the "SensorId" filter.
> SELECT COUNT("SensorId") from Reading GROUP BY "SensorId";
name: Reading
tags: SensorId=
time count
---- -----
1970-01-01T00:00:00Z 40
But when I check tag values with the key 'SensorId', it does not show the empty string that is present in the query above.
> show tag values with key = "SensorId"
name: Reading
key value
--- -----
SensorId 10034
SensorId 10037
SensorId 10038
SensorId 10039
SensorId 10040
SensorId 10041
.
.
.
SensorId 9938
SensorId 9939
SensorId 9941
SensorId 9942
SensorId 9944
SensorId 9949
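A direct way to confirm the tag/field name collision (both are standard InfluxQL commands; "SensorId" appearing in both listings would confirm it exists as a tag key and as a field key):
> SHOW TAG KEYS FROM Reading
> SHOW FIELD KEYS FROM Reading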
Update - 4
Inspected the data using influx_inspect dumptsm and re-validated that null tag values are present:
$ influx_inspect dumptsm -index -filter-key "" /var/lib/influxdb/data/mydb/autogen/235/000008442-000000013.tsm
Index:
Pos Min Time Max Time Ofs Size Key Field
1 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 5 103 Reading 1001
2 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 108 275 Reading 2001
3 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 383 248 Reading 2002
4 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 631 278 Reading 2003
5 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 909 278 Reading 2004
6 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 1187 184 Reading 2005
7 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 1371 103 Reading 2006
8 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 1474 250 Reading 2007
9 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 1724 103 Reading 2008
10 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 1827 275 Reading 2012
11 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 2102 416 Reading 2101
12 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 2518 103 Reading 2692
13 2019-08-01T01:46:31Z 2019-08-01T17:42:03Z 2621 101 Reading SensorId
14 2019-07-29T00:00:05Z 2019-07-29T05:31:07Z 2722 1569 Reading,SensorId=10034 2005
15 2019-07-29T05:31:26Z 2019-07-29T11:03:54Z 4291 1467 Reading,SensorId=10034 2005
16 2019-07-29T11:04:14Z 2019-07-29T17:10:16Z 5758 1785 Reading,SensorId=10034 2005
I basically want the same thing as this OP:
Is there a J idiom for adding to a list until a certain condition is met?
But I can't get the answers to work with the OP's function or my own.
I will rephrase the question and write about the answers at the bottom.
I am trying to create a function that will return a list of Fibonacci numbers less than 2,000,000 (without writing "while" inside the function).
Here is what I have tried:
First, I picked a way to calculate Fibonacci numbers from this site:
https://code.jsoftware.com/wiki/Essays/Fibonacci_Sequence
fib =: (i. +/ .! i.@-)"0
echo fib i.10
0 1 1 2 3 5 8 13 21 34
Then I made an arbitrary list I knew was larger than what I needed:
fiblist =: (fib i.40) NB. THIS IS A BAD SOLUTION!
Finally, I removed the numbers that were greater than what I needed:
result =: (fiblist < 2e6) # fiblist
echo result
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1.34627e6
This gets the right result, but is there a way to avoid using some arbitrary number like 40 in "fib i.40"?
I would like to write a function such that "func 2e6" returns the list of Fibonacci numbers below 2,000,000 (without writing "while" inside the function).
echo func 2e6
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1.34627e6
Here are the answers from the other question:
First answer:
2 *^:(100&>@:])^:_"0 (1 3 5 7 9 11)
128 192 160 112 144 176
Second answer:
+:^:(100&>)^:(<_) ] 3
3 6 12 24 48 96 192
As I understand it, I just need to replace the functions used in the answers, but I don't see how that can work. For example, if I try:
echo (, [: +/ _2&{.)^:(100&>@:])^:_ i.2
I get an error.
I approached it this way. First, I wanted a way of generating the nth Fibonacci number, and I used f0b from your link to the Jsoftware Essays.
f0b=: (-&2 +&$: -&1) ^: (1&<) M.
Once I had that, I just wanted a verb that checks whether the result of f0b is less than a certain amount (I used 1000) and, if it is, increments the input and goes through the process again. This is the ($:@:>:) part. $: is Self-Reference. The right argument 0 is the starting point for generating the sequence.
($:@:>: ^: (1000 > f0b)) 0
17
This tells me that the 17th Fibonacci number is the largest one less than my limit. I use that information to generate the Fibonacci numbers by applying f0b to each item of i. ($:@:>: ^: (1000 > f0b)) 0, using rank 0 (f0b"0).
f0b"0 i. ($:#:>: ^: (1000 > f0b)) 0
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
In your case you wanted the ones under 2000000:
f0b"0 i. ($:@:>: ^: (2000000 > f0b)) 0
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269
... and then I realized that you wanted a verb to be able to answer your original question. I went with a dyadic verb where the left argument is the limit and the right argument is the starting point for generating the sequence. Same idea, but I was able to make use of some hooks when I went to the tacit form. (> f0b) checks whether the result of f0b is under the limit, and ($: >:) increments the right argument while allowing the left argument to remain for $:.
2000000 (($: >:) ^: (> f0b)) 0
32
fnum=: (($: >:) ^: (> f0b))
2000000 fnum 0
32
f0b"0 i. 2000000 fnum 0
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269
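To package this as the monadic func asked for in the question, one possible composition (a sketch along the same lines, untested) bonds the starting 0 onto the dyad and chains the three steps:
func =: f0b"0@:i.@:(fnum&0)
func 2e6 NB. should produce the same list as above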
I have little doubt that others will come up with better solutions, but this is what I cobbled together tonight.
I want to read the following data into SPSS:
ID Age Sex GPA
----------------
1 17 M 5
2 16 F 5
3 17 F 4.75
4 18 M 5
5 19 M 4.5
My attempt:
DATA LIST / ID 1 AGE 2-3 SEX 4(A) GPA 5-8.
BEGIN DATA
117M5
216F5
317F4.75
418M5
519M4.5
END DATA.
LIST.
But the output is
ID AGE SEX GPA
---------------
1 17 M 5
2 16 F 5
3 17 F 5
4 18 M 5
5 19 M 5
How can I get the decimals?
Your data is read in as expected; it is just that the display format of the GPA variable was set to show no decimals. You can simply use the command below to make it show the decimals:
FORMATS GPA (F3.2).
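(In Fw.d formats, w is the total printed width and d is the number of decimals to display; if values such as 4.75 come out truncated, widen w, e.g. F4.2.)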
Alternatively, you can also try this:
DATA LIST / ID 1 AGE 2-3 SEX 4(A) GPA 5-7(F,2).
BEGIN DATA
117M500
317F475
END DATA.
LIST.
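For the alternative, the (F,2) in GPA 5-7(F,2) declares two implied decimal places: when the raw field contains no decimal point, the last two digits are read as decimals, so 500 becomes 5.00 and 475 becomes 4.75 (an explicit decimal point in the data would override this).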