How to pass different values to Pipeline Parameters - machine-learning

suppose if i am doing hyper parameter tuning to one of my model, lets say, i am using AdaBoostClassifier() and want to pass different base_estimator, so i pass SVC & DecisionTreeClassifier as estimator
_parameters=[
{
'mdl':[AdaBoostClassifier(random_state=23)],
'mdl__learning_rate':np.linspace(0,1,20),
'mdl__base_estimator':[SVC(),DecisionTreeClassifier()]
}
]
now, i want to pass different values to ccp_alpha of DecisionTreeClassifier, something like this
'mdl__base_estimator':[LinearRegression(),DecisionTreeClassifier(ccp_alpha=[0.1,0.2,0.3,0.4])]
how can i do that, i tried passing it like this, but it is not working, here is my entire code
pipeline=Pipeline(
[
('scal',StandardScaler()),
('mdl','passthrough')
]
)
_parameters=[
{
'mdl':[DecisionTreeClassifier(random_state=42)] ,
'mdl__max_depth':np.linspace(2,30,2),
'mdl__min_samples_split':np.linspace(1,10,1),
'mdl__max_features':np.linspace(1,100,1),
'mdl__ccp_alpha':np.linspace(0,1,10)
}
,{
'mdl':[AdaBoostClassifier(random_state=23)],
'mdl__learning_rate':np.linspace(0,1,20),
'mdl__base_estimator':[SVC(),DecisionTreeClassifier(ccp_alpha=[0.3,0.4,0.5,0.7])]
}
]
grid_search=GridSearchCV(_pipeline,_parameters,cv=3,n_jobs=-1,scoring='f1')
grid_search.fit(x,y
)

This kind of splitting is why param_grid can be a list of dicts, as in your outer split; but it cannot easily handle the nested disjunction you have. Two approaches come to mind.
More disjoint grids:
_parameters=[
{
'mdl': [DecisionTreeClassifier(random_state=42)],
'mdl__max_depth': np.linspace(2,30,2),
'mdl__min_samples_split': np.linspace(1,10,1),
'mdl__max_features': np.linspace(1,100,1),
'mdl__ccp_alpha': np.linspace(0,1,10),
},
{
'mdl': [AdaBoostClassifier(random_state=23)],
'mdl__learning_rate': np.linspace(0,1,20),
'mdl__base_estimator': [SVC()],
},
{
'mdl': [AdaBoostClassifier(random_state=23)],
'mdl__learning_rate': np.linspace(0,1,20),
'mdl__base_estimator': [DecisionTreeClassifier()],
'mdl__base_estimator__ccp_alpha': [0.3,0.4,0.5,0.7],
},
]
Or list comprehension:
_parameters=[
{
'mdl': [DecisionTreeClassifier(random_state=42)],
'mdl__max_depth': np.linspace(2,30,2),
'mdl__min_samples_split': np.linspace(1,10,1),
'mdl__max_features': np.linspace(1,100,1),
'mdl__ccp_alpha': np.linspace(0,1,10),
},
{
'mdl': [AdaBoostClassifier(random_state=23)],
'mdl__learning_rate': np.linspace(0,1,20),
'mdl__base_estimator': [SVC()] + [DecisionTreeClassifier(ccp_alpha=a) for a in [0.3,0.4,0.5,0.7]],
},
]

Related

Finding the mean point in tiling space?

I have a collection of points in tiling 2D space (I believe its called toroidal geometry/space), and I want to find their mean:
The basic approach would be to just take their mean 'locally' where you would just treat the space as non-tiling. Looking at the example, I'd guess that that would be somewhere in about the middle. However, looking at the extended example, I'd say that the middle is probably one of the worst representations of the data.
I'd say the objective is to find a location where the total variation from the mean is at a minimum
One potential method would be to try all combinations of points in each of the 9 neighbours, and then see which one has the lowest variance, but that becomes extremely inefficient very quickly:
Big O = O(8^n)
I believe it could probably be made more efficient by doing something like treating the x and y independently, but that would only reduce it to O(5^n), so still not manageable.
Perhaps hill-climbing might work? Where I have a random point, and then calculate the lowest possible variance for each point, then make some random adjustments and test again reverting if the variance decreases, I then repeat this until I reach a seemingly optimal value.
Is there a better method? Or maybe some sort of heuristic 'good enough' method?
as of my understanding, you are trying to find the center of mass.
to do so (for each tile) you have to find the sum of each positions multiplied by its weight (assume it is 1 because they are just identical points positioned differently) then divide this by the sum of the weights (which is the number of points in this example as weight = 1). The formula
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS_HTML-full"></script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({"HTML-CSS": { preferredFont: "TeX", availableFonts:["STIX","TeX"], linebreaks: { automatic:true }, EqnChunk:(MathJax.Hub.Browser.isMobile ? 10 : 50) }, tex2jax: { inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ], displayMath: [ ["$$","$$"], ["\\[", "\\]"] ], processEscapes: true, ignoreClass: "tex2jax_ignore|dno" }, TeX: { noUndefined: { attributes: { mathcolor: "red", mathbackground: "#FFEEEE", mathsize: "90%" } }, Macros: { href: "{}" } }, messageStyle: "none" }); </script>
$$G\left( x,y \right) =\frac{\sum_{i=1}^n{m_i\cdot p_i\left( x_i\,\,,y_i \right)}}{\sum_{i=1}^n{m_i}}$$
in our example:
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS_HTML-full"></script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({"HTML-CSS": { preferredFont: "TeX", availableFonts:["STIX","TeX"], linebreaks: { automatic:true }, EqnChunk:(MathJax.Hub.Browser.isMobile ? 10 : 50) }, tex2jax: { inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ], displayMath: [ ["$$","$$"], ["\\[", "\\]"] ], processEscapes: true, ignoreClass: "tex2jax_ignore|dno" }, TeX: { noUndefined: { attributes: { mathcolor: "red", mathbackground: "#FFEEEE", mathsize: "90%" } }, Macros: { href: "{}" } }, messageStyle: "none" }); </script>
$$G\left( x,y \right) =\frac{\sum_{i=1}^n{p_i\left( x_i\,\,,y_i \right)}}{n}$$
here is a python implementation :
#assuming l is a list of 2D tuples as follow: [(x1,y1),(x2,y2),...]
def findCenter(l):
cx,cy = 0,0
for p in l:
cx += p[0]
cy += p[1]
return (cx / len(l), cy / len(l))
# the result is a tuple of non-integer, if you need
# an integer result use: return (cx // len(l),cy // len(l))
# remove the outer parenthesis to take result as 2 separate values
as for multiple tiles you can calculate the center for each tile then the center for the centers, or treat all points from multiple tiles as points from one big tile and calculate the center of all of them.

Is there a way to use OCR to extract specific data from a CAD technical drawing?

I'm trying to use OCR to extract only the base dimensions of a CAD model, but there are other associative dimensions that I don't need (like angles, length from baseline to hole, etc). Here is an example of a technical drawing. (The numbers in red circles are the base dimensions, the rest in purple highlights are the ones to ignore.) How can I tell my program to extract only the base dimensions (the height, length, and width of a block before it goes through the CNC)?
The issue is that the drawings I get are not in a specific format, so I can't tell the OCR where the dimensions are. It has to figure out on its own contextually.
Should I train the program through machine learning by running several iterations and correcting it? If so, what methods are there? The only thing I can think of are Opencv cascade classifiers.
Or are there other methods to solving this problem?
Sorry for the long post. Thanks.
I feel you... it's a very tricky problem, and we spent the last 3 years finding a solution for it. Forgive me for mentioning the own solution, but it will certainly solve your problem: pip install werk24
from werk24 import Hook, W24AskVariantMeasures
from werk24.models.techread import W24TechreadMessage
from werk24.utils import w24_read_sync
from . import get_drawing_bytes # define your own
def recv_measures(message: W24TechreadMessage) -> None:
for cur_measure in message.payload_dict.get('measures'):
print(cur_measure)
if __name__ == "__main__":
# define what information you want to receive from the API
# and what shall be done when the info is available.
hooks = [Hook(ask=W24AskVariantMeasures(), function=recv_measures)]
# submit the request to the Werk24 API
w24_read_sync(get_drawing_bytes(), hooks)
In your example it will return for example the following measure
{
"position": <STRIPPED>
"label": {
"blurb": "ø30 H7 +0.0210/0",
"quantity": 1,
"size": {
"blurb": "30",
"size_type":" "DIAMETER",
"nominal_size": "30.0",
},
"unit": "MILLIMETER",
"size_tolerance": {
"toleration_type": "FIT_SIZE_ISO",
"blurb": "H7",
"deviation_lower": "0.0",
"deviation_upper": "0.0210",
"fundamental_deviation": "H",
"tolerance_grade": {
"grade":7,
"warnings":[]
},
"thread": null,
"chamfer": null,
"depth":null,
"test_dimension": null,
},
"warnings": [],
"confidence": 0.98810
}
or for a GD&T
{
"position": <STRIPPED>,
"frame": {
"blurb": "[⟂|0.05|A]",
"characteristic": "⟂",
"zone_shape": null,
"zone_value": {
"blurb": "0.05",
"width_min": 0.05,
"width_max": null,
"extend_quantity": null,
"extend_shape": null,
"extend": null,
"extend_angle": null
},
"zone_combinations": [],
"zone_offset": null,
"zone_constraint": null,
"feature_filter": null,
"feature_associated": null,
"feature_derived": null,
"reference_association": null,
"reference_parameter": null,
"material_condition": null,
"state": null,
"data": [
{
"blurb": "A"
}
]
}
}
Check the documentation on Werk24 for details.
Although a managed offering, Mixpeek is one free option:
pip install mixpeek
from mixpeek import Mixpeek
mix = Mixpeek(
api_key="my-api-key"
)
mix.upload(file_name="design_spec.dwg", file_path="s3://design_spec_1.dwg")
This /upload endpoint will extract the contents of your DWG file, then when you search for terms it will include the file_path so you can render it in your HTML.
Behind the scenes it uses the open source LibreDWG library to run a number of AutoCAD native commands such as DATAEXTRACTION.
Now you can search for a term and the relevant DWG file (in addition to the context in which it exists) will be returned:
mix.search(query="retainer", include_context=True)
[
{
"file_id": "6377c98b3c4f239f17663d79",
"filename": "design_spec.dwg",
"context": [
{
"texts": [
{
"type": "text",
"value": "DV-34-"
},
{
"type": "hit",
"value": "RETAINER"
},
{
"type": "text",
"value": "."
}
]
}
],
"importance": "100%",
"static_file_url": "s3://design_spec_1.dwg"
}
]
More documentation here: https://docs.mixpeek.com/

parsing JSON file using telegraf input plugin : unexpected Output

I’m new to telegraf and influxdb, and currently looking forward to exploring telegraf, but unfortunetly, I have some difficulty getting started, I will try to explain my problem below:
Objectif: parsing JSON file using telegraf input plugin.
Input : https://wetransfer.com/downloads/0abf7c609d000a7c9300dc20ee0f565120200624164841/ab22bf ( JSON file used )
The input json file is a repetition of the same structure that starts from params and ends at it.
you find below the main part of the input file :
{
"events":[
{
"params":[
{
"name":"element_type",
"value":"Home_Menu"
},
{
"name":"element_id",
"value":""
},
{
"name":"uuid",
"value":"981CD435-E6BC-01E6-4FDC-B57B5CFA9824"
},
{
"name":"section_label",
"value":"HOME"
},
{
"name":"element_label",
"value":""
}
],
"libVersion":"4.2.5",
"context":{
"locale":"ro-RO",
"country":"RO",
"application_name":"spresso",
"application_version":"2.1.8",
"model":"iPhone11,8",
"os_version":"13.5",
"platform":"iOS",
"application_lang_market":"ro_RO",
"platform_width":"320",
"device_type":"mobile",
"platform_height":"480"
},
"date":"2020-05-31T09:38:55.087+03:00",
"ssid":"spresso",
"type":"MOBILEPAGELOAD",
"user":{
"anonymousid":"6BC6DC89-EEDA-4EB6-B6AD-A213A65941AF",
"userid":"2398839"
},
"reception_date":"2020-06-01T03:02:49.613Z",
"event_version":"v1"
}
Issue : Following the documentation, I tried to define a simple telegraf.conf file as below:
[[outputs.influxdb_v2]]
…
[[inputs.file]]
files = ["/home/mouhcine/json/file.json"]
json_name_key = "My_json"
#... Listing all the string fields in the json.(I put only these for simplicity reason).
json_string_fields = ["ssid","type","userid","name","value","country","model"]
data_format = "json"
json_query= "events"
Basically declaring string fields in the telegraf.conf file would do it, but I couldn’t get all the fields that are subset in the json file, like for example what’s inside ( params or context ).
So finally, I get to parse fields with the same level of hierarchy as ssid, type, libVersion, but not the ones inside ( params, context, user).
Output : Screen2 ( attachment ).
OUTPUT
By curiosity, I tried to test the documentation’s example, in order to verify whether I get the same expected result, and the answer is no :/, I don’t get to parse the string field in the subset of the file.
The doc’s example below:
Input :
{
"a": 5,
"b": {
"c": 6,
"my_field": "description"
},
"my_tag_1": "foo",
"name": "my_json"
}
telegraf.conf
[[outputs.influxdb_v2]]
…
[[inputs.file]]
files = ["/home/mouhcine/json/influx.json"]
json_name_key = "name"
tag_keys = ["my_tag_1"]
json_string_fields = ["my_field"]
data_format = "json"
Expected Output : my_json,my_tag_1=foo a=5,b_c=6,my_field="description"
The Result I get : "my_field" is missing.
Output: Screen 1 ( attachement ).
OUTPUT
By the way, I use the influxdb cloud 2, and I apologize for the long description of this little problem, I would appreciate some help please :), Thank you so much in advance.

What are the GLTF animations sampler input/output values?

I am reading the specification, but I can not understand the properties of the sampler.
This is the animation that I have
"animations" : [
{
"channels" : [
{
"sampler" : 0,
"target" : {
"node" : 0,
"path" : "translation"
}
}
],
"name" : "00001_2780.datAction",
"samplers" : [
{
"input" : 9,
"interpolation" : "CUBICSPLINE",
"output" : 10
}
]
},
{
"channels" : [
{
"sampler" : 0,
"target" : {
"node" : 1,
"path" : "translation"
}
}
],
"name" : "00002_2780.datAction",
"samplers" : [
{
"input" : 9,
"interpolation" : "CUBICSPLINE",
"output" : 11
}
]
}
],
What I can not understand is what are the values 9 and 10 for the first sample and 9 and 11 for the second
All that we have in the specification is
https://github.com/KhronosGroup/glTF/tree/master/specification/2.0#animations
Each of the animation's samplers defines the input/output pair: a set of floating point scalar values representing linear time in seconds; and a set of vectors or scalars representing animated property.
And this makes it more unclear to me.
Is there a more detailed explanation about what input/output values are and what they represent. What will happen for example if I change the input from 9 to 99 or to 9.9 or to 0.9 or to 0. How will this change the animation?
Thanks
The numbers 9 and 10 here are glTF accessor index ID values. If you decode accessor index 9, you'll find the list of times for each of the keyframes of the animation. If you decode accessor 10, normally you would expect to find the list of values for the keyframes. But since this is CUBICSPLINE, accessor 10 will contain the in-tangent, value, and out-tangent for each keyframe.
One way to investigate glTF files like this is to use the glTF Tools extension for VSCode. You can right-click the input or output value and choose Go To Definition to get to the accessor in question, and choose Go To Definition again to decode it. (Disclaimer, I'm a contributor to glTF Tools).

Combine two Groovy maps and get differences [duplicate]

This question already has an answer here:
Compare two maps and find differences using Groovy or Java
(1 answer)
Closed 6 years ago.
I have two maps and I would like to combine this maps into new map with the differences. I mean with differences is if startDate or appId are different for that cuInfo and service.
Notice: The attributes cuInfo and service are unique in both maps.
Map 1:
[
[name:"Apple",cuInfo:"T12",service:"3",startDate:"14-02-16 10:00",appId:"G12351"],
[name:"Apple",cuInfo:"T13",service:"3",startDate:"14-01-16 13:00",appId:"G12352"],
[name:"Apple",cuInfo:"T16",service:"3",startDate:"14-01-16 13:00",appId:"G12353"],
[name:"Google",cuInfo:"T14",service:"9",startDate:"10-01-16 11:20",appId:"G12301"],
[name:"Microsoft",cuInfo:"T15",service:"10",startDate:"26-02-16 10:20",appId:"G12999"]
]
Map 2:
[
[cuInfo:"T12",service:"3",startDate:"14-01-16 13:22",appId:"G12355"],
[cuInfo:"T13",service:"3",startDate:"12-02-16 13:00",appId:"G12356"],
[cuInfo:"T14",service:"9",startDate:"10-01-16 11:20",appId:"G12300"],
[cuInfo:"T15",service:"10",startDate:"26-02-16 10:20",appId:"G12999"]
]
I would like to get map below as final result:
[
[name:"Apple",cuInfo:"T12",service:"3",startDate:"14-02-16 10:00",appId:"G12351",startDate2:"14-01-16 13:22",appId2:"G12355"],
[name:"Apple",cuInfo:"T13",service:"3",startDate:"14-01-16 13:00",appId:"G12352",startDate2:"12-02-16 13:00",appId2:"G12356"],
[name:"Google",cuInfo:"T14",service:"9",startDate:"10-01-16 11:20",appId:"G12301",startDate2:"10-01-16 11:20",appId2:"G12300"]
]
I tried the code below but it gives not what I want:
def total = [list1: list1, list2: list2].collectEntries { label, maps ->
[(label): maps.countBy { it.iccId} ]
}.inject([:]) { result, label, counts ->
counts.entrySet().each { entry ->
if(!result[entry.key]) result[entry.key] = [:]
result[entry.key][(label)] = entry.value
}
result
}.collect { iccId , counts -> [iccId: iccId] << counts }
The code above give me this, it merge the maps and count the cuInfos but I'm not able to get the differences:
[
[cuInfo:T12, list1:1, list2:1], [cuInfo:T13, list1:1, list2:1], [cuInfo:T14, list1:1, list2:1], [cuInfo:T15, list1:1, list2:1],[cuInfo:T16, list1:1]
]
Please who can help to get it work please with example. Thanks
Is that what you're looking for:
def col1 = [
[name:"Apple",cuInfo:"T12",service:"3",startDate:"14-02-16 10:00",appId:"G12351"],
[name:"Apple",cuInfo:"T13",service:"3",startDate:"14-01-16 13:00",appId:"G12352"],
[name:"Apple",cuInfo:"T16",service:"3",startDate:"14-01-16 13:00",appId:"G12353"],
[name:"Google",cuInfo:"T14",service:"9",startDate:"10-01-16 11:20",appId:"G12301"],
[name:"Microsoft",cuInfo:"T15",service:"10",startDate:"26-02-16 10:20",appId:"G12999"]
]
def col2 = [
[cuInfo:"T12",service:"3",startDate:"14-01-16 13:22",appId:"G12355"],
[cuInfo:"T13",service:"3",startDate:"12-02-16 13:00",appId:"G12356"],
[cuInfo:"T14",service:"9",startDate:"10-01-16 11:20",appId:"G12300"],
[cuInfo:"T15",service:"10",startDate:"26-02-16 10:20",appId:"G12999"]
]
col1
.findAll { it.cuInfo in col2.cuInfo && !(it.subMap('cuInfo', 'service', 'startDate', 'appId') in col2) }
.collect {
def e = col2.find { i2 -> it.cuInfo == i2.cuInfo }
it << [startDate2: e.startDate, appId2: e.appId]
}
.each {
println it
}
?

Resources