SageMaker Random Cut Forest Training with Validation - machine-learning

I've been struggling for some days with the SageMaker built-in RCF algorithm.
I would like to validate the model during training, but there may be something I haven't understood correctly.
First, fitting with only the training channel works fine:
container = sagemaker.image_uris.retrieve("randomcutforest", region, "us-east-1")
print(container)

rcf = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    sagemaker_session=sagemaker.Session(),
    instance_type="ml.m4.xlarge",
    data_location=f"s3://{bucket}/{prefix}/",
    output_path=f"s3://{bucket}/{prefix}/output",
)

rcf.set_hyperparameters(
    feature_dim=116,
    eval_metrics='precision_recall_fscore',
    num_samples_per_tree=256,
    num_trees=100,
)

train_data = sagemaker.inputs.TrainingInput(
    s3_data=train_location,
    content_type='text/csv;label_size=0',
    distribution='ShardedByS3Key',
)
rcf.fit({'train': train_data})
[06/28/2021 09:45:24 INFO 140226936620864] Test data is not provided.
#metrics {"StartTime": 1624873524.6154933, "EndTime": 1624873524.6156445, "Dimensions": {"Algorithm": "RandomCutForest", "Host": "algo-1", "Operation": "training"}, "Metrics": {"setuptime": {"sum": 40.169477462768555, "count": 1, "min": 40.169477462768555, "max": 40.169477462768555}, "totaltime": {"sum": 13035.491704940796, "count": 1, "min": 13035.491704940796, "max": 13035.491704940796}}}
2021-06-28 09:45:50 Completed - Training job completed
ProfilerReport-1624873226: NoIssuesFound
Training seconds: 78
Billable seconds: 78
But when I want to validate my model during training:
train_data = sagemaker.inputs.TrainingInput(
    s3_data=train_location,
    content_type='text/csv;label_size=0',
    distribution='ShardedByS3Key',
)
val_data = sagemaker.inputs.TrainingInput(
    s3_data=val_location,
    content_type='text/csv;label_size=1',
    distribution='FullyReplicated',
)
rcf.fit({'train': train_data, 'validation': val_data}, wait=True)
I get the error:
AWS Region: us-east-1
RoleArn: arn:aws:iam::517714493426:role/service-role/AmazonSageMaker-ExecutionRole-20210409T152960
382416733822.dkr.ecr.us-east-1.amazonaws.com/randomcutforest:1
2021-06-28 10:14:12 Starting - Starting the training job...
2021-06-28 10:14:14 Starting - Launching requested ML instancesProfilerReport-1624875252: InProgress
......
2021-06-28 10:15:27 Starting - Preparing the instances for training.........
2021-06-28 10:17:07 Downloading - Downloading input data...
2021-06-28 10:17:27 Training - Downloading the training image..Docker entrypoint called with argument(s): train
Running default environment configuration script
[06/28/2021 10:17:53 INFO 140648505521984] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'num_samples_per_tree': 256, 'num_trees': 100, 'force_dense': 'true', 'eval_metrics': ['accuracy', 'precision_recall_fscore'], 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999}
[06/28/2021 10:17:53 INFO 140648505521984] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'num_trees': '100', 'num_samples_per_tree': '256', 'feature_dim': '116', 'eval_metrics': 'precision_recall_fscore'}
[06/28/2021 10:17:53 INFO 140648505521984] Final configuration: {'num_samples_per_tree': '256', 'num_trees': '100', 'force_dense': 'true', 'eval_metrics': 'precision_recall_fscore', 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999, 'feature_dim': '116'}
[06/28/2021 10:17:53 ERROR 140648505521984] Customer Error: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)
Caused by: Additional properties are not allowed ('validation' was unexpected)
Failed validating 'additionalProperties' in schema:
{'$schema': 'http://json-schema.org/draft-04/schema#',
'additionalProperties': False,
'definitions': {'data_channel_replicated': {'properties': {'ContentType': {'type': 'string'},
'RecordWrapperType': {'$ref': '#/definitions/record_wrapper_type'},
'S3DistributionType': {'$ref': '#/definitions/s3_replicated_type'},
'TrainingInputMode': {'$ref': '#/definitions/training_input_mode'}},
'type': 'object'},
'data_channel_sharded': {'properties': {'ContentType': {'type': 'string'},
'RecordWrapperType': {'$ref': '#/definitions/record_wrapper_type'},
'S3DistributionType': {'$ref': '#/definitions/s3_sharded_type'},
'TrainingInputMode': {'$ref': '#/definitions/training_input_mode'}},
'type': 'object'},
'record_wrapper_type': {'enum': ['None', 'Recordio'],
'type': 'string'},
's3_replicated_type': {'enum': ['FullyReplicated'],
'type': 'string'},
's3_sharded_type': {'enum': ['ShardedByS3Key'],
'type': 'string'},
'training_input_mode': {'enum': ['File', 'Pipe'],
'type': 'string'}},
'properties': {'state': {'$ref': '#/definitions/data_channel'},
'test': {'$ref': '#/definitions/data_channel_replicated'},
'train': {'$ref': '#/definitions/data_channel_sharded'}},
'required': ['train'],
'type': 'object'}
On instance:
{'train': {'ContentType': 'text/csv;label_size=0',
'RecordWrapperType': 'None',
'S3DistributionType': 'ShardedByS3Key',
'TrainingInputMode': 'File'},
'validation': {'ContentType': 'text/csv;label_size=1',
'RecordWrapperType': 'None',
'S3DistributionType': 'FullyReplicated',
'TrainingInputMode': 'File'}}
2021-06-28 10:18:10 Uploading - Uploading generated training model
2021-06-28 10:18:10 Failed - Training job failed
ProfilerReport-1624875252: Stopping
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-34-c624ace00c69> in <module>
33
34
---> 35 rcf.fit({'train': train_data, 'validation': val_data}, wait=True)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
680 self.jobs.append(self.latest_training_job)
681 if wait:
--> 682 self.latest_training_job.wait(logs=logs)
683
684 def _compilation_job_name(self):
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
1623 # If logs are requested, call logs_for_jobs.
1624 if logs != "None":
-> 1625 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
1626 else:
1627 self.sagemaker_session.wait_for_job(self.job_name)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
3679
3680 if wait:
-> 3681 self._check_job_status(job_name, description, "TrainingJobStatus")
3682 if dot:
3683 print()
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3243 ),
3244 allowed_statuses=["Completed", "Stopped"],
-> 3245 actual_status=status,
3246 )
3247
UnexpectedStatusException: Error for Training job randomcutforest-2021-06-28-10-14-12-783: Failed. Reason: ClientError: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)
Caused by: Additional properties are not allowed ('validation' was unexpected)
Failed validating 'additionalProperties' in schema:
{'$schema': 'http://json-schema.org/draft-04/schema#',
'additionalProperties': False,
'definitions': {'data_channel_replicated': {'properties': {'ContentType': {'type': 'string'},
'RecordWrapperType': {'$ref': '#/definitions/record_wrapper_type'},
'S3DistributionType': {'$ref': '#/definitions/s3_replicated_type'},
'TrainingInputMode': {'$ref': '#/definitions/training_input_mode'}},
'type': 'object'},
'data_channel_sharded': {'properties': {'ContentType': {'type': 'string'},
Could someone help me implement this validation during training correctly?
That would be the best thing that could happen to me. :-D
Kind regards,
Christina

I found the error: instead of 'validation', you need to name the channel 'test'; then it works:
rcf.fit({'train': train_data, 'test': test_data}, wait=True)
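For what it's worth, the schema in the error log spells out the rule: only 'train' (which is required) and 'test' are accepted channel names for the built-in RCF container. A quick self-contained sketch of that rule - the function name and check here are illustrative only, not part of the SageMaker SDK:

```python
# Channel names accepted by the built-in RCF container, per the schema
# in the error log: 'train' is required, 'test' is the only other one.
ALLOWED_CHANNELS = {"train", "test"}

def check_channels(channels):
    """Raise ValueError for channel names the RCF input schema rejects."""
    unexpected = set(channels) - ALLOWED_CHANNELS
    if unexpected:
        raise ValueError(f"Unexpected channel(s): {sorted(unexpected)}")
    if "train" not in channels:
        raise ValueError("'train' channel is required")
    return True

check_channels({"train": "s3://...", "test": "s3://..."})  # passes
# check_channels({"train": "...", "validation": "..."})    # raises ValueError
```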

Related

High charts Issue with stock bar chart multiple line series

It works fine when all the data is available, but the issue happens when we select a range in the navigator bar at the bottom (Angular 11).
I don't know why this issue occurs.
Code in Angular:
var datas = [{
data: [
[1630175400000, 0],
[1630089000000, 0.47],
[1630002600000, -0.48],
[1629916200000, 0.38],
[1629829800000, 0.18],
[1629743400000, 0.91],
[1629657000000, 0.01],
[1629570600000, -0.2],
[1629484200000, 0.01],
[1629397800000, -0.66],
[1629311400000, 0.04],
[1629225000000, -0.63],
[1629138600000, 0.07],
[1629052200000, -0.02],
[1628965800000, 0],
[1628879400000, 0.24],
[1628793000000, -0.45],
[1628706600000, 0.21],
[1628620200000, -0.04],
[1628533800000, -0.34],
[1628447400000, -0.08],
[1628361000000, 0.03],
[1628274600000, -0.23],
[1628188200000, 0.29],
[1628101800000, -0.19],
[1628015400000, 0.2],
[1627929000000, -0.13],
[1627842600000, -0.06],
[1627756200000, 0.02],
[1627669800000, -0.36]
],
id: "base1",
name: "Avg. growth rate",
type: "line"
}, {
data: [
[1628533800000, 117.442863],
[1630175400000, 117.476804],
[1630089000000, 117.476804],
[1630002600000, 116.930384],
[1629916200000, 117.488726],
[1629829800000, 117.039701],
[1629743400000, 116.834498],
[1629657000000, 115.777653],
[1629570600000, 115.764878],
[1629484200000, 115.996878],
[1629397800000, 115.988679],
[1629311400000, 116.764601],
[1629225000000, 116.7125],
[1629138600000, 117.458283],
[1629052200000, 117.377938],
[1628965800000, 117.395677],
[1628879400000, 117.395677],
[1628793000000, 117.116852],
[1628706600000, 117.64148],
[1628620200000, 117.392843],
[1628533800000, 117.442863],
[1628447400000, 117.841829],
[1628361000000, 117.933245],
[1628274600000, 117.902974],
[1628188200000, 118.170114],
[1628101800000, 117.826993],
[1628015400000, 118.045463],
[1627929000000, 117.811225],
[1627842600000, 117.968985],
[1627756200000, 118.045426],
[1627669800000, 118.024255]
],
id: "base2",
name: "IN (GBP)"
type: "line"
}]
Thanks in advance.
Live example with the issue: http://jsfiddle.net/BlackLabel/zwsLxyac/
You have unsorted data which causes Highcharts error #15: https://assets.highcharts.com/errors/15/
You need to sort your data before it is passed to a chart:
datas.forEach((series, index) => {
series.data.sort((a, b) => a[0] - b[0]);
});
Live demo: http://jsfiddle.net/BlackLabel/0nuezd8b/
API Reference: https://api.highcharts.com/highstock/series.line.data

How to properly query Postgresql JSONB array of hashes on Ruby on Rails 6?

This is my column:
[
{ id: 1, value: 1, complete: true },
{ id: 2, value: 1, complete: false },
{ id: 3, value: 1, complete: true }
]
First, is there a "correct" way to work with a jsonb schema? Should I redesign it to work with a single JSON object instead of the array of hashes?
I have about 200 entries in the database, and the status column holds 200 of those items.
How would I perform a query to get the count of true/false?
How can I query for ALL complete items? I can query for the database rows in which the JSON has a complete item, but I can't query for all the items, in all rows of the database, that are complete.
I appreciate the help, thank you.
Aha! I found it here:
https://levelup.gitconnected.com/how-to-query-a-json-array-of-objects-as-a-recordset-in-postgresql-a81acec9fbc5
Say your dataset is like this:
[{
"productid": "3",
"name": "Virtual Keyboard",
"price": "150.00"
}, {
"productid": "1",
"name": "Dell 123 Laptop Computer",
"price": "1300.00"
},
{
"productid": "8",
"name": "LG Ultrawide Monitor",
"price": "190.00"
}]
The proper way to count it is like this:
select items.name, count(*) as num
from purchases,
     jsonb_to_recordset(purchases.items_purchased) as items(name text)
group by items.name
order by num desc;
Works like a charm and is extremely fast.
To do it in Rails, you need to use Model.find_by_sql(....) and put your SELECT there. I'm sure there are probably better ways to do it.
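Applied to the column from the original question, the same pattern gives the counts directly. A sketch only - it assumes the rows live in a table called entries with the jsonb array in a column called status (both names are made up here):

```sql
-- Count complete = true vs. false across all items in all rows
select items.complete, count(*) as num
from entries,
     jsonb_to_recordset(entries.status) as items(complete boolean)
group by items.complete;

-- All complete items, across all rows
select items.*
from entries,
     jsonb_to_recordset(entries.status) as items(id int, value int, complete boolean)
where items.complete;
```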

Youtube debug info field details

I am doing some R&D on YouTube video details. When we play a video on YouTube and right-click, we can see an option called "Copy debug info". If we copy that, we get a lot of fields, as shown below. I am curious to know the details of these fields.
{
"ns": "yt",
"el": "detailpage",
"cpn": "TA1LSqRVROm9Q2rb",
"docid": "tPDj7FhbUso",
"ver": 2,
"referrer": "https://www.youtube.com/feed/history",
"cmt": "208.944",
"ei": "Ups5X-iDO8ng4-EP7o6KwAg",
"fmt": "247",
"fs": "0",
"rt": "151.471",
"of": "yuFWq23SkzutVGx461bO4g",
"euri": "",
"lact": 7,
"cl": "326301777",
"mos": 0,
"state": "4",
"vm": "CAEQARgEKiBsUmpoTXRxc1czUTVHZ2RJbmktOXBNdnY3X3JnV3ItNjoyQUdiNlo4T3BuV0tmTXhwbW5wWDZjUDA3X3JPYU5PXzVSd1Bha2szZE9jNEhCY0ZQTkE",
"volume": 100,
"subscribed": "1",
"cbr": "Chrome",
"cbrver": "84.0.4147.125",
"c": "WEB",
"cver": "2.20200814.00.00",
"cplayer": "UNIPLAYER",
"cos": "Macintosh",
"cosver": "10_15_6",
"hl": "en_US",
"cr": "IN",
"len": "268.121",
"fexp": "23744176,23804281,23839597,23856950,23857950,23858057,23859802,23862346,23868323,23880389,23882502,23883098,23884386,23890960,23895671,23900945,23907595,23911055,23915993,23916148,23918272,23918598,23927906,23928508,23930220,23931938,23934047,23934090,23934970,23936412,24631210,3300107,3300133,3300161,3313321,3316358,3316377,3317374,3317643,3318816,3318887,3318889,3319024,9405957,9449243",
"afmt": "251",
"inview": "NaN",
"vct": "208.944",
"vd": "268.121",
"vpl": "207.000-208.944",
"vbu": "204.000-268.121",
"vpa": "1",
"vsk": "0",
"ven": "0",
"vpr": "1",
"vrs": "4",
"vns": "2",
"vec": "null",
"vemsg": "",
"vvol": "1",
"vdom": "1",
"vsrc": "1",
"vw": 1159,
"vh": 652,
"creationTime": 158827.80500000808,
"totalVideoFrames": 128,
"droppedVideoFrames": 0,
"corruptedVideoFrames": 0,
"lct": "208.944",
"lsk": false,
"lmf": false,
"lbw": "993748.652",
"lhd": "0.057",
"lst": "0.000",
"laa": "itag=251,type=3,seg=26,range=3299085-3408885,time=260.0-268.1,off=0,len=109801,end=1,eos=1",
"lva": "itag=247,type=3,seg=53,range=17411843-17472031,time=264.0-268.1,off=0,len=60189,end=1,eos=1",
"lar": "itag=251,type=3,seg=26,range=3299085-3408885,time=260.0-268.1,off=0,len=109801,end=1,eos=1",
"lvr": "itag=247,type=3,seg=53,range=17411843-17472031,time=264.0-268.1,off=0,len=60189,end=1,eos=1",
"lab": "200.001-268.121",
"lvb": "204.000-268.080",
"ismb": 3000000,
"relative_loudness": "-5.140",
"optimal_format": "720p",
"user_qual": "auto",
"debug_videoId": "tPDj7FhbUso",
"0sz": false,
"op": "",
"yof": false,
"dis": "",
"gpu": "Intel(R)_UHD_Graphics_630",
"cgr": true,
"debug_playbackQuality": "hd720",
"debug_date": "Mon Aug 17 2020 02:19:46 GMT+0530 (India Standard Time)"
}
For example, the docid field is known to be the YouTube video ID. Likewise, I want to know the details of the other fields. If anyone can help me with that, it would be great.
I doubt there is an easy way to do this (AFAIK, no YouTube Data API exposes such values), but if you really want to check, you can open view-source:https://www.youtube.com/watch?v=<VIDEO_ID> - where <VIDEO_ID> is the ID of the YouTube video.
Some of the values from the copy debug info can be found there - but not all of them, I'm afraid.

LUA: add values in nested table

I have a performance issue in my application. I would like to gather some ideas on what I can do to improve it. The application is very simple: I need to add up values inside a nested table to get the total a user wants to pay out of all the pending payments. The user chooses a number of payments and I calculate how much they will pay.
This is what I have:
jsonstr = [[{ "name": "John",
"surname": "Doe",
"pending_payments": [
{
"month": "january",
"amount": 50,
},
{
"month": "february",
"amount": 40,
},
{
"month": "march",
"amount": 45,
},
]
}]]
local lunajson = require 'lunajson'
local t = lunajson.decode(jsonstr)
local limit -- I get this from the user
local total = 0;
for i = 1, limit, 1 do
    total = total + t.pending_payments[i].amount;
end;
It works. At the end I get what I need. However, I notice that it takes ages to do the calculation. Each JSON has only twelve pending payments (one per month). It takes between two and three seconds to come up with a result! I tried on different machines and with Lua 5.1, 5.2, and 5.3, and the result is the same.
Can anyone please suggest how I can implement this better?
Thank you!
For this simple string, try the test code below, which extracts the amounts directly from the string, without a json parser:
jsonstr = [[{ "name": "John",
"surname": "Doe",
"pending_payments": [
{
"month": "january",
"amount": 50,
},
{
"month": "february",
"amount": 40,
},
{
"month": "march",
"amount": 45,
},
]
}]]
for limit = 0, 4 do
    local total = 0
    local n = 0
    for a in jsonstr:gmatch('"amount":%s*(%d+),') do
        n = n + 1
        if n > limit then break end
        total = total + tonumber(a)
    end
    print(limit, total)
end
I found that the delay had nothing to do with the calculation in Lua. It was related to a configurable delay in the retrieval of the limit variable.
I have nothing to share here related to the question asked, since the problem was actually in an external element.
Thanks #lfh for your replies.

Adding not validated dictionary with python eve along with an image

I want to import images into MongoDB along with an arbitrary dictionary. The dictionary should provide image tags, whose types, numbers, and names I can't know at the moment I define the schema.
I'm trying to add such a dictionary in Eve, without success:
curl -F "attr={\"a\":1}" -F "img_id=2asdasdasd" -F "img_data=@c:\path\1.png;type=image/png" http://127.0.0.1:5000/images
{"_status": "ERR", "_issues": {"attr": "must be of dict type"}, "_error": {"message": "Insertion failure: 1 document(s)
contain(s) error(s)", "code": 422}}
My schema definition looks like that:
'schema': {
    # Fixed attributes
    'original_name': {
        'type': 'string',
        'minlength': 4,
        'maxlength': 1000,
    },
    'img_id': {
        'type': 'string',
        'minlength': 4,
        'maxlength': 150,
        'required': True,
        'unique': True,
    },
    'img_data': {
        'type': 'media'
    },
    # Additional attributes
    'attr': {
        'type': 'dict'
    }
}
Is it possible at all? Should the schema for dicts be fixed?
EDIT
I wanted to add the image first and the dictionary after it, but I get an error in the PATCH request:
C:\Windows\SysWOW64>curl -X PATCH -i -H "Content-Type: application/json" -d "{\"img_id\":\"asdasdasd\", \"attr\": {\"a\": 1}}" http://localhost:5000/images/asdasdasd
HTTP/1.0 405 METHOD NOT ALLOWED
Content-Type: application/json
Content-Length: 106
Server: Eve/0.7.4 Werkzeug/0.9.4 Python/2.7.3
Date: Wed, 28 Jun 2017 22:55:54 GMT
{"_status": "ERR", "_error": {"message": "The method is not allowed for the requested URL.", "code": 405}}
I posted an issue on GitHub for the same situation. However, I have come up with a workaround.
Override the dict validator:
import json

from eve import Eve
from eve.io.mongo import Validator

class JsonValidator(Validator):
    def _validate_type_dict(self, field, value):
        if isinstance(value, dict):
            return  # already a real dict, nothing to parse
        try:
            json.loads(value)
        except (TypeError, ValueError):
            # _error expects the field name, not the value
            self._error(field, "Invalid JSON")

app = Eve(validator=JsonValidator)
Next, add an insert hook:
def multi_request_json_parser(documents):
    for item in documents:
        if 'my_json_field' in item:
            item['my_json_field'] = json.loads(item['my_json_field'])

app.on_insert_myendpoint += multi_request_json_parser
The dict validator must be overridden because otherwise the insert hook will not be called, due to a validation error.
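The reason the validator/hook pair works: multipart form fields always arrive as strings, so the document needs the string parsed back into a real dict before insertion. A minimal sketch of that round trip outside Eve, using the field names from the question:

```python
import json

# What curl -F "attr={\"a\":1}" actually submits: a string, not a dict.
incoming = {"img_id": "2asdasdasd", "attr": '{"a": 1}'}

# What the insert hook does before the document reaches MongoDB:
if "attr" in incoming:
    incoming["attr"] = json.loads(incoming["attr"])

assert incoming["attr"] == {"a": 1}  # now a real dict, passes 'type': 'dict'
```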