Encog: Is it possible to make a field just informative? - machine-learning

Is it possible to add to my train/eval CSVs a field that is neither input nor output, just informative? Like additional information attached to each line.

Yes, just do not map that field to an input neuron. How you do that depends on how you are preprocessing the data. Encog does not force you to use any data; if you do not want to use a field, don't put it into the training dataset. Without more information, that is about as clear as I can be.
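For illustration, here is a minimal preprocessing sketch in Python/pandas (not Encog itself), with hypothetical file and column names, showing how an informative column can be kept around without ever entering the training data:

```python
import pandas as pd

# Hypothetical CSV with two inputs, one output, and an informative "note" column.
df = pd.read_csv("train.csv")

# Keep the informative column around for reference, but exclude it from training.
notes = df["note"]                         # assumed informative column name
X = df[["feature_1", "feature_2"]].values  # assumed input columns
y = df[["target"]].values                  # assumed output column
# X and y are what you would map to the network's input/ideal arrays;
# "note" never reaches the network because it is simply not in the dataset.
```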

Related

Will one-hot encoded data in H2O affect the model somehow?

I have one-hot encoded the data separately (there are multiple categories under a single main variable, and 30 variables). I want to know if this will affect GBM, GLM, and DRF in H2O. The documentation says that XGBoost internally encodes to one-hot. For deep learning models I could maybe use the all-factor-levels parameter, but I cannot find out how to stop the implicit one-hot encoding, or whether I can let it be because the results will be the same.
A little detail on why I needed to preprocess and encode: the original data consists of 30 columns, each row is one participant's answers, and each cell holds multiple categories as strings separated by newlines. The logical solution was to one-hot encode using dummy coding, splitting each cell and encoding the pieces into columns. The columns are now 150 and the rows are 250. I want to find out whether one-hot encoded data is handled automatically in H2O.
I have read the documentation and a tutorial published on amazonaws; maybe I am missing something.
If you have categorical columns, you don't need to encode them. You just need to make sure that the column is read in as enum and not int. For Deep Learning, if you want to use all factor levels of the categorical columns, just set the parameter use_all_factor_levels=True/true/TRUE for Python, Java, or R.
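In the Python API, that looks roughly like the sketch below (file, column, and target names are hypothetical):

```python
import h2o
from h2o.estimators import H2ODeepLearningEstimator

h2o.init()
df = h2o.import_file("survey.csv")  # hypothetical file

# Make sure the categorical column is read as enum (factor), not int.
df["answer_category"] = df["answer_category"].asfactor()  # assumed column name

# use_all_factor_levels=True keeps every factor level as its own input.
model = H2ODeepLearningEstimator(use_all_factor_levels=True)
model.train(x=["answer_category"], y="target", training_frame=df)  # assumed target
```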

When to use mlflow.set_tag() vs mlflow.log_params()?

I am confused about the use case of mlflow.set_tag() vs mlflow.log_params(), as both take a key and value pair. Currently, I use mlflow.set_tag() to set tags for the data version, code version, etc., and mlflow.log_params() to set model training parameters like loss, accuracy, optimizer, etc.
As teedak8s pointed out in the comments, tags and params are supposed to log different things. Params are something you want to tune based on the metrics, whereas tags are extra information that isn't necessarily associated with the model's performance. See how tags and params are used differently for sklearn, torch, and other packages in the Automatic Logging documentation. That being said, as I understand it, there's no hard constraint on which to use for what; they can be used interchangeably without error.
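To make the split concrete, here is a minimal sketch (with hypothetical keys and values) of how the two calls, plus log_metric for results, are typically divided:

```python
import mlflow

with mlflow.start_run():
    # Tags: bookkeeping that doesn't drive tuning (data/code versions, etc.).
    mlflow.set_tag("data_version", "v2")      # hypothetical values
    mlflow.set_tag("git_commit", "abc1234")

    # Params: knobs you tune and compare across runs.
    mlflow.log_params({"optimizer": "adam", "learning_rate": 1e-3})

    # Results like loss/accuracy fit log_metric better than log_params,
    # since they are outcomes rather than settings.
    mlflow.log_metric("accuracy", 0.93)
```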

Missing Values in WEKA output

I'm trying to compare J48 and MLP on a variety of datasets using WEKA. One of these is: https://archive.ics.uci.edu/ml/datasets/primary+tumor. I have converted this to CSV form which can be easily imported into WEKA. You can download this file here: https://ufile.io/8nj13
I used the "numeric to nominal" filter on the class and all the attributes to fit the natural structure of the data. However, when I ran J48 (and MLP), I got a bunch of question marks "?" in my output, presumably due to not having enough observations/instances of the appropriate type.
How can I get around this? I'm sure there must be a filter for this kind of thing. I've attached a picture below.
The detailed accuracy table displays a question mark when no instance was actually classified as that specific class. For example, since no instance was classified as class 16, WEKA cannot provide you with details regarding class 16 classifications.
Regarding the number of instances of the appropriate class, you can use the ClassBalancer filter, found at weka/filters/supervised/instance/ClassBalancer. This should help balance out the amounts of the various classes.
Also note that your dataset contains some missing values; this could be solved by either discarding the instances with missing data or running the ReplaceMissingValues filter, found at weka/filters/unsupervised/attribute/ReplaceMissingValues.
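If you ever need the same two fixes outside the WEKA GUI, here is a rough scikit-learn analogue (an assumption, not WEKA's own API): SimpleImputer plays the role of ReplaceMissingValues, and class_weight="balanced" approximates ClassBalancer's instance reweighting.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for J48

# Toy data with a missing value (np.nan) and imbalanced classes.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 7.0]])
y = np.array([0, 0, 0, 1])

# Analogue of ReplaceMissingValues: fill NaNs with the column mean.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Analogue of ClassBalancer: reweight instances so each class
# contributes equally, instead of resampling.
clf = DecisionTreeClassifier(class_weight="balanced").fit(X_filled, y)
print(clf.predict(X_filled))
```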

RapidMiner - unable to apply learning algorithm as Process Documents turns the regular attribute back to text

I have the following process:
Process Documents from Files (where I load the text files with their respective 6 classes) --> this connects to Set Role (which changes the text attribute to a REGULAR attribute to allow machine learning) --> Process Documents from Data (I don't need the word vectors so I uncheck that; I keep the text, and within this process I tokenize, remove stopwords, stem, etc.), and then I feed this into a validation operator (Bayes/SVM).
What is happening is that, in the example set, the text column goes back to type TEXT from regular after running Process Documents from Data. Hence I get the error "Input ExampleSet has no attributes", as there are zero regular attributes, and this causes the process to fail. I have no idea why. If I try to set the role again after this, the error says "No examples in example set".
Please help. I have been stuck on this for two days!
EDIT: I think I know the issue - I was applying a 10-fold X-Validation to a dataset with few examples.
I think I know the issue: I was applying a 10-fold X-Validation to a dataset with very few examples.

How to detect tabular data from a variety of sources

In an experimental project I am playing with, I want to be able to look at textual data and detect whether it contains data in a tabular format. Of course there are a lot of cases that could look like tabular data, so I was wondering what sort of algorithm I'd need to research to look for common features.
My first thought was to write a long switch/case statement that checked for data separated by tabs, then another case for data separated by pipe symbols, then yet another case for data separated in some other way, etc. Of course I realize that I would have to come up with a list of different things to detect - but I wondered if there was a more intelligent way of detecting these features than doing a relatively slow search for each type.
I realize this question isn't especially eloquently put, so I hope it makes some sense!
Any ideas?
(no idea how to tag this either - so help there is welcomed!)
The only reliable scheme would be to use machine learning. You could, for example, train a perceptron classifier on a stack of examples of tabular and non-tabular material.
A mixed solution might be appropriate, i.e. one where you handle the most common/obvious cases with simple heuristics (in a "switch-like" manner) as you suggested, and leave the harder cases to automated learning and other types of classifier logic.
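As a minimal sketch of the perceptron idea (using scikit-learn, with toy training data and hand-picked features as assumptions):

```python
import numpy as np
from sklearn.linear_model import Perceptron

def features(text: str) -> list:
    """Hand-picked features hinting at tabular structure (illustrative only)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    tab_counts = [ln.count("\t") for ln in lines]
    pipe_counts = [ln.count("|") for ln in lines]
    return [
        np.mean(tab_counts) if lines else 0.0,   # average tabs per line
        np.std(tab_counts) if lines else 0.0,    # consistency of tabs across lines
        np.mean(pipe_counts) if lines else 0.0,  # average pipes per line
        len(lines),
    ]

# Tiny hypothetical training set: 1 = tabular, 0 = not.
samples = ["a\tb\tc\n1\t2\t3", "name|age\nbob|32",
           "Just an ordinary sentence.", "More plain prose with no structure."]
labels = [1, 1, 0, 0]

X = np.array([features(s) for s in samples])
clf = Perceptron().fit(X, labels)
print(clf.predict([features("x\ty\n1\t2")]))  # ideally [1]
```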
This assumes that you do not already have defined types stored in the TSV.
A TSV file typically looks like:
[Value1]\t[Value..N]\n
My suggestion would be to:
Count up all the tabs
Count up all the new lines
Count the tabs in the first row
Divide the total number of tabs by the number of tabs in the first row
If the result of step 4 leaves a remainder of 0, then you have a TSV candidate. From there you may want to do one of the following:
You can continue reading the data, ignoring lines with fewer or more than the predicted tabs per line
You can scan each line before reading to make sure all are consistent
You can read up to the line that does not fit the format and then throw an error
Once you have a good prediction of the number of tab-separated values, you can use a regular expression to parse out the values [as a group]. A short Python sketch of this heuristic follows.
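Here is a minimal sketch of the counting heuristic above, in plain Python with no particular library assumed:

```python
def looks_like_tsv(text: str) -> bool:
    """Heuristic: the total tab count should divide evenly by the
    number of tabs in the first row (steps 1-4 above)."""
    lines = [line for line in text.split("\n") if line]  # step 2: the rows
    if not lines:
        return False
    total_tabs = text.count("\t")           # step 1
    first_row_tabs = lines[0].count("\t")   # step 3
    if first_row_tabs == 0:
        return False  # no tabs at all: not a TSV candidate
    # Step 4: a zero remainder means every row could share the first row's shape.
    return total_tabs % first_row_tabs == 0

print(looks_like_tsv("a\tb\tc\n1\t2\t3\n4\t5\t6"))  # True
print(looks_like_tsv("no tabs here\njust prose"))   # False
```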
