I want to create a confusion matrix as shown here, though I use a different dataset than FashionMNIST. Concretely, I want the axes to contain names such as "banana", "orange", "cucumber", etc. I have a folder where images of bananas, etc. are stored, which I load with PyTorch's ImageFolder class. Now, PyTorch automatically assigns the numbers 0 to 9 to my classes (I have 10 classes in total), but I'm not sure whether the label "banana" always gets the same number (e.g. 0).
How do I find out, when using the ImageFolder class, which index belongs to which label (subfolder name)?
The ImageFolder class has a class_to_idx attribute that maps each class (subfolder name) to its index. Check the docs. Since the subfolder names are sorted before indices are assigned, "banana" will keep the same index as long as the set of subfolders does not change.
import torchvision

x = torchvision.datasets.ImageFolder(root=path, transform=transform)
print(x.class_to_idx)  # e.g. {'banana': 0, 'cucumber': 1, 'orange': 2, ...}
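To put the class names on the confusion matrix axes, you can use the dataset's classes attribute, which lists the subfolder names in index order (the inverse of class_to_idx). A minimal matplotlib sketch, assuming a 10x10 confusion matrix cm has already been computed elsewhere:

import matplotlib.pyplot as plt

# x.classes[i] is the label for class index i
classes = x.classes

fig, ax = plt.subplots()
ax.imshow(cm)  # cm: the precomputed confusion matrix (assumption)
ax.set_xticks(range(len(classes)))
ax.set_xticklabels(classes, rotation=45)
ax.set_yticks(range(len(classes)))
ax.set_yticklabels(classes)
plt.show()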
I am trying to create a model to predict whether or not someone is at risk of a stroke. My data contains some "object" variables that could easily be coded to 0 and 1 (like sex). However, I have some object variables with 4+ categories (e.g. type of job).
I'm trying to encode these objects into integers so that my models can ingest them. I've come across two methods to do so:
Create dummy variables for each feature, which creates more columns and encodes them as 0 and 1
Convert the object into an integer using LabelEncoder, which assigns values to each category like 0, 1, 2, 3, and so on within the same column.
Is there a difference between these two methods? If so, what is the recommended best path forward?
Yes, these two are different. The first method creates more columns, which means more features for the model to fit. The second method creates only one feature. In machine learning, both approaches have their own pros and cons.
Recommending one path depends on the ML algorithm you use, feature importance, etc.
Go the dummy variable route.
Say you have a column that consists of 5 job types: construction worker, data scientist, retail associate, machine learning engineer, and bartender. If you use a label encoder (0-4) to keep your data narrow, your model is going to interpret the job title of "data scientist" as 1 greater than the job title of "construction worker". It would also interpret the job title of "bartender" as 4 greater than "construction worker".
The problem here is that these job types really have no relation to each other as they are purely categorical variables. If you dummy out the column, it does widen your data but you have a far more accurate representation of what the data actually represents.
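To make the difference concrete, here is a small sketch with pandas and scikit-learn; the column and values are made up for illustration:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: job type is purely categorical, with no natural order.
df = pd.DataFrame({"job": ["construction worker", "data scientist",
                           "retail associate", "bartender"]})

# Method 2: label encoding keeps one column but imposes an artificial order.
df["job_encoded"] = LabelEncoder().fit_transform(df["job"])
print(df)

# Method 1: dummy variables widen the data but carry no false ordering.
print(pd.get_dummies(df["job"], prefix="job"))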
Use dummy variables, thereby creating more columns/features for fitting your data. As your data will be scaled beforehand, this will not create problems later.
Overall, the accuracy of any model depends on the number of features involved, and the more informative features we have, the more accurately we can predict.
I am trying to train a neural net to extract specific information from text. I need to find entity information like attributes, dependencies, etc.
The text input will be like this:
doc = "Student has name, surname, id, and number..."
I have preprocessed this doc and have POS tags and dependency information to feed my network. On the other hand, the desired output is like:
Entity: name, ...
Attribute: name, type, ...
Relation: name, cardinality, etc.
I do have a dataset of the desired output, but I have a problem with how to feed it to the system.
I have embedded the X, but I have a problem embedding the Y.
Example of one Y row:
y = [("Student","Course"),0,(("name","id","address"),("title","level","credits"),((0,0,1),(0,0,0)),((0,0,0),(0,0,0)),((0,1,0),(0,1,0))),(("takes"),"Student","Course",((1,"N")),(("N",1)))]
So, I want to create a multiple-input, multiple-output model, but I am stuck on feeding the model.
Overall, I have a structure like a list of lists or tuples. The length of the lists may vary depending on the input.
Any ideas?
In the template for training CRF++, how can I include a custom dictionary.txt file for listed companies, another for popular European foods, for example, or just about any category?
Then provide sample training data for each category whereby it learns how those specific named entities are used within a context for that category.
In this way, I, as well as the system, can be sure it correctly understood how certain named entities are structured in a text, whether a tweet or a Pulitzer-prize-winning news article, instead of providing hundreds of megabytes of data.
This would be rather cool. The model would have a definite dictionary of known entities (which does not need to be expanded) and a statistical approach to how those known entities are structured in human text.
PS - Just for clarity, I am not yearning for a regex NER. Those are only cool if you have a large dictionary, lots of rules, and lots of dull time.
I think what you are talking about is a gazetteer list (dictionary.txt).
You would have to include a corresponding feature for each word in the training data and then specify it in the template file.
For example: your list contains the entity Hershey's,
and the training data has the sentence: I like Hershey's chocolates.
So when you arrange the data in CoNLL format (for CRF++), you can add a column (with values 0 or 1, indicating whether the word is present in the dictionary) which will have the value 0 for all words except Hershey's.
You also have to include this column as a feature in the template file.
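A minimal sketch of how such a dictionary column could be generated before training; the file name dictionary.txt and the whitespace tokenization are assumptions for illustration:

# Load the gazetteer (one entry per line) into a set for fast lookup.
with open("dictionary.txt") as f:
    gazetteer = {line.strip() for line in f if line.strip()}

sentence = "I like Hershey's chocolates ."

# Emit one CoNLL-style row per token: the word plus a 0/1 dictionary feature.
# A real pipeline would carry the POS and label columns alongside as well.
for token in sentence.split():
    print(f"{token}\t{1 if token in gazetteer else 0}")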
To get a better understanding of the template file and NER training with CRF++, you can watch the videos below and comment with your doubts :)
1) https://youtu.be/GJHeTvDkIaE
2) https://youtu.be/Ur5umC4BwN4
EDIT: (after viewing the OP's comment)
Sample Training Data with extra features: https://pastebin.com/fBgu8c67
I've added 3 features. The IsCountry feature value (1 or 0) can be obtained from a gazetteer list of countries. The other 2 features can be computed offline. Note that the headers are added in the file for reference only and should not be included in the training data file.
Sample Template File for the above data : https://pastebin.com/LPvAGCVL
Note that the test data should also be in the same format as the training data, with the same features / same number of columns.
Hello, I am new to WEKA and am using WEKA 3.6.10.
Sorry if the answer to this question is something obvious.
I have a dataset containing 10 attributes and one decision class. The decision class is composed of the values {1,2,3,4}. Is there a way to change the configuration so that the values are considered as {1} and {2,3,4} (binary) rather than each value separately, without modifying the other attributes?
I had a look at the WEKA filters but did not find anything useful.
Thanks guys
Use an Unsupervised Attribute filter, e.g. the NominalToBinary filter (NumericToBinary only maps zero vs. nonzero values, which is not what you want here). In the topmost field of the configuration dialog, enter the position of the "Decision class" attribute. If it is in the 8th column, enter 8.
The filter will create "dummy variable" columns for each unique value of this attribute. If there are 4 unique values, after applying this filter your dataset will have 4 additional columns. Remove 3 of them.
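If it helps to see the target regrouping concretely, here is the equivalent transformation in pandas, outside WEKA and purely for illustration (the column name is made up):

import pandas as pd

df = pd.DataFrame({"decision": [1, 2, 3, 4, 2, 1]})

# Map {1} -> 1 and {2,3,4} -> 0, leaving all other attributes untouched.
df["decision_binary"] = (df["decision"] == 1).astype(int)
print(df)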
Suppose we have a categorical variable X which can take on 10 values. There are counts inside each of these 10 categories. I want to see whether there are correlations between categories. How would I do this in SPSS? Is there a way to split X into 10 subvariables?
I go to Analyze ---> Correlate ---> Bivariate and can only find the variable X (not the 10 categories).
It sounds like you have a single variable with mutually exclusive categories. If that is the case, then whenever the variable equals a particular category, it by definition does not equal any other category. Therefore, it makes no sense to correlate such a variable.
If you do not have mutually exclusive categories (i.e., you have what is sometimes called a multi-variable) then your 10 response options would be represented as 10 separate variables in SPSS. You could then potentially use Analyze - correlate - bivariate to examine relationships between category co-occurrence.
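To see why the mutually exclusive case is degenerate, consider this small pandas sketch (data made up): dummy variables derived from one exclusive variable are negatively correlated by construction, so their correlations only restate that the categories exclude one another:

import pandas as pd

# A single categorical variable with mutually exclusive values.
x = pd.Series(["a", "b", "c", "a", "b", "c", "a"])

# One 0/1 column per category.
dummies = pd.get_dummies(x).astype(int)

# A 1 in one column forces a 0 in every other column,
# so every pairwise correlation is negative by construction.
print(dummies.corr())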