Sample Submission on Boston Housing Dataset - machine-learning

I want to enter my submission for the competition at
https://www.kaggle.com/c/boston-dataset/data. But as it turns out, there is a sample submission file.
Now I have tried removing the column names and the index, but I keep getting different errors.
Any help is appreciated!

Try this snippet:
submission = pd.DataFrame({ 'ID': test_data.ID.values, 'medv': predictions })
submission.to_csv("submission.csv", index=False)
where predictions is a list of your predictions for the test set.
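If pandas is not available, the same two-column file can be written with the standard-library csv module. A minimal sketch, using made-up IDs and predictions in place of your real test set:

```python
import csv

# Hypothetical stand-ins for test_data.ID.values and your model's
# predicted medv values -- replace with the real arrays.
ids = [1, 2, 3]
predictions = [24.0, 21.6, 34.7]

# Write the submission Kaggle expects: a header row, then one row per
# test-set ID, with no extra index column.
with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "medv"])
    writer.writerows(zip(ids, predictions))
```

Either way, the key points are the exact header names the competition expects and `index=False` (or simply never writing an index column).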

Related

Text Classification of News Articles Using Spacy

Dataset: CSV files containing around 1500 rows with columns (Text, Labels), where Text is a news article in the Nepali language and Label is its genre (Health, World, Tourism, Weather, and so on).
I am using spaCy to train my text classification model. So far, I have converted the dataset to a dataframe, and then into a spaCy-acceptable format through this code:
dataset['tuples'] = dataset.apply(
    lambda row: (row['Text'], row['Labels']), axis=1)
training_data = dataset['tuples'].tolist()
which gives me the list of tuples in my training dataset like [('text...','label...'),('text...','label...')]
Now, how can I do text classification here?
In spaCy's documentation, I found
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
Do we have to add labels according to our dataset, or should we use POSITIVE/NEGATIVE as in the example? Does spaCy generate the labels from our dataset during training or not?
Any suggestions please?
You have to add your own labels. So, in your case:
textcat.add_label('Health')
textcat.add_label('World')
textcat.add_label('Tourism')
...
spaCy will then be able to predict only those categories that you added in the block of code above.
There is a special format for the training data: each element of your list is a tuple that contains:
Text
A dictionary with a single key, cats, whose value is another dictionary. That inner dictionary contains all your categories as keys, with 1 or 0 as values indicating whether the category applies to the text.
So, your data should look like this:
[('text1', {'cats' : {'category1' : 1, 'category2' : 0, ...}}),
('text2', {'cats' : {'category1' : 0, 'category2' : 1, ...}}),
...]
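Putting the two answers together, the list of (text, label) tuples from the question can be converted into this cats format with a few lines of plain Python. A minimal sketch, using made-up texts and labels in place of the real dataset:

```python
# Hypothetical (text, label) pairs, as produced by the question's
# dataset['tuples'].tolist() step.
training_data = [('text1', 'Health'), ('text2', 'World'), ('text3', 'Health')]

# Every category the classifier should know about
# (the same ones passed to textcat.add_label).
labels = ['Health', 'World', 'Tourism', 'Weather']

# For each example, mark its own label 1 and every other label 0.
spacy_data = [
    (text, {'cats': {label: 1 if label == true_label else 0
                     for label in labels}})
    for text, true_label in training_data
]
```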

Always getting the same prediction for different inputs using MLClassifier in Core ML

I have a pretty simple .csv table:
I use this table as the input when creating a model in Create ML, where the target is PRICE, and CITY and DATE are feature columns. I need to get a price prediction for a given date in the future for a particular city.
The code below gives a different price for different dates, as it should work, however, it gives the same result regardless of the given city:
let prediction = try? model.prediction(
CITY: name, DATE: date
)
let price = prediction?.PRICE
The price for a given date in the future in Paris should not be equal to the price for the same date in New York.
Do I really need to create 2 different models for each of the cities?
Thank you!
Is this all the data you are giving it? You're going to need to give it a lot more to go on.
So, it was a bug in the .csv file: the CITY column was recorded with invisible characters.
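This kind of bug can be caught before training by scanning the CSV for non-printable characters. A minimal Python sketch; the column values and the cleaning rule are assumptions for illustration, not part of the original answer:

```python
import unicodedata

def strip_invisible(value):
    """Remove control and format characters (e.g. zero-width spaces, BOMs)
    that are invisible in a spreadsheet but change the string's identity."""
    return ''.join(
        ch for ch in value
        if unicodedata.category(ch) not in ('Cc', 'Cf')
    )

# Hypothetical CITY values -- the second one hides a zero-width space,
# so 'New York' and this value would be treated as different categories.
cities = ['Paris', 'New\u200bYork']
cleaned = [strip_invisible(c) for c in cities]
```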

SPSS Frequency Plot Complication

I am having a hard time generating precisely the frequency table I am looking for using SPSS.
The data in question: cases (n = ~800) with categorical variables DX_n (n = 1-15), each containing ICD9 codes, many of which are the same code. I would like to create a frequency table that groups the DX_n variables such that I can view frequency of every diagnosis in this sample of cases.
The next step is to test the hypothesis that the clustering of diagnoses in this sample is different than that of another. If you have any advice as to how to test this, that would be really appreciated as well!
Thanks!
Edit: My attempts:
1) Analyze -> Descriptive Statistics -> Frequencies; then add variables DX_n (1-15) and display frequency charts. The output is frequencies of each ICD9 code per DX_n variable (so 15 tables are generated - I'm hoping to just have one grouped table).
2) I tried adjusting the output format to organize by variable and also to compare variables but neither option gives the output I'm looking for.
I think what you are looking for is CTABLES. It can do parallel columns of frequencies, and it includes a column-proportions test that can show whether the distributions differ.
Thank you, JKP! You set me on exactly the right track. I'm not sure how I overlooked that menu. Just to clarify in case anyone else comes along needing to figure this out:
1) Group diagnosis variables into a multiple response set using Analyze > Custom Tables > Multiple Response Sets. Code the variables as categories.
http://i.imgur.com/ipE9suf.png
2) Create a custom table with your new multiple response set as a row and the subsets to compare as columns. I set summary statistics to compute from rows and added the column n% column (sorted descending).
http://i.imgur.com/hptIkfh.png
3) Under test statistics, include a column proportions z-test as JKP suggested.
http://i.imgur.com/LYI6ZRl.png
Behold, your results:
http://i.imgur.com/LgkBA8X.png
Thanks again, and best of luck to anyone else who runs across this.
-GCH
p.s. Sorry everyone, I was going to post the images inline but don't have enough reputation points yet. Images detailing the steps in the GUI can be found at the links above.

Machine learning predict text fields based on text fields

I have been working on machine learning and prediction for about a month. I have tried IBM Watson on Bluemix, Amazon Machine Learning, and PredictionIO. What I want to do is predict a text field based on other fields. My CSV file has four text fields named Question, Summary, Description, and Answer, and about 4500 lines/records. There are no numerical fields in the uploaded dataset. A typical record looks like this:
{'Question':'sys down','Summary':'does not boot after OS update','Description':'Desktop does not boot','Answer':'Switch to safemode and rollback last update'}
On IBM Watson, I found a question in their forums with a reply saying that custom corpus upload is not possible right now. Then I moved to Amazon Machine Learning. I followed their documentation and was able to implement prediction in a custom app using the API. I tested on the MovieLens data, where everything is numerical, and successfully uploaded the data and got movie recommendations with their python-boto library. When I tried uploading my own CSV file, the problem was that no text field could be selected as the target. I then added numerical values corresponding to each value in the CSV. This approach made prediction run successfully, but the accuracy was not right. Maybe the CSV had to be formatted in a better way.
A record from the MovieLens data is pasted below. It says that userID 196 gave movieID 242 a rating of 3 at time (Unix timestamp) 881250949.
196 242 3 881250949
Currently I am trying PredictionIO. A test on the MovieLens database ran successfully without issues, as described in the documentation, using the recommendation template. But it is still unclear whether a text field can be predicted from other text fields.
Does prediction run on numerical Fields only or a text field can be predicted based on other text fields?
No, prediction does not run only on numerical fields. It could be anything, including text. My guess is that the MovieLens data uses IDs instead of actual user and movie names because:
this saves storage space (the dataset has been around for a long time, and back then storage was definitely a concern), and
there is no need to know the actual user name (privacy concern)
For your case, you might want to look at the text classification template at https://docs.prediction.io/demo/textclassification/. You will need to model how you want each record to be classified.
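As a concrete, if toy, illustration that text can indeed be predicted from text, here is a pure-Python sketch of the idea behind such a classification template: concatenate the free-text fields into one bag of words and score candidate labels by word overlap. The records and labels below are invented for illustration; this is a stand-in for the idea, not the PredictionIO template itself:

```python
from collections import Counter

# Hypothetical support records: free-text fields plus a category to predict.
records = [
    ({'Question': 'sys down', 'Summary': 'does not boot after OS update'}, 'boot'),
    ({'Question': 'no sound', 'Summary': 'audio driver missing'}, 'audio'),
]

def words(record):
    """Concatenate all text fields and count lowercase tokens."""
    return Counter(' '.join(record.values()).lower().split())

# Build one bag-of-words profile per label from the training records.
profiles = {}
for record, label in records:
    profiles.setdefault(label, Counter()).update(words(record))

def predict(record):
    """Pick the label whose word profile overlaps the new record most."""
    bag = words(record)
    return max(profiles, key=lambda lbl: sum((profiles[lbl] & bag).values()))
```

With this sketch, a new record such as {'Question': 'machine will not boot'} scores highest against the 'boot' profile. Real templates replace the overlap score with a proper model (e.g. naive Bayes over TF-IDF features), but the shape of the problem is the same.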

how to normalize the records? e.g. several similar columns into rows

I am using Pentaho Kettle and thinking about how to normalize my flat-file (CSV) data, and eventually store it in a database.
csv structure: item name, store1 sold quantity, store2 sold quantity, store...
expected result: item name, store name, sold quantity
Any guidance is appreciated.
You can do this with the Row Normalizer step as long as the number of stores is fixed, or at least has a maximum. If it's variable, you'll have to use a JavaScript step or a UDJC. See the docs for how to use these steps:
PDI Transform Steps
If it's variable, I'd consider preprocessing the file before loading. I've done this with Python and it works pretty well.
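For the preprocessing route, the unpivot can be sketched in a few lines of standard-library Python. The column layout below is assumed from the question (one item-name column followed by one quantity column per store), and the data is made up:

```python
import csv
import io

# Hypothetical wide-format input matching the question's structure.
wide = """item name,store1,store2,store3
widget,5,3,7
gadget,2,0,4
"""

reader = csv.reader(io.StringIO(wide))
header = next(reader)      # first column is the item, the rest are stores
stores = header[1:]

# One output row per (item, store) pair: the normalized long format.
rows = [
    (item, store, qty)
    for item, *quantities in reader
    for store, qty in zip(stores, quantities)
]
```

Because the store columns are read from the header, this handles a variable number of stores without any changes, which is exactly the case where Row Normalizer alone falls short.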