Question about the "Getting Started with BigQuery ML" excercise - machine-learning

Did somebody work on this exercise ? It is a way to estimate transaction in JUNE 2017.
https://codelabs.developers.google.com/codelabs/bqml-intro/index.html#0
When we compare orders JUNE 2017 prediction (step6) versus reality (meaning what we have in BQ), the difference is quite important (more than 50%) - do you know why?

Comparing a single datapoint may not be the best way to evaluate the model performance.
Please start with reviewing the output of ML.Evaluate to understand how the model is behaving overall.

As mentioned by #girishkumar, comparing a single datapoint may not be the best way to evaluate the model performance; at the same time, please consider that the purpose of the codelab is to show how BigQuery ML works.
However, I replicated the lab and observed that if you remove the limit of the SQL statement in the step 4 you got better results in the model evaluation step.
By last, if you are interested in learning how to use this tool, you can try the following tutorials.

Related

Robust Standard Errors in spatial error models

I am fitting a Spatial Error Model using the errorsarlm() function in the spdep library.
The Breusch-Pagan test for spatial models, calculated using the bptest.sarlm() function, suggest the presence of heteroskedasticity.
A natural next step would be to get the robust standard error estimates and update the p-values. In the documentation of the bptest.sarlm() function says the following:
"It is also technically possible to make heteroskedasticity corrections to standard error estimates by using the “lm.target” component of sarlm objects - using functions in the lmtest and sandwich packages."
and the following code (as reference) is presented:
lm.target <- lm(error.col$tary ~ error.col$tarX - 1)
if (require(lmtest) && require(sandwich)) {
print(coeftest(lm.target, vcov=vcovHC(lm.target, type="HC0"), df=Inf))}
where error.col is the spatial error model estimated.
Now, I can easily adapt the code to my problem and get the robust standard errors.
Nevertheless, I was wondering:
What exactly is the “lm.target” component of sarlm objects? I can not find any mention to it in the spdep documentation.
What exactly are $tary and $tarX? Again, it does not seem to be mentioned on the documentation.
Why documentation says it is "technically possible to make heteroskedasticity corrections"? Does it mean that proposed approach is not really recommended to overcome issues of heteroskedasticity?
I report this issue on github and had a response by Roger Bivand:
No, the approach is not recommended at all. Either use sphet or a Bayesian approach giving the marginal posterior distribution. I'll drop the confusing documentation. tary is $y - \rho W y$ and similarly for tarX in the spatial error model case. Note that tary etc. only occur in spdep in documentation for localmoran.exact() and localmoran.sad(); were you using out of date package versions?

PPO Update Schedule in OpenAi Baselines Implementations

I'm trying to read through the PPO1 code in OpenAi's Baselines implementation of RL algorithms (https://github.com/openai/baselines) to gain a better understanding as to how PPO works, how one might go about implementing it, etc.
I'm confused as to the difference between the "optim_batchsize" and the "timesteps_per_actorbatch" arguments that are fed into the "learn()" function. What are these hyper-parameters?
In addition, I see in the "run_atari.py" file, the "make_atari" and "wrap_deepmind" functions are used to wrap the environment. In the "make_atari" function, it uses the "EpisodicLifeEnv", which ends the episode once the a life is lost. On average, I see that the episode length in the beginning of training is about 7 - 8 timesteps, but the batch size is 256, so I don't see how any updates can occur. Thanks in advance for your help.
I've been going through it on my own as well....their code is a nightmare!
optim_batchsize is the batch size used for optimizing the policy, timesteps_per_actorbatch is the number of time steps the agent runs before optimizing.
On the episodic thing, I am not sure. Two ways it could happen, one is waiting until the 256 entries are filled before actually updating, or the other one is filling the batch with dummy data that does nothing, effectively only updating the 7 or 8 steps that the episode lasted.

best algorithm to predict 3 similar blogs based on a blog props and contents only

{
"blogid": 11,
"blog_authorid": 2,
"blog_content": "(this is blog complete content: html encoded on base64 such as) PHNlY3Rpb24+PGRpdiBjbGFzcz0icm93Ij4KICAgICAgICA8ZGl2IGNsYXNzPSJjb2wtc20tMTIiIGRhdGEtdHlwZT0iY29udGFpbmVyLWNvbnRlbn",
"blog_timestamp": "2018-03-17 00:00:00",
"blog_title": "Amazon India Fashion Week: Autumn-",
"blog_subtitle": "",
"blog_featured_img_link": "link to image",
"blog_intropara": "Introductory para to article",
"blog_status": 1,
"blog_lastupdated": "\"Mar 19, 2018 7:42:23 AM\"",
"blog_type": "Blog",
"blog_tags": "1,4,6",
"blog_uri": "Amazon-India-Fashion-Week-Autumn",
"blog_categories": "1",
"blog_readtime": "5",
"ViewsCount": 0
}
Above is one sample blog as per my API. I have a JsonArray of such blogs.
I am trying to predict 3 similar blogs based on a blog's props(eg: tags,categories,author,keywords in title/subtitle) and contents. I have no user data i.e, there is no logged in user data(such as rating or review). I know that without user's data it will not be accurate but I'm just getting started with data science or ML. Any suggestion/link is appreciated. I prefer using java but python,php or any other lang also works for me. I need an easy to implement model as I am a beginner. Thanks in advance.
My intuition is that this question might not be at the right address.
BUT
I would do the following:
Create a dataset of sites that would be an inventory from which to predict. For each site you will need to list one or more features: Amount of tags, amount of posts, average time between posts in days, etc.
Sounds like this is for training and you are not worried about accuracy
too much, numeric features should suffice.
Work back from a k-NN algorithm. Don't worry about the classifiers. Instead of classifying a blog, you list the 3 closest neighbors (k = 3). A good implementation of the algorithm is here. Have fun simplifying it for your purposes.
Your algorithm should be a step or two shorter than k-NN which is considered to be among simpler ML, a good place to start.
Good luck.
EDIT:
You want to build a recommender engine using text, tags, numeric and maybe time series data. This is a broad request. Just like you, when faced with this request, I’d need to dive in the data and research best approach. Some approaches require different sets of data. E.g. Collaborative vs Content-based filtering.
Few things may’ve been missed on the user side that can be used like a sort of rating: You do not need a login feature get information: Cookie ID or IP based DMA, GEO and viewing duration should be available to the Web Server.
On the Blog side: you need to process the texts to identify related terms. Other blog features I gave examples above.
I am aware that this is a lot of hand-waving, but there’s no actual code question here. To reiterate my intuition is that this question might not be at the right address.
I really want to help but this is the best I can do.
EDIT 2:
If I understand your new comments correctly, each blog has the following for each other blog:
A Jaccard similarity coefficient.
A set of TF-IDF generated words with
scores.
A Euclidean distance based on numeric data.
I would create a heuristic from these and allow the process to adjust the importance of each statistic.
The challenge would be to quantify the words-scores TF-IDF output. You can treat those (over a certain score) as tags and run another similarity analysis, or count overlap.
You already started on this path, and this answer assumes you are to continue. IMO best path is to see which dedicated recommender engines can help you without constructing statistics piecemeal (numeric w/ Euclidean, tags w/ Jaccard, Text w/ TF-IDF).

In Weka make a arff file from text file

In naive byes classifier i want to find out the accuracy from my train and test. But my train set is like
Happy: absolution abundance abundant accolade accompaniment accomplish accomplished achieve achievement acrobat admirable admiration adorable adoration adore advance advent advocacy aesthetics affection affluence alive allure aloha
Sad: abandon abandoned abandonment abduction abortion abortive abscess absence absent absentee abuse abysmal abyss accident accursed ache aching adder adrift adultery adverse adversity afflict affliction affront aftermath aggravating
Angry: abandoned abandonment abhor abhorrent abolish abomination abuse accursed accusation accused accuser accusing actionable adder adversary adverse adversity advocacy affront aftermath aggravated aggravating aggravation aggression aggressive aggressor agitated agitation agony alcoholism alienate alienation
For test set
data: Dec 7, 2014 ... This well-known nursery rhyme helps children practice emotions, like happy, sad, scared, tired and angry. If You're Happy and You Know It is ...
Now the problem is how do i convert them into arff file
Your training set is not appropriate for training a model for Weka however these information can be used in feature extraction.
Your Test set can be converted into an arff file. From every message extract these basics features like
1. Any form of the word 'Happy' is present or not
2. Any form of the word 'Sad' is present or not
3. Any form of the word 'Angry' is present or not
4. TF-IDF
etc.
then for some messages (say 70%) you should assign one class {Happy, Sad, Angry} manually and for remaining 30% you can test through your model.
More about arff file is given here:
http://www.cs.waikato.ac.nz/ml/weka/arff.html
Where to start ;).
As written before your "training data" is not real training data. Training data should be texts similar to the data you are using for Testing. However, in your example it is merely a list of words. My gut feeling is that you would be better of to avoid using weka, count the number of occurrence in each category, and the take the one with most matches.
In case you want use Weka I'd recommend to use the toolbox https://www.knime.org which nicely integrates with weka.
You should then convert your data into a bag of words representation. This is basically you have the number of times each word occurs in each of the texts as features.
Also for this Knime has nice package. http://www.tech.knime.org/files/KNIME-TextProcessing-HowTo.pdf

Does it help to duplicate original data in order to make more data for building model?

I just got an interview question.
"Assume you want to build a statistical or machine learning model, but you have very limited data on hand. Your boss told you can duplicate original data several times, to make more data for building the model" Does it help?
Intuitively, it does not help, because duplicating original data doesn't create more "information" to feed the model.
But is there anyone can explain it more statistically? Thanks
Consider e.g. variance. The data set with the duplicated data will have the exact same variance - you don't have a more precise estimate of the distrbution afterwards.
There are, however, some exceptions. For example bootstrap validation helps when evaluating your model, but you have very little data.
Well, it depends on exactly what one means by "duplicating the data".
If one is exactly duplicating the whole data set a number of times, then methods based on maximum likelihood (as with many models in common use) must find exactly the same result since the log likelihood function of the duplicated data is exactly a multiple of the unduplicated data's log likelihood, and therefore has the same maxima. (This argument doesn't apply to methods which aren't based on the likelihood function; I believe that CART and other tree models, and SVM's, are such models. In that case you'll have to work out a different argument.)
However, if by duplicating, one means duplicating the positive examples in a classification problem (which is common enough, since there are often many more negative examples than positive), then that does make a difference, since the likelihood function is modified.
Also if one means bootstrapping, then that, too, makes a difference.
PS. Probably you'll get more interest in this question on stats.stackexchange.com.

Resources