Machine Learning to output a string based on previous learnt combinations - machine-learning

I'm wondering if it is possible to have a computer output a string based on previously learned inputs.
For example if I told the machine to learn the following:-
toLearn = [
{input: '12334', output: 'ASDW'},
{input: '12735', output: 'EDSW'},
{input: '23725', output: 'RTEF'},
{input: '75612', output: 'HTEG'},
etc..
]
I know there are 100,000 possible combinations to the above.
If I provided the machine with even 10% of that number, would the machine be able to tell me from what it has learnt that the following would be true?
{input: '29847', output: 'FYEW'}
Would it also be possible for the machine to provide the correct output just based on the input provided?
{input: '29847'}
// output: 'FYEW'
If I am barking up the wrong tree with machine learning how would something like this be possible to achieve?
All of the figures above are not true values they are just a representation of what I am trying to achieve. The real model would have around 250,000,000,000 possible combinations.

Machine learning can pick out and learn such patterns if they exist in your data. If this is more of a key-value pair mapping with few or no patterns between keys->values, then no, there is no point in using ML.

Related

How can I get predictions from these pretrained models?

I've been trying to generate human pose estimations, I came across many pretrained models (ex. Pose2Seg, deep-high-resolution-net ), however these models only include scripts for training and testing, this seems to be the norm in code written to implement models from research papers ,in deep-high-resolution-net I have tried to write a script to load the pretrained model and feed it my images, but the output I got was a bunch of tensors and I have no idea how to convert them to the .json annotations that I need.
total newbie here, sorry for my poor English in advance, ANY tips are appreciated.
I would include my script but its over 100 lines.
PS: is it polite to contact the authors and ask them if they can help?
because it seems a little distasteful.
Im not doing skeleton detection research, but your problem seems to be general.
(1) I dont think other people should teaching you from begining on how to load data and run their code from begining.
(2) For running other peoples code, just modify their test script which is provided e.g
https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/blob/master/tools/test.py
They already helps you loaded the model
model = eval('models.'+cfg.MODEL.NAME+'.get_pose_net')(
cfg, is_train=False
)
if cfg.TEST.MODEL_FILE:
logger.info('=> loading model from {}'.format(cfg.TEST.MODEL_FILE))
model.load_state_dict(torch.load(cfg.TEST.MODEL_FILE), strict=False)
else:
model_state_file = os.path.join(
final_output_dir, 'final_state.pth'
)
logger.info('=> loading model from {}'.format(model_state_file))
model.load_state_dict(torch.load(model_state_file))
model = torch.nn.DataParallel(model, device_ids=cfg.GPUS).cuda()
Just call
# evaluate on Variable x with testing data
y = model(x)
# access Variable's tensor, copy back to CPU, convert to numpy
arr = y.data.cpu().numpy()
# write CSV
np.savetxt('output.csv', arr)
You should be able to open it in excel
(3) "convert them to the .json annotations that I need".
That's the problem nobody can help. We don't know what format you want. For their format, it can be obtained either by their paper. Or looking at their training data by
X, y = torch.load('some_training_set_with_labels.pt')
By correlating the x and y. Then you should have a pretty good idea.

best algorithm to predict 3 similar blogs based on a blog props and contents only

{
"blogid": 11,
"blog_authorid": 2,
"blog_content": "(this is blog complete content: html encoded on base64 such as) PHNlY3Rpb24+PGRpdiBjbGFzcz0icm93Ij4KICAgICAgICA8ZGl2IGNsYXNzPSJjb2wtc20tMTIiIGRhdGEtdHlwZT0iY29udGFpbmVyLWNvbnRlbn",
"blog_timestamp": "2018-03-17 00:00:00",
"blog_title": "Amazon India Fashion Week: Autumn-",
"blog_subtitle": "",
"blog_featured_img_link": "link to image",
"blog_intropara": "Introductory para to article",
"blog_status": 1,
"blog_lastupdated": "\"Mar 19, 2018 7:42:23 AM\"",
"blog_type": "Blog",
"blog_tags": "1,4,6",
"blog_uri": "Amazon-India-Fashion-Week-Autumn",
"blog_categories": "1",
"blog_readtime": "5",
"ViewsCount": 0
}
Above is one sample blog as per my API. I have a JsonArray of such blogs.
I am trying to predict 3 similar blogs based on a blog's props(eg: tags,categories,author,keywords in title/subtitle) and contents. I have no user data i.e, there is no logged in user data(such as rating or review). I know that without user's data it will not be accurate but I'm just getting started with data science or ML. Any suggestion/link is appreciated. I prefer using java but python,php or any other lang also works for me. I need an easy to implement model as I am a beginner. Thanks in advance.
My intuition is that this question might not be at the right address.
BUT
I would do the following:
Create a dataset of sites that would be an inventory from which to predict. For each site you will need to list one or more features: Amount of tags, amount of posts, average time between posts in days, etc.
Sounds like this is for training and you are not worried about accuracy
too much, numeric features should suffice.
Work back from a k-NN algorithm. Don't worry about the classifiers. Instead of classifying a blog, you list the 3 closest neighbors (k = 3). A good implementation of the algorithm is here. Have fun simplifying it for your purposes.
Your algorithm should be a step or two shorter than k-NN which is considered to be among simpler ML, a good place to start.
Good luck.
EDIT:
You want to build a recommender engine using text, tags, numeric and maybe time series data. This is a broad request. Just like you, when faced with this request, I’d need to dive in the data and research best approach. Some approaches require different sets of data. E.g. Collaborative vs Content-based filtering.
Few things may’ve been missed on the user side that can be used like a sort of rating: You do not need a login feature get information: Cookie ID or IP based DMA, GEO and viewing duration should be available to the Web Server.
On the Blog side: you need to process the texts to identify related terms. Other blog features I gave examples above.
I am aware that this is a lot of hand-waving, but there’s no actual code question here. To reiterate my intuition is that this question might not be at the right address.
I really want to help but this is the best I can do.
EDIT 2:
If I understand your new comments correctly, each blog has the following for each other blog:
A Jaccard similarity coefficient.
A set of TF-IDF generated words with
scores.
A Euclidean distance based on numeric data.
I would create a heuristic from these and allow the process to adjust the importance of each statistic.
The challenge would be to quantify the words-scores TF-IDF output. You can treat those (over a certain score) as tags and run another similarity analysis, or count overlap.
You already started on this path, and this answer assumes you are to continue. IMO best path is to see which dedicated recommender engines can help you without constructing statistics piecemeal (numeric w/ Euclidean, tags w/ Jaccard, Text w/ TF-IDF).

Support Vector Machine bad results-Python

I'm studying SVM and implemented this code , it's too basic,primitive and taking too much time but I just wanted to see how it actually works.Unfortunately,it is giving me bad results.What did I miss? Some coding error or mathematical mistakes? If you want to look at dataset , it's link here. I taked it from UCI Machine Learning Repository. Thanks for your deal.
def hypo(x,q):
return 1/(1+np.exp(-x.dot(q)))
data=np.loadtxt('LSVTVoice',delimiter='\t');
x=np.ones(data.shape)
x[:,1:]=data[:,0:data.shape[1]-1]
y=data[:,data.shape[1]-1]
q=np.zeros(data.shape[1])
C=0.002
##mean normalization
for i in range(q.size-1):
x[:,i+1]=(x[:,i+1]-x[:,i+1].mean())/(x[:,i+1].max()-x[:,i+1].min());
for i in range(2000):
h=x.dot(q)
for j in range(q.size):
q[j]=q[j]-(C*np.sum( -y*np.log(hypo(x,q))-(1-y)*np.log(1-hypo(x,q))) ) + (0.5*np.sum(q**2))
for i in range(y.size):
if h[i]>=0:
print y[i],'1'
else:
print y[i],'0'
Depending on your data, it's very usual that Simple Implementation of SVM give you bad result. You must try advanced version on SVM implementation(e.g Sickit SVM) you can also check this: https://github.com/scikit-learn/scikit-learn/tree/master/sklearn/svm
SVM has types of implementation and parameters like different kernels(e.g rbf). You must check them and try them with different parameter(depending on your data) and compare results to each other.
You can use Grid Search approach for comparing(check this: http://scikit-learn.org/stable/modules/grid_search.html)

Multiple, Binomial Dependent Variables for GLM (or LME4) in R

everyone. I'm pretty new to R. I've been trying to educate myself about this issue, but I've continued to run into road blocks.
I have a data set with two categorical, independent variables (habitat (1,2,3) and site (1,2,3,4,5). My response variables are the presence or absence of AFLP loci. I have 96 loci, and I want to determine which, if any, of these loci are significantly associated with habitat (site is a random effect). Each of the loci can be assumed to be independent from the others.
As far as relevancy to other researchers, this should be a problem that people trying to analyze molecular data with GLM or LME will begin to run into more.
Here is my code:
##Independent variables
Site=AFLP$Site ##AFLP is my data file
Habitat=AFLP$Habitat
##Dependent variable
Loci=AFLP[,4:99]
##Establishing matrix of variables
mydata <- cbind(Site, Habitat, Loci)
##glm
model1 <- glm(Loci ~ (1|Site)+Habitat, data=mydata, family="binomial")
I get this error:
Error in model.frame.default(formula = Loci ~ (1 | Site) + Habitat, data = mydata, :
invalid type (list) for variable 'Loci'
I know this error is associated with the data type of Loci; however, I've tried a bunch of things and still can't figure out how to correctly address the issue.
My problem seems to be similar to the ones in the below links, but again, I haven't been able to figure out how to apply this information to my data set.
http://stackoverflow.com/questions/18067519/using-r-to-do-a-regression-with-multiple-dependent-and-multiple-independent-vari
https://stats.stackexchange.com/questions/26585/how-to-do-a-generalized-linear-model-with-multiple-dependent-variables-in-r
Thank you in advance. If this turns out to have a simple answer, I apologize for taking up space. I have been Googling and trying to educate myself, and I haven't made any head-way.

failure to get p-values for lmer using lmerTest

I have run the following model using lmerTest and using lme4:
model2 = lmer(log(RT)~Group*A*B*C+(1|item)+(1+A+B+C|subject),data=dt)
Using lmerTest I get the following error when typing the summary() command:
> summary(model1)
Error in `colnames<-`(`*tmp*`, value = c("Estimate", "Std. Error", "df", :
length of 'dimnames' [2] not equal to array extent
I saw this has already been an issue for other users and that one user was able to bypass the issue running lsmeans().
When I tried lsmeans, I got the error:
Error in asMethod(object) : not a positive definite matrix.
I did not see any NAs when looking into the covariance matrix.
Note that I am able to run this model if I simply inverse the contrasts in the Group factor.
I have difficulties understanding why this is the case.
When I run the same model using lme4 and not lmerTest, I am able to get all the outputs of summary() but no p-values (as expected). pvals.fnc is discontinued in lme4 and I have not found an alternative yet. Plus it would be nice to have the p-values estimated in the same way for model2 as for the other models for which I was successfully able to use lmerTest.
Does anyone know what I should do at this point? Any help would be much appreciated!
If A or B or C are factors then you might get errors - such models are not yet supported by the lmerTest package (we will put the warning message together with the restrictions for such models in the help page)

Resources