Custom link function in h2o.glm

I looked for a generalized linear model implementation with regularization. I found that glmnet does not allow a custom link function. However, h2o takes the link function type as a parameter. Is it possible to define and use a custom link function under a given family (where the optimization problem is the same) in h2o?

The answer depends on how badly you need it...
No: you have to use one of the built-in link functions, and cannot pass your own through the API.
Yes: it is open source, so you could hack your custom link function into the Java back-end, give it a name, recompile, and then it should be available from the API.
I poked around in the Java source a bit, hoping to be able to point you to a single place in a single file where you could add your code. But it looks more complicated than that; I think it is a non-trivial project. (If you have more budget than time, becoming a paying h2o.ai customer and making a feature request is likely to be the quickest way to get it done!)
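For reference, a minimal sketch of how one of the built-in links is selected, shown here via the Python API (R's h2o.glm takes the same family/link arguments); the file name, columns, and regularization value are hypothetical:

    import h2o
    from h2o.estimators import H2OGeneralizedLinearEstimator

    h2o.init()
    train = h2o.import_file("train.csv")  # hypothetical dataset

    # link must be one of the built-ins for the chosen family
    model = H2OGeneralizedLinearEstimator(family="gamma", link="log", lambda_=0.1)
    model.train(x=["x1", "x2"], y="y", training_frame=train)
    print(model.coef())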

Related

Questions about application service design

I come from a mostly n-tier background, and I'm trying to move more towards a DDD architecture. I'm trying to find best practices for designing the application service, and after a few searches, am still left with a few questions. Granted, I know I can't be the first person to ask these questions, so if you know where these are answered, just point me the way and I'll happily close this.
Here are my main questions:
How "open" should your signatures be? For example, is it better to be more rigid with your signatures and use simple types as parameters when possible, or is it better to use objects (messages?) that can later be modified without breaking the signature?
If you want to expose variations of a signature, for example, a UserSearch method that returns a list of users based on various (and sometimes optional) search criteria, is it better to:
A. Use overloads
B. Use optional (or nullable) parameters
C. Break each scenario into its own unique method
D. Use messages
I know that some of these answers are subjective, and also depend on what all will be calling your application service. But I'm just trying to get a general direction of things to consider and other best practices at this point.
Thanks in advance.
Good questions. Thinking about the API is obviously important.
1) How open they should be would, for me, depend on who the consumers are. If this application service is only being used within the context of your own solution and/or team, then I think it's fine to have specific messages (or rather their interfaces) or DTOs (data transfer objects). Though if it's just as easy, keeping to simple types is best in my book, and definitely better if the service is being consumed by others. If those don't suffice, then use interfaced messages that expose just enough. Again, if you are going to be distributed to different platforms, then simple messages of simple types are not a bad way to go.
2) Why not have a SearchCriteria object as a parameter? It could be a SearchCriteria message of simple types if you are looking at this as the start of a messaging bus.
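A minimal sketch of that idea, in Python purely for illustration (all names are hypothetical): one criteria message of simple types, with optional fields, instead of a pile of overloads:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class UserSearchCriteria:
        # all fields optional, so callers set only what they need
        name: Optional[str] = None
        email: Optional[str] = None
        active_only: bool = False

    def search_users(criteria: UserSearchCriteria) -> List[str]:
        # hypothetical repository call; real code would translate the
        # set fields into query filters
        return []

    search_users(UserSearchCriteria(name="smith", active_only=True))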
As you say, your question is a little open but I'd be interested to hear more as it sounds like you are asking the right questions at least.
Jerad, those are tough questions to answer generally, as you noted.
My personal preference is to use primitives in method signatures where possible. If I need to pass 3+ primitives to a method, I define custom data transfer objects.
The thinking being: if multiple values are being passed together, it's likely they represent a concept in your problem space, and thus should become an object. For example, if you are passing X and Y coordinates to a method, I'd recommend creating a Point class or struct that represents that concept.
The only time I'd end up with variations on a signature would be to provide convenience methods that supply default values for method parameters. To continue the above example, a Draw method might not require a Point, in which case I'd use (0,0), as sketched below.
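A minimal sketch of that advice, in Python purely for illustration: the related primitives become a small value object, and a default argument plays the role of the convenience overload:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Point:
        x: int
        y: int

    ORIGIN = Point(0, 0)

    def draw(shape: str, at: Point = ORIGIN) -> None:
        # the default argument stands in for a convenience overload
        print(f"drawing {shape} at ({at.x}, {at.y})")

    draw("circle")               # falls back to (0, 0)
    draw("circle", Point(4, 2))  # explicit position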
So, I'd answer #1 with "not very open" and #2 with A.
I hope that helps.

Things you look for when trying to understand new software

I wonder what sort of things you look for when you start working on an existing, but new to you, system? Let's say that the system is quite big (whatever it means to you).
Some of the things that were identified are:
Where is a particular subroutine or procedure invoked?
What are the arguments, results and predicates of a particular function?
How does the flow of control reach a particular location?
Where is a particular variable set, used or queried?
Where is a particular variable declared?
Where is a particular data object accessed, i.e. created, read, updated or deleted?
What are the inputs and outputs of a particular module?
But if you look for something more specific, or if any of the above questions is particularly important to you, please share it with us :)
I'm particularly interested in information that could be extracted through dynamic analysis/execution.
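On the dynamic-analysis angle, a couple of the questions above (where a subroutine is invoked, and with what arguments and results) can be captured with a simple runtime tracer. A minimal Python sketch using sys.settrace, purely as an illustration:

    import sys

    def tracer(frame, event, arg):
        # 'call' fires on function entry, 'return' on exit
        if event == "call":
            code = frame.f_code
            args = {name: frame.f_locals[name]
                    for name in code.co_varnames[:code.co_argcount]}
            print(f"call {code.co_name}{args} in {code.co_filename}")
        elif event == "return":
            print(f"return {frame.f_code.co_name} -> {arg!r}")
        return tracer  # keep tracing nested calls and returns

    def add(x, y):
        return x + y

    sys.settrace(tracer)
    add(2, 3)
    sys.settrace(None)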
I like to use a "use case" approach:
First, I ask myself "what's this software's purpose?": I try to identify how users are going to interact with the application;
Once I have some "use cases", I try to understand which objects are most involved and how they interact with other objects.
Once I've done this, I draw a UML-type diagram that describes what I've just learned, for further reference. What happens afterwards depends on the task I've been assigned, i.e. modify the code, document the code, etc.
There is also the question of what motivation I have for learning the new system:
Bug fix/minor enhancement - In this case, I may focus solely on the portion of the system that performs the specific function that needs to be altered. This is a way to break down a huge system, and also a way to identify whether the issue is something I can fix or something that I have to hand to the off-the-shelf company whose software we are using; e.g., a CRM, CMS, or ERP system can be a customized off-the-shelf system, so there are many pieces to it.
Project work - This is the other case, and is where I'd probably try to build myself a view from 30,000 feet or so, to know what the high-level components are and which areas of the system the project impacts. An example of this is where I join a company and work off an existing code base, without the luxury of the small focus of the previous case. Part of that view is to look for any patterns in the code in terms of naming conventions, project structure, etc., as this may be useful once I start changing code in the system. I'd probably do some tracing through the system and try to see where the uglier parts of the code are. By uglier I mean those parts that are kludge-like and may contain some spaghetti code, having been rushed when first written and now being reworked heavily.
To my mind, another way to view this is to ask whether I'm going to spend days or weeks wrapping my head around the system, as in the second case, or whether it should, optimistically, take only a few hours to get my footing and make the necessary changes.

Fluent mapping verification for Entity Framework 4

Note: This is a follow-up question for this previous question of mine.
Inspired by this blog post, I'm trying to construct a fluent way to test my EF4 Code-Only mappings. However, I'm stuck almost instantly...
To be able to implement this, I also need to implement the CheckProperty method, and I'm quite unsure how to save the parameters in the PersistenceSpecification class, and how to use them in VerifyTheMappings.
Also, I'd like to write tests for this class, but I'm not at all sure on how to accomplish that. What do I test? And how?
Any help is appreciated.
Update: I've taken a look at the implementation in Fluent NHibernate's source code, and it seems like it would be quite easy to just take the source and adapt it to Entity Framework. However, I can't find anything about modifying and using parts of the source in the BSD license. Would copy-pasting their code into my project, and changing whatever I want to suit my needs, be legal for non-commercial private or open source projects? Would it be for commercial projects?
I was going to suggest looking at how FluentNH does this, until I got to your update. Anyway, you're already investigating that approach.
As to the portion of your question regarding the BSD license, I'd say the relevant part of the license is this: Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: [conditions follow].
From my reading of that line, you can modify (which would include the removal of any code not relevant to your use cases) the code however you wish, and redistribute it as long as you meet the author's conditions.
Since there are no qualifications on how you may use or redistribute the code or binaries, then you are free to do that however you wish, for any and all applications.
Here and here are descriptions of the license in layman's terms.
I always write a simple set of integration tests for each entity. The tests persist, select, update, and delete the entity. I think there is no better or easier way to test your mapping and other features of the model (like cascade deletes).
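A minimal sketch of that CRUD round-trip, shown with Python/SQLAlchemy purely to illustrate the pattern (the question itself is about EF4, and the entity here is hypothetical):

    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.orm import declarative_base, Session

    Base = declarative_base()

    class User(Base):  # hypothetical entity
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        name = Column(String)

    engine = create_engine("sqlite://")  # throwaway in-memory database
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        user = User(name="alice")
        session.add(user)
        session.commit()                      # persist
        loaded = session.get(User, user.id)   # select
        assert loaded.name == "alice"
        loaded.name = "bob"
        session.commit()                      # update
        session.delete(loaded)
        session.commit()                      # delete
        assert session.get(User, user.id) is None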

Intelligently extracting tags from blogs and other web pages

I'm not talking about HTML tags, but tags used to describe blog posts, or youtube videos or questions on this site.
If I were crawling just a single website, I'd just use an XPath to extract the tags, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.
I can imagine using some simple heuristics, like finding all HTML elements with id or class of 'tag', etc. However, this is pretty brittle and will probably fail for a huge number of web pages. What approach do you guys recommend for this problem?
Also, I'm aware of Zemanta and Open Calais, which both have ways to guess the tags for a piece of text, but that's not really the same as extracting tags real humans have already chosen. But I would still love to hear about any other services/APIs to guess the tags in a document.
EDIT: Just to be clear, a solution that already works for this would be great. But I'm guessing there's no open-source software that already does this, so I really just want to hear from people about possible approaches that could work for most cases. It need not be perfect.
EDIT2: For people suggesting a general solution that usually works is impossible, and that I must write custom scrapers for each website/engine, consider the arc90 readability tool. This tool is able to extract the article text for any given article on the web with surprising accuracy, using some sort of heuristic algorithm I believe. I have yet to dig into their approach, but it fits into a bookmarklet and does not seem too involved. I understand that extracting an article is probably simpler than extracting tags, but it should serve as an example of what's possible.
Systems like the arc90 example you give work by looking at things like tag/text ratios and other heuristics. There is sufficient difference between the text content of the pages and the surrounding ads/menus etc. Other examples include tools that scrape emails or addresses; there, there are patterns that can be detected and locations that can be recognized. In the case of tags, though, you don't have much to help you uniquely distinguish a tag from normal text: it's just a word or phrase like any other piece of text. A list of tags in a sidebar is very hard to distinguish from a navigation menu.
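For what it's worth, the tag/text-ratio heuristic looks roughly like this; a minimal Python sketch with BeautifulSoup, where the candidate elements and scoring are my own assumptions, not arc90's actual algorithm:

    from bs4 import BeautifulSoup

    def text_density(node):
        # plain text length relative to total markup length
        text = node.get_text(strip=True)
        markup = len(str(node))
        return len(text) / markup if markup else 0.0

    html = open("page.html").read()  # hypothetical saved page
    soup = BeautifulSoup(html, "html.parser")
    candidates = soup.find_all(["div", "article", "section"])
    best = max(candidates, key=text_density, default=None)
    if best:
        print(best.name, best.get("id"), round(text_density(best), 2))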
Some blogs, like tumblr, do have tags whose URLs contain the word "tagged", which you could use. WordPress similarly has ".../tag/..." type URLs for tags. Solutions like this would work for a large number of blogs independent of their individual page layout, but they won't work everywhere.
If the sources expose their data as a feed (RSS/Atom) then you may be able to get the tags (or labels/categories/topics etc.) from this structured data.
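A minimal sketch of the feed route, using the Python feedparser package (the feed URL is hypothetical):

    import feedparser

    feed = feedparser.parse("https://example.com/feed/")  # hypothetical feed
    for entry in feed.entries:
        # feeds expose tags/categories as structured data
        tags = [t.term for t in entry.get("tags", [])]
        print(entry.get("title"), tags)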
Another option is to parse each web page and look for tags formatted according to the rel=tag microformat.
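Combining the rel="tag" microformat with the /tag/-style URL patterns mentioned above, a minimal Python sketch with requests and BeautifulSoup (the URL and the pattern list are assumptions, not exhaustive):

    import re
    import requests
    from bs4 import BeautifulSoup

    TAG_URL = re.compile(r"/(tag|tagged|label)s?/", re.IGNORECASE)

    def extract_tags(url):
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        # anchors marked with the rel="tag" microformat
        tags = {a.get_text(strip=True) for a in soup.find_all("a", rel="tag")}
        # anchors whose href matches a /tag/-style pattern
        tags |= {a.get_text(strip=True) for a in soup.find_all("a", href=TAG_URL)}
        return sorted(t for t in tags if t)

    print(extract_tags("https://example.com/some-post"))  # hypothetical page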
Damn, I was just going to suggest Open Calais. There's going to be no "great" way to do this. If you have some target platforms in mind, you could sniff for WordPress, then check its link structure, and again for Flickr...
I think your only option is to write custom scripts for each site. To make things easier, though, you could look at AlchemyAPI. They have similar entity-extraction capabilities to OpenCalais, but they also have a "Structured Content Scraping" product which makes identifying pieces of a web page a lot easier than writing XPaths, by using simple visual constraints.
This is impossible in general because there isn't a well-known, followed specification. Even different versions of the same engine can create different outputs - hey, using WordPress a user can create his own markup.
If you're really interested in doing something like this, you should know it's going to be a really time-consuming and ongoing project: you're going to create a lib that detects which "engine" is being used on a page and parses it. If you can't detect a page for some reason, you create new rules to parse it and move on.
I know this isn't the answer you're looking for, but I really can't see another option. I'm into Python, so I would use Scrapy for this, since it's a complete framework for scraping: well documented and really extensible.
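A minimal sketch of what such a per-site Scrapy spider could look like (the start URL and the selector are assumptions for a WordPress-style site):

    import scrapy

    class TagSpider(scrapy.Spider):
        name = "tags"
        start_urls = ["https://example.com/some-post"]  # hypothetical

        def parse(self, response):
            # selector is an assumption for a WordPress-style theme
            yield {
                "url": response.url,
                "tags": response.css('a[rel="tag"]::text').getall(),
            }

Run it with something like: scrapy runspider tag_spider.py -o tags.json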
Try making a Yahoo Pipe and running the source pages through the Term Extractor module. It may or may not give great results, but it's worth a try. Note - enable the V2 engine.
Looking at arc90, it seems they are also asking publishers to use semantically meaningful mark-up [see https://www.readability.com/publishers/guidelines/#view-exampleGuidelines] so they can parse it rather easily. But presumably they must either have developed generic rules, such as the tag/text ratios #dunelmtech suggested, which can work for article detection, or they might be using a combination of text-segmentation algorithms (from the Natural Language Processing field) such as TextTiling and C99, which could be quite useful for article detection - see http://morphadorner.northwestern.edu/morphadorner/textsegmenter/ and google for more info on both [published in the academic literature - google scholar].
It seems, however, that detecting "tags" as you require is a difficult problem (for the reasons already mentioned in the comments above). One approach I would try would be to use one of the text-segmentation algorithms (C99 or TextTiling) to detect the article start/end, and then look for DIVs / SPANs / ULs with CLASS & ID attributes containing ..tag.. in them; since, in terms of page layout, tags tend to sit just underneath the article and above the comment feed, this might work surprisingly well.
Anyway, would be interesting to see whether you got somewhere with the tag detection.
Martin
EDIT: I just found something that might really be helpful. The algorithm is called VIPS [see: http://www.zjucadcg.cn/dengcai/VIPS/VIPS.html], which stands for Vision-based Page Segmentation. It is based on the idea that page content can be visually split into sections. Compared with DOM-based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information, such as navigation, advertisements, and decoration, can easily be removed because it is often placed in certain positions on a page. This could help you detect the tag block quite accurately!
There is a term extractor module for Drupal (http://drupal.org/project/extractor), but it's only for Drupal 6.

Creating design document from existing java code

I have existing Java code and need to create a design document based on it.
For starters, even getting all functions with their input/output parameters would help the overall process.
Note: there is no commented documentation on any procedures, functions, or classes.
Last but not least, let me know of any good tool that will reduce the time required for this phase, as currently we write up every flow and related material by hand.
What you want is just too much. Quoting Linus Torvalds: “Good code is its own best documentation.” Anyway, I digress.
You might want to look into UML tools which generate class/sequence diagrams from the code. There are many of them, but only a handful support reverse engineering (into and from the class diagram), and an even smaller subset supports the same to/from the sequence diagram. I only know that MagicDraw can do this, but I am biased, as I used to work for the manufacturer of this tool, so do your shopping around first.
Use Javadoc: http://www.oracle.com/technetwork/java/javase/documentation/index-137868.html
or introspection via the Reflection API: http://docs.oracle.com/javase/tutorial/reflect/class/classMembers.html
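Or, as a different route from the two above, the signatures can be pulled from the Java source statically; a minimal sketch using the third-party Python javalang parser (the file name is hypothetical, and array types and generics would need extra handling):

    import javalang  # third-party parser: pip install javalang

    source = open("MyClass.java").read()  # hypothetical file
    tree = javalang.parse.parse(source)
    for path, method in tree.filter(javalang.tree.MethodDeclaration):
        params = ", ".join(f"{p.type.name} {p.name}" for p in method.parameters)
        ret = method.return_type.name if method.return_type else "void"
        print(f"{ret} {method.name}({params})")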
