We are using JasperReports to generate reports from our application, which is backed by an Oracle DBMS.
This works fine, but it is likely we're going to need different paper formats, languages and orientations for the same document, or to add columns and other elements, or have the elements' contents change size.
Doing this in iReport/Jasper isn't easy AFAIU.
If something doesn't work you have to move or resize elements by hand, checking that they're of appropriate size and position.
When I was a student I used LaTeX for typesetting, and it handled this kind of "reshaping" well. Isn't there something like that?
I heard BIRT doesn't follow the "pixel position" paradigm of Jasper and Pentaho, and as such it strives to handle positioning and, possibly, sizing on its own, after the user has specified the document's abstract structure, i.e. which elements are there and their relative positions.
EDIT
Forgot to mention: we are looking for a solution that involves as little code as possible. The reasons are manifold, but the most important are:
first: to avoid learning another library (we managed to stay away from Jasper's and liked it);
second: to provide a tool that even people who aren't programmers, or at least not hardcore ones, can manage.
The lower the entry barrier the better.
For example, I know people in the humanities who can pick up LaTeX decently. They could even digest iReport. I don't know of anyone who could do the same with real-world Java.
I have just been introduced to the Lua language, and I am embarking on my first project. The biggest challenge I am facing now is how to implement Undo and Redo.
To make things clear, the project is a custom text editor, so the Undo/Redo here has to cover any edit to the input text. I have managed to handle features like Cut, Copy, Clear and Find Word, as well as changing fonts and text colour and inserting tables and images, among others, all in Lua. Obviously there are already several custom text editors around, and I believe the effort to cover this will pave the way for future advancements or improvements. But the Undo/Redo actions are tearing me apart, and from my research they are what most of the existing custom text editors lack.
I have searched several forums, and they all seem to give the same tip: use an associative table to store the information and retrieve it from there. Frankly, I think some of these sites are just passing on knowledge acquired from other sites without any technical viewpoint of their own, because most of the suggestions I come across look alike in every respect. Of the dozens of sites I have visited, not one has a user who has posted an example; all I see is the same complaint from the majority of Lua users. No doubt this will seem fairly easy to some of the respected gurus on this forum.
I don't seem to get the true picture of the suggestions.
Can someone provide me with an example?
Undo/redo is a perfect fit for the command pattern.
First you need to write the text manipulation functionality per se - just the "do" part, without worrying about un- or re-. That will be a lot of work in itself.
You will then have a bunch of functions to manipulate your document: things like insertText(), setFont(), insertJpgImage() and such. The trick is that you now need to wrap each of these functions in a so-called command object. Each command object must have a method to carry itself out and a method to undo() itself; since do is a reserved word in Lua, the "do" method is called execute() below.
Now that all your text manipulation operations are represented by command objects, you execute each operation (e.g. bold some text) by something like:
boldCommand = setTextPropertyCommand:new(document, selectedArea, textProperties.bold)
boldCommand:execute() -- actually modify the text (named execute() because "do" is reserved in Lua)
table.insert(commandUndoStack, boldCommand) -- keep the command for possible undoing later
When you want to undo the bolding of some text you can then call:
command = table.remove(commandUndoStack) -- pop the most recent command
command:undo() -- revert its effect
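For completeness, here is a minimal sketch of what such a command object and the undo/redo stacks might look like in Lua. The document model (a table of runs, each with a properties table) is invented purely for illustration; only the overall shape of execute()/undo() matters:
-- A minimal command object. The document model used here (a table of
-- runs, each with a .properties table) is made up for illustration.
setTextPropertyCommand = {}
setTextPropertyCommand.__index = setTextPropertyCommand

function setTextPropertyCommand:new(document, area, property)
  local cmd = setmetatable({}, self)
  cmd.document = document
  cmd.area = area          -- e.g. { from = 3, to = 7 }
  cmd.property = property  -- e.g. "bold"
  cmd.previous = nil       -- filled in by execute(), read back by undo()
  return cmd
end

function setTextPropertyCommand:execute()
  self.previous = {}
  for i = self.area.from, self.area.to do
    self.previous[i] = self.document[i].properties[self.property]
    self.document[i].properties[self.property] = true
  end
end

function setTextPropertyCommand:undo()
  for i = self.area.from, self.area.to do
    self.document[i].properties[self.property] = self.previous[i]
  end
end

commandUndoStack = {}
commandRedoStack = {}

function undo()
  local cmd = table.remove(commandUndoStack)   -- pop the latest command
  if cmd then
    cmd:undo()
    table.insert(commandRedoStack, cmd)        -- make it available for redo
  end
end

function redo()
  local cmd = table.remove(commandRedoStack)
  if cmd then
    cmd:execute()
    table.insert(commandUndoStack, cmd)
  end
end
-- Note: a brand-new edit should normally clear commandRedoStack.
Every editing operation then goes through a command object like this, and undo/redo fall out of the two stacks.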
NB: if you are using some GUI framework binding in Lua, it may be the case that the framework has its own ready-made undo/redo functionality. For example Qt (with the qtlua bindings) offers the QUndoStack class.
This question has already been asked at "htc files: Why not to use them?", but the answer there didn't really answer anything.
The question is: why is something like CSS3 PIE not in use on many sites? I'd expect smaller sites not to know about it, but the one that caught my eye was Twitter, which doesn't use it.
Is it because it's not standard? Or does it cause a noticeable slow-down of the site?
Thank you for any responses.
I can't speak for everyone, but my sense is that you don't see tools like these in use on large sites because:
1) They do incur a certain performance cost. CSS3 PIE in particular starts to create a noticeable rendering delay once it's used on about two dozen elements (in my experience; YMMV). For that reason its use on large pages might cause a larger rendering delay than the time it saves on downloading image assets.
2) They start to show bugs with complex DOM changes. Lots of animation, showing/hiding, etc. can sometimes cause PIE to get out of sync.
3) Related to #2, the added layer of abstraction (and its associated bugs) can become a detriment on large development teams with a complex codebase. If you start spending more time debugging the abstraction than it would take to simply create rounded corner images, then the tool is getting in the way.
I'm speaking specifically about CSS3 PIE here because it's near and dear to me (I'm its creator), but similar caveats apply to other polyfills like Selectivizr. This goes for any tool: you always have to evaluate the pros/cons for your specific needs. For example I wouldn't recommend PIE for a high-traffic, performance-critical, highly interactive site like Twitter for the reasons stated above, but it really shines on simpler more static designs.
...Another thought is that it's perfectly valid in many cases to simply let IE degrade to square corners etc. This is always the preferred approach IMO, if possible given your particular situation. So in that case it's not due to any evaluation of the tool, but just a decision that what the tool provides is simply not needed in the first place. :)
When I ask questions about achieving some particular layout in LaTeX, I get answers suggesting I should use constructs that don't match their semantics. For example, I wanted to indent a single paragraph, and I was told to make it a list with no bullets. It works, but that isn't the semantic meaning of a list, so why is it acceptable to abuse it like that?
We stopped doing it in HTML over a decade ago. Why are we still doing the equivalent of table layout in supposedly the best typesetting system there is?
Maybe I'm not getting it, but isn't this a little inelegant? Everyone says LaTeX is elegant and that you don't need to worry about layout, but then I find myself contorting tables, lists and other semantic markup to put stuff where I want it. Does the emperor have no clothes, or am I not getting it?
When a problem like this comes along, and the answer is to use something that doesn't really make semantic sense, what you should do is create a new environment or command that wraps the functionality in a way that makes semantic sense.
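For the indentation example, a minimal sketch of that idea (the environment name here is only an illustration) could be:
% Hide the "list with no bullets" trick behind a semantically named
% environment; "indentedpara" is only an example name.
\newenvironment{indentedpara}
  {\begin{list}{}{\setlength{\leftmargin}{2em}}\item[]}
  {\end{list}}

% Usage: the document now says what it means.
\begin{indentedpara}
  This single paragraph is indented on the left.
\end{indentedpara}
If you later decide indented paragraphs should be typeset differently, you change the definition in one place and the markup stays semantic.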
Every layout language has this problem -- somewhere along the line, you need to get down to a physical, non-semantic solution. In HTML, the non-semantic parts of the solution are now pretty well covered by CSS and JavaScript (which are different languages from HTML). You create <div>s and <span>s that capture the semantics, and then you use CSS and JavaScript to define the physical layout for those semantics.
In LaTeX, you simply wind up using the exact same language for this purpose: LaTeX (or plain TeX, which is often hard to differentiate from LaTeX).
I'd say it's all a matter of knowing or finding the right semantics. You talk about a single example, and you don't give its semantics, only how you want it laid out. Depending on what it is that you want indented, there might be better fits, e.g. a quote, a formula, etc.
I wonder what sort of things you look for when you start working on an existing system that is new to you. Let's say the system is quite big (whatever that means to you).
Some of the things that were identified are:
Where is a particular subroutine or procedure invoked?
What are the arguments, results and predicates of a particular function?
How does the flow of control reach a particular location?
Where is a particular variable set, used or queried?
Where is a particular variable declared?
Where is a particular data object accessed, i.e. created, read, updated or deleted?
What are the inputs and outputs of a particular module?
But if you look for something more specific, or any of the above questions is particularly important to you, please share it with us :)
I'm particularly interested in information that could be extracted through dynamic analysis/execution.
I like to use a "use case" approach:
First, I ask myself "what's this software's purpose?": I try to identify how users are going to interact with the application;
Once I have some "use cases", I try to understand which objects are most involved and how they interact with the other objects.
Once I've done this, I draw a UML-style diagram describing what I've just learned, for further reference. What happens next depends on the task I've been assigned, e.g. modify the code, document the code, etc.
There is the question of what motivation I have for learning the new system:
Bug fix/minor enhancement - In this case, I may focus solely on the portion of the system that performs the specific function that needs to be altered. This is a way to break down a huge system, and it is also a way to identify whether the issue is something I can fix or something I have to hand back to the vendor whose off-the-shelf software we are using; e.g. a CRM, CMS, or ERP system can be a customized off-the-shelf system, so there are many pieces to it.
Project work - This is the other case, where I'd probably try to build myself a view from 30,000 feet or so to learn what the high-level components are and which areas of the system the project impacts. An example of this is joining a company and working off an existing code base, where I don't have the luxury of the narrow focus of the previous case. Part of building that view is looking for patterns in the code in terms of naming conventions, project structure, etc., as this may be useful once I start changing code in the system. I'd probably also do some tracing through the system and try to see where the uglier parts of the code are - by uglier I mean the kludge-like parts, possibly with some spaghetti code, that were rushed when first written and are now being reworked heavily.
To my mind, another way to view this is as the question of whether I'm going to spend days or weeks wrapping my head around the system, as in the second case, or whether it should hopefully take only a few hours (optimistically, that is) to get my footing and make the necessary changes.
I'm not talking about HTML tags, but about tags used to describe blog posts, YouTube videos or questions on this site.
If I were crawling just a single website, I'd just use an XPath expression to extract the tags, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.
I can imagine using some simple heuristics, like finding all HTML elements with id or class of 'tag', etc. However, this is pretty brittle and will probably fail for a huge number of web pages. What approach do you guys recommend for this problem?
Also, I'm aware of Zemanta and Open Calais, which both have ways to guess the tags for a piece of text, but that's not really the same as extracting tags real humans have already chosen. But I would still love to hear about any other services/APIs to guess the tags in a document.
EDIT: Just to be clear, a solution that already works for this would be great. But I'm guessing there's no open-source software that already does this, so I really just want to hear from people about possible approaches that could work for most cases. It need not be perfect.
EDIT2: For people suggesting that a general solution that usually works is impossible, and that I must write custom scrapers for each website/engine: consider the arc90 readability tool. This tool is able to extract the article text for any given article on the web with surprising accuracy, using some sort of heuristic algorithm, I believe. I have yet to dig into their approach, but it fits into a bookmarklet and does not seem too involved. I understand that extracting an article is probably simpler than extracting tags, but it should serve as an example of what's possible.
Systems like the arc90 example you give work by looking at things like tag/text ratios and other heuristics: there is sufficient difference between the text content of the page and the surrounding ads/menus etc. Other examples include tools that scrape emails or addresses; there the patterns can be detected and the locations recognized. In the case of tags, though, you don't have much to help you uniquely distinguish a tag from normal text; it's just a word or phrase like any other piece of text. A list of tags in a sidebar is very hard to distinguish from a navigation menu.
Some blogs, like Tumblr, do have tags whose URLs contain the word "tagged", which you could use. WordPress similarly has ".../tag/..." style URLs for tags. Solutions like this would work for a large number of blogs independently of their individual page layout, but they won't work everywhere.
If the sources expose their data as a feed (RSS/Atom) then you may be able to get the tags (or labels/categories/topics etc.) from this structured data.
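A rough sketch of that, assuming you have already fetched the feed XML into a string (names here are invented; a real implementation should use a proper XML/feed parser rather than Lua patterns):
-- Rough sketch: pull category/tag names out of feed XML held in `xml`.
-- Lua patterns keep the example self-contained; use a real XML parser
-- in production.
local function feed_tags(xml)
  local tags = {}
  -- RSS style: <category>lua</category>
  for t in xml:gmatch("<category[^>]*>([^<]+)</category>") do
    tags[#tags + 1] = t
  end
  -- Atom style: <category term="lua"/>
  for t in xml:gmatch('<category[^>]-term=["\']([^"\']+)["\']') do
    tags[#tags + 1] = t
  end
  return tags
end

local sample = '<category>lua</category><category term="undo-redo"/>'
for _, t in ipairs(feed_tags(sample)) do print(t) end  -- lua, undo-redo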
Another option is to parse each web page and look for tags formatted according to the rel=tag microformat.
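A minimal sketch of that idea, combined with the "/tag/"-style URL heuristic mentioned in an earlier answer (again using plain Lua patterns just to keep it self-contained; a real crawler should use an HTML parser):
-- Minimal sketch: collect anchor texts that either carry rel="tag"
-- or whose href contains /tag/ or /tagged/. Lua patterns are used
-- only to keep the example self-contained.
local function extract_tags(html)
  local tags = {}
  for attrs, text in html:gmatch("<a(.-)>(.-)</a>") do
    local href = attrs:match('href%s*=%s*["\']([^"\']+)') or ""
    local is_rel_tag = attrs:find('rel%s*=%s*["\'][^"\']*tag') ~= nil
    local is_tag_url = href:find("/tag/", 1, true) ~= nil
                    or href:find("/tagged/", 1, true) ~= nil
    if is_rel_tag or is_tag_url then
      tags[#tags + 1] = text
    end
  end
  return tags
end

local sample = '<a href="/tagged/lua" rel="tag">lua</a> <a href="/about">About</a>'
for _, t in ipairs(extract_tags(sample)) do print(t) end  -- prints: lua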
Damn, I was just going to suggest Open Calais. There's going to be no "great" way to do this. If you have some target platforms in mind, you could sniff for WordPress, then look at its link structure, and again for Flickr...
I think your only option is to write custom scripts for each site. To make things easier, though, you could look at AlchemyAPI. They have similar entity extraction capabilities to OpenCalais, but they also have a "Structured Content Scraping" product which makes this a lot easier than writing XPaths, by using simple visual constraints to identify pieces of a web page.
This is impossible in general because there isn't a well-known, widely followed specification. Even different versions of the same engine can create different output - and with WordPress a user can even create his own markup.
If you're really interested in doing something like this, you should know it's going to be a really time-consuming and ongoing project: you'd have to create a library that detects which "engine" is being used on a page and parses it accordingly. When you can't detect a page for some reason, you create new rules for it and move on.
I know this isn't the answer you're looking for, but I really can't see another option. I'm into Python, so I would use Scrapy for this, since it's a full framework for scraping: complete, well documented and really extensible.
Try making a Yahoo Pipe and running the source pages through the Term Extractor module. It may or may not give great results, but it's worth a try. Note - enable the V2 engine.
Looking at arc90, it seems they are also asking publishers to use semantically meaningful mark-up [see https://www.readability.com/publishers/guidelines/#view-exampleGuidelines] so that they can parse it rather easily. But presumably they must either have developed generic rules, such as the tag/text ratios #dunelmtech suggested, which can work for article detection, or they might be using a combination of text-segmentation algorithms (from the Natural Language Processing field) such as TextTiler and C99, which could be quite useful for article detection - see http://morphadorner.northwestern.edu/morphadorner/textsegmenter/ and google for more info on both [published in the academic literature - try Google Scholar].
However, detecting "tags" as you require seems to be a difficult problem (for the reasons already mentioned in the comments above). One approach I would try would be to use one of the text-segmentation algorithms (C99 or TextTiler) to detect the article start/end, and then look for DIVs / SPANs / ULs whose CLASS and ID attributes contain "tag". Since, in terms of page layout, tags tend to sit just underneath the article and just above the comment feed, this might work surprisingly well.
Anyway, it would be interesting to see whether you get somewhere with the tag detection.
Martin
EDIT: I just found something that might really be helpful. The algorithm is called VIPS [see: http://www.zjucadcg.cn/dengcai/VIPS/VIPS.html] and stands for Vision-Based Page Segmentation. It is based on the idea that page content can be visually split into sections. Compared with DOM-based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information such as navigation, advertisements and decoration can easily be removed because it is often placed in particular positions on a page. This could help you detect the tag block quite accurately!
There is a term extractor module for Drupal (http://drupal.org/project/extractor), but it's only for Drupal 6.