When I ask questions about achieving some particular layout in LaTeX, I get answers that suggest constructs that don't match their semantics. For example, I wanted to indent a single paragraph, and I was told to make it a list with no bullets. It works, but that isn't the semantic meaning of a list, so why is it acceptable to abuse it like that?
We stopped doing it in HTML over a decade ago. Why are we still doing the equivalent of table layout in supposedly the best typesetting system there is?
Maybe I'm not getting it, but isn't this a little inelegant? Everyone says LaTeX is elegant and that you don't need to worry about layout, but then I find myself contorting tables, lists and other semantic markup to put stuff where I want it. Does the emperor have no clothes, or am I missing something?
When a problem like this comes along and the answer is to use something that doesn't really make semantic sense, what you should do is create a new environment or command that wraps the functionality in a way that does make semantic sense.
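For your indented-paragraph example, a minimal sketch (the name indentpar is my invention) would be to hide the bulletless-list trick behind an environment named for what the content is:

% define once in the preamble; the optional argument is the indent width
\newenvironment{indentpar}[1][2em]
  {\begin{list}{}{\setlength{\leftmargin}{#1}}\item[]}
  {\end{list}}

% usage: the body now carries the semantics, and the physical
% layout lives in exactly one place
\begin{indentpar}
  This paragraph is indented, but the markup says what it is.
\end{indentpar}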
Every layout language has this problem -- somewhere along the line, you need to get down to a physical, non-semantic solution. In HTML, the non-semantic parts of the solution are now pretty well covered by CSS and JavaScript (which are different languages from HTML). You create <div>s and <span>s that capture the semantics, and then you use CSS and JavaScript to define the physical layout for those semantics.
In LaTeX, you simply wind up using the exact same language for this purpose: LaTeX (or plain TeX, which is often hard to differentiate from LaTeX).
I'd say it's all a matter of knowing or finding the right semantics. You talk about a single example, but you don't provide its semantics; you only talk about how to lay it out. Depending on what it is that you want indented, there might be a better fit, e.g. a quotation, a formula, etc.
I'm doing a project which involves parsing the histories of common lisp repos. I need to parse them into list-of-lists or something like that. Ideally, I'd like to preserve as much of the original source file syntax as possible, in some way. For example, in the case of the text #+sbcl <something>, which I think means "If our current lisp is sbcl, read <something>, otherwise skip it", I'd like to get something like (#+ 'sbcl <something>).
I originally wrote a LALR parser in Python, which sort of worked, but it's not ideal for many reasons. I'm having a lot of difficulty getting correct output, and I have tons of special cases to add.
I figured that what I should really do is use Lisp itself, since it already has a Lisp parser built in. If I could just read a file into sexps, I could dump it into something (cl-json would do) for further processing down the line.
Unfortunately, when I attempt to read https://github.com/fukamachi/woo/blob/master/src/woo.lisp, I get the error
There is no package with the name WOO.EV.TCP
which is of course coming from line 80 of that file, since that package is defined in src/ev/tcp.lisp, and we haven't read it.
Basically, is it possible to just read the file into sexps without caring whether the packages are defined or if they contain the relevant symbols? If so, how? I've tried looking at the hyperspec reader documentation, but I don't see anything that sounds relevant.
I'm out of practice with actually writing Common Lisp, but it seems potentially possible to hack around this by handling the undefined-package condition by creating a blank package with that name, and handling the no-symbol-of-that-name-in-package condition by just interning the symbol. I think. I don't know how to actually do this, whether it would work, or how many special cases would be involved. Offhand, the first condition is called no-such-package, but the second one (at least in SBCL) is a simple-error, so I don't even know how to determine whether a particular simple-error is the no-such-symbol-in-that-package error, let alone how to extract the relevant names from the condition, fix it, and restart. I'd really like to hear from a Common Lisp expert that this is the right thing to do before I go down this road, because it will involve a lot of learning.
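Something like the following is what I have in mind for the missing-package half, though I have no idea how portable it is. It assumes the error is a package-error whose package-error-package slot names the missing package (which would need checking per implementation), and it just restarts reading the whole file from the top after creating each missing package:

;; Rough sketch, not portable: create missing packages as the reader
;; trips over them, re-reading the file from scratch each time.
;; *READ-EVAL* is NIL so #. forms in random repos can't run code.
(defun read-file-leniently (pathname)
  (loop
    (handler-case
        (with-open-file (in pathname)
          (let ((*read-eval* nil))
            (return (loop for form = (read in nil in)
                          until (eq form in)
                          collect form))))
      (package-error (c)
        ;; if MAKE-PACKAGE can't fix it, this loops forever
        (make-package (package-error-package c))))))

The missing-symbol case would still blow up, of course.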
It also occurs to me that I could fix this by just sed-ing the file before reading it, e.g. turning woo.ev.tcp:start-listening-socket into, say, woo.ev.tcp===start-listening-socket. I don't particularly like this solution, and it's not clear that I wouldn't run into tons more ugly special cases, but it might work if there's no better answer.
I am almost sure there is no easy portable way to do this for a number of reasons.
(Just limiting things to the non-existent-package problem for now.)
First of all, there is no portable access to the part of the reader which decides that tokens are going to be symbols and then looks for package markers etc.: that just happens according to the rules in section 2.3 of the standard. So you can't easily intervene in this.
Secondly, no condition the reader might signal portably carries enough information to let you handle it.
There are several possible ways out of this bit of the problem.
If you felt sufficiently heroic, you might be able to teach the reader that all of the token-starting characters are in fact things you control, and then write a token reader that somehow deals with the whole package thing by returning some object which isn't a symbol. But to do that you need to deal with numbers too, and if you think that's simple, well, it's not.
If you felt less heroic, you could write a more primitive token reader which doesn't even try to deal with anything except grabbing all the characters needed, and returns some kind of object which wraps a string. This avoids the whole number problem, at the cost of losing a lot of information.
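Something along these lines, where all the names are mine and only a few sample characters are claimed:

;; Sketch only: wrap each token's characters in a structure instead of
;; interning a symbol. A real version must claim *every* token-starting
;; character and handle single and multiple escapes, which is the
;; painful part.
(defstruct raw-token text)

(defun read-raw-token (stream char)
  (let ((chars (list char)))
    (loop for c = (peek-char nil stream nil nil)
          while (and c (not (member c '(#\Space #\Tab #\Newline
                                        #\( #\) #\" #\; #\'))))
          do (push (read-char stream) chars))
    (make-raw-token :text (coerce (nreverse chars) 'string))))

(defun make-raw-readtable ()
  (let ((rt (copy-readtable)))
    (dolist (c '(#\w #\o #\t))   ; ... and so on for every token starter
      (set-macro-character c #'read-raw-token nil rt))
    rt))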
If you don't care about portability: find an implementation, understand how its reader does it, and muck around with it. There are more open-source or source-available implementations than I can easily count (perhaps I am not very good at counting), so this is a pretty good approach. It's certainly what I'd do.
But this is only the start of the problems. The CL reader is hairy and, in its standard configuration (the configuration which is used for things like compile-file unless people have arranged otherwise) can run completely arbitrary code at read time, including code which modifies the reader itself, some of which may do so in an implementation-dependent way. And people use this: there's a reason Lisp is called the 'programmable programming language' and it's that people program it.
I've decided to solve this using sed (actually Python's re.sub, but who's counting?) because it'll work for my actual use case, and was easy.
For future readers: the various people saying this is impossible in general are probably right. The other questions posted by @Svante look like good, easy ways to solve part of the problem. Other parts of the problem might be solved more elegantly by replacing the reader macros for #., #+, #-, etc. with ones which just make a list, which sounds less heroic than the suggestions from @tfb, but I don't have time for that shit.
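For the record, the #+/#- replacement I have in mind would look something like this sketch (untested, names mine); it reads the feature expression in the keyword package the way the real #+ does, and returns a list instead of conditionally skipping a form:

;; Replace #+ and #- so they build lists instead of skipping forms.
(defun fake-feature-reader (stream subchar arg)
  (declare (ignore arg))
  (let ((feature (let ((*package* (find-package :keyword)))
                   (read stream t nil t)))
        (form (read stream t nil t)))
    (list (if (char= subchar #\+) '|#+| '|#-|) feature form)))

(defun make-lenient-readtable ()
  (let ((rt (copy-readtable)))
    (set-dispatch-macro-character #\# #\+ #'fake-feature-reader rt)
    (set-dispatch-macro-character #\# #\- #'fake-feature-reader rt)
    rt))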
As you know, if we use a form helper such as select_tag, the generated HTML loses its indentation and everything ends up flush against the left margin.
The generated HTML source then looks ugly.
How do you handle this problem?
And does it matter for SEO?
As long as it's valid HTML, search engines won't care about your whitespace/layout. In fact they can handle even invalid markup, to be honest. You don't get bonus points for a clean HTML layout though, I'm afraid.
Nice layout makes development/debugging easier, and you should at least try to make it look pretty. But when using code generators like this, that's not always possible.
Focus on getting it working well, and worry about prettying it up later if you have time ;)
I'm sure there is a way to modify the output, if Rails is still as awesome as it was the last time I had a play with it.
I use the erector gem to create my views; erector is basically an HTML DSL in plain Ruby which, among other things, lets you pretty-print your HTML. This is usually done only during development, as the additional whitespace would just increase the page size, burn bandwidth and slightly slow down the browser.
For this reason I cannot imagine that search engines care at all. They might have to do a little more work, so perhaps they rank it a little lower, but I doubt even that. They might care about valid HTML, which is another advantage of erector (and of other templating languages such as HAML).
We are using Jasper Reports to generate reports from our application, based on Oracle DBMS.
The thing works fine, but we are likely to need different paper formats, languages and orientations for the same document, to add columns and other elements, or to have the elements' contents change size.
Doing this in iReport/Jasper isn't easy AFAIU.
If something doesn't work, you have to move or resize elements by hand, checking that they're of appropriate size and position.
When I was a student I used LaTeX for typesetting, and it could easily handle this kind of "reshaping". Isn't there something like that?
I heard BIRT doesn't follow the "pixel position" paradigm of Jasper and Pentaho; instead it strives to handle positioning, and possibly sizing, on its own, after the user has specified the document's abstract structure, i.e. which elements are there and their relative positions.
EDIT
Forgot to mention: we are looking for a solution that involves as little code as possible. The reasons are manifold, but the most important are:
first: to avoid learning another library (we managed to stay away from Jasper's, and liked it);
second: to provide a tool that even people who aren't programmers, or at least not hardcore ones, can manage.
The lower the entry barrier the better.
For example, I know people in the humanities who can pick up LaTeX decently. They could even digest iReport. I don't know of anyone who could do the same with real-world Java.
I have just been introduced to the Lua language, and I am embarking on my first project. The biggest challenge I am facing is how to implement Undo and Redo.
To make things clear: the project is a custom text editor, so the Undo/Redo is required for editing any input text. I have managed to handle things like Cut, Copy, Clear, Find Word, changing fonts and text colour, and inserting tables and images, among others, all in Lua. Obviously there are already several custom text editors out there, but I believe an effort that caters for many cases will pave the way for future improvements. The Undo/Redo actions, however, are tearing me apart, and from my research they are missing from most existing custom text editors.
I have searched several forums, and they all seem to give the same tip: use an associative table to store the information and retrieve it from there. Honestly, I think some of these sites are just repeating knowledge acquired from other sites without any technical viewpoint of their own, because most of the suggestions I come across look alike in every respect. Of the tens of sites I have visited, none has a posted example; all I see is the same complaint from the majority of Lua users. Undoubtedly, this will seem a bit easy to some respected gurus on this forum.
I don't seem to get the true picture of the suggestions.
Can someone provide me with an example?
Undo/redo is a perfect fit for the command pattern.
First you need to write some of the text-manipulation functionality per se: just the "do" part, without worrying about undoing or redoing. That will be lots of work in itself.
You will then have a bunch of functions to manipulate your document: things like insertText(), setFont(), insertJpgImage() and such. The trick is that now you need to wrap each of these functions in a so-called command object. Each command class must have a method to execute() itself and a method to undo() itself. (Note that do is a reserved word in Lua, so the "do" method needs a different name; I'll use execute().)
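A minimal sketch of one such command class in plain Lua; the document methods getProperty() and applyProperty() are invented for illustration:

setTextPropertyCommand = {}
setTextPropertyCommand.__index = setTextPropertyCommand

function setTextPropertyCommand:new(document, area, property)
  local cmd = setmetatable({}, self)
  cmd.document, cmd.area, cmd.property = document, area, property
  return cmd
end

function setTextPropertyCommand:execute()
  -- remember the old state so undo can restore it
  self.oldProperty = self.document:getProperty(self.area)
  self.document:applyProperty(self.area, self.property)
end

function setTextPropertyCommand:undo()
  self.document:applyProperty(self.area, self.oldProperty)
end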
Now that all your text manipulation operations are represented by command objects, you execute each operation (e.g. bold some text) by something like:
boldCommand = setTextPropertyCommand:new(document, selectedArea, textProperties.bold)
boldCommand:execute() -- actually modify the text
table.insert(commandUndoStack, boldCommand) --keep the command for possible undoing later.
When you want to undo the bolding of some text you can then call:
command = table.remove(commandUndoStack)
command:undo()
table.insert(commandRedoStack, command) -- so a later redo can execute() it again
NB: if you are using some GUI framework binding in Lua, it might be the case that this framework has its own ready-made undo/redo functionality. For example, Qt (with the qtlua bindings) offers the QUndoStack class.
I'm not talking about HTML tags, but tags used to describe blog posts, or youtube videos or questions on this site.
If I was crawling just a single website, I'd just use an xpath to extract the tag out, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.
I can imagine using some simple heuristics, like finding all HTML elements with id or class of 'tag', etc. However, this is pretty brittle and will probably fail for a huge number of web pages. What approach do you guys recommend for this problem?
Also, I'm aware of Zemanta and Open Calais, which both have ways to guess the tags for a piece of text, but that's not really the same as extracting tags real humans have already chosen. But I would still love to hear about any other services/APIs to guess the tags in a document.
EDIT: Just to be clear, a solution that already works for this would be great. But I'm guessing there's no open-source software that already does this, so I really just want to hear from people about possible approaches that could work for most cases. It need not be perfect.
EDIT2: For people suggesting that a general solution that usually works is impossible, and that I must write custom scrapers for each website/engine: consider the arc90 readability tool. This tool is able to extract the article text from any given article on the web with surprising accuracy, using some sort of heuristic algorithm, I believe. I have yet to dig into their approach, but it fits into a bookmarklet and does not seem too involved. I understand that extracting an article is probably simpler than extracting tags, but it should serve as an example of what's possible.
Systems like the arc90 example you give work by looking at things like tag/text ratios and other heuristics: there is sufficient difference between the text content of a page and the surrounding ads/menus etc. Other examples include tools that scrape emails or addresses; there, there are patterns that can be detected and locations that can be recognized. In the case of tags, though, you don't have much to help you uniquely distinguish a tag from normal text: it's just a word or phrase like any other piece of text. A list of tags in a sidebar is very hard to distinguish from a navigation menu.
Some blogs, like Tumblr, do have tags whose URLs contain the word "tagged", which you could use. WordPress similarly has ".../tag/..." style URLs for tags. Solutions like this would work for a large number of blogs independently of their individual page layout, but they won't work everywhere.
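For example, with BeautifulSoup, something like this sketch (the pattern list is mine and would need extending per engine):

# collect anchor text from links whose URLs look like tag pages
import re
from bs4 import BeautifulSoup

TAG_URL = re.compile(r"/(tag|tagged|tags)/", re.I)

def tags_from_links(html):
    soup = BeautifulSoup(html, "html.parser")
    return sorted({a.get_text(strip=True)
                   for a in soup.find_all("a", href=TAG_URL)
                   if a.get_text(strip=True)})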
If the sources expose their data as a feed (RSS/Atom) then you may be able to get the tags (or labels/categories/topics etc.) from this structured data.
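With the feedparser library, for instance, both Atom categories and RSS category elements show up as entry tags:

# gather category/tag terms from a feed
import feedparser

def tags_from_feed(feed_url):
    feed = feedparser.parse(feed_url)
    return {t.term for e in feed.entries for t in e.get("tags", [])}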
Another option is to parse each web page and look for tags formatted according to the rel="tag" microformat.
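That case is even simpler, since BeautifulSoup treats rel as a multi-valued attribute, so this matches rel="tag" as well as rel="category tag":

# anchors marked with the rel="tag" microformat
from bs4 import BeautifulSoup

def microformat_tags(html):
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a", rel="tag")]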
Damn, was just going to suggest Open Calais. There's going to be no "great" way to do this. If you have some target platforms in mind, you could sniff for Wordpress, then see their link structure, and again for Flickr...
I think your only option is to write custom scripts for each site. To make things easier, though, you could look at AlchemyAPI. They have similar entity-extraction capabilities to OpenCalais, but they also have a "Structured Content Scraping" product which makes the job a lot easier than writing XPaths, by using simple visual constraints to identify pieces of a web page.
This is impossible in general because there isn't a well-known, followed specification. Even different versions of the same engine can create different output; indeed, with WordPress a user can create his own markup.
If you're really interested in doing something like this, you should know it's going to be a really time-consuming and ongoing project: you'd create a lib that detects which "engine" is being used on a page and parses it accordingly. If you can't detect a page for some reason, you create new rules to parse it and move on.
I know this isn't the answer you're looking for, but I really can't see another option. I'm into Python, so I would use Scrapy for this, since it's a complete framework for scraping: well documented and really extensible.
Try making a Yahoo Pipe and running the source pages through the Term Extractor module. It may or may not give great results, but it's worth a try. Note - enable the V2 engine.
Looking at arc90, it seems they are also asking publishers to use semantically meaningful markup [see https://www.readability.com/publishers/guidelines/#view-exampleGuidelines] so they can parse it rather easily. Presumably, though, they must either have developed generic rules, such as the tag/text ratios @dunelmtech suggested, which can work for article detection, or they are using a combination of text-segmentation algorithms from the natural-language-processing field, such as TextTiling and C99, which could be quite useful for article detection; see http://morphadorner.northwestern.edu/morphadorner/textsegmenter/ and Google Scholar for more info on both [published in the academic literature].
It seems, however, that detecting "tags" as you require is a difficult problem (for the reasons already mentioned in the comments above). One approach I would try would be to use one of the text-segmentation algorithms (C99 or TextTiling) to detect the start/end of the article, and then look for DIVs/SPANs/ULs whose CLASS or ID attributes contain "tag". Since, in terms of page layout, tags tend to sit just underneath the article and above the comment feed, this might work surprisingly well.
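A crude sketch of that last heuristic with BeautifulSoup (the article-boundary step above would still be needed to cut down false positives):

# collect link text from containers whose class/id mentions "tag"
import re
from bs4 import BeautifulSoup

def tag_block_candidates(html):
    soup = BeautifulSoup(html, "html.parser")
    tagish = re.compile(r"tag", re.I)
    found = []
    for el in soup.find_all(["div", "span", "ul"]):
        attrs = " ".join(el.get("class", [])) + " " + (el.get("id") or "")
        if tagish.search(attrs):
            found.extend(a.get_text(strip=True) for a in el.find_all("a"))
    return found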
Anyway, would be interesting to see whether you got somewhere with the tag detection.
Martin
EDIT: I just found something that might really be helpful. The algorithm is called VIPS [see: http://www.zjucadcg.cn/dengcai/VIPS/VIPS.html], which stands for Vision-based Page Segmentation. It is based on the idea that page content can be visually split into sections. Compared with DOM-based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information, such as navigation, advertisements and decoration, can be easily removed because it is often placed in certain positions on a page. This could help you detect the tag block quite accurately!
There is a term extractor module for Drupal (http://drupal.org/project/extractor), but it's only for Drupal 6.