I need a library to parse HTML, change some attributes of some elements, then write the result back out as HTML.
Is there a library for it?
In other languages (like PHP), there are DOM parsers. I found libraries for parsing HTML, but none of them allowed manipulation and generation (or I did not see it?).
Have you seen Floki?
Floki is a simple HTML parser that enables search for nodes using CSS selectors.
Its parsing capabilities seem decent, but I'm not sure about manipulation. It has a transform method, but there are no usage examples.
It seems there is no single library that parses, manipulates, and writes HTML.
There is Eml, which can write HTML, and although it has a parser, it says it is not a multi-purpose one. I think it will do for now, but for other uses one might want to combine another parser with Eml for a better solution.
PS. I will accept this answer if no one provides a better solution in a couple of days.
Related
For better Unit testing of my language, I want to test every rule separately.
However, the ParseHelper can only parse input that fully corresponds to the defined grammar.
Consider a language like HTML. I want to test parsing paragraphs without having to nest them in html->head->body etc.
I think ANTLR offers similar possibilities.
Is this achievable in Xtext too?
Github issue https://github.com/eclipse/xtext-core/issues/16 addresses this feature. I guess the next release will do what you need.
I'm not talking about HTML tags, but tags used to describe blog posts, or youtube videos or questions on this site.
If I was crawling just a single website, I'd just use an xpath to extract the tag out, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.
I can imagine using some simple heuristics, like finding all HTML elements with id or class of 'tag', etc. However, this is pretty brittle and will probably fail for a huge number of web pages. What approach do you guys recommend for this problem?
Also, I'm aware of Zemanta and Open Calais, which both have ways to guess the tags for a piece of text, but that's not really the same as extracting tags real humans have already chosen. But I would still love to hear about any other services/APIs to guess the tags in a document.
EDIT: Just to be clear, a solution that already works for this would be great. But I'm guessing there's no open-source software that already does this, so I really just want to hear from people about possible approaches that could work for most cases. It need not be perfect.
EDIT2: For people suggesting a general solution that usually works is impossible, and that I must write custom scrapers for each website/engine, consider the arc90 readability tool. This tool is able to extract the article text for any given article on the web with surprising accuracy, using some sort of heuristic algorithm I believe. I have yet to dig into their approach, but it fits into a bookmarklet and does not seem too involved. I understand that extracting an article is probably simpler than extracting tags, but it should serve as an example of what's possible.
Systems like the arc90 example you give work by looking at things like tag/text ratios and other heuristics. There is sufficient difference between the text content of the pages and the surrounding ads, menus, etc. Other examples include tools that scrape emails or addresses. There, patterns can be detected and locations can be recognized. In the case of tags, though, you don't have much to help you uniquely distinguish a tag from normal text; it's just a word or phrase like any other piece of text. A list of tags in a sidebar is very hard to distinguish from a navigation menu.
Some blogs like tumblr do have tags whose urls have the word "tagged" in them that you could use. Wordpress similarly has ".../tag/..." type urls for tags. Solutions like this would work for a large number of blogs independent of their individual page layout but they won't work everywhere.
If the sources expose their data as a feed (RSS/Atom) then you may be able to get the tags (or labels/categories/topics etc.) from this structured data.
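As a rough illustration of the feed approach, here is a minimal Python sketch using only the standard library. It assumes the feed is RSS 2.0 (tags in `<category>` elements) or Atom (tags in the `term` attribute of `<category>`); the function name `feed_tags` and the sample feed are made up for this example, and real feeds may use other elements or encodings.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace; RSS 2.0 elements have none.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def feed_tags(feed_xml):
    """Collect category labels from an RSS 2.0 or Atom feed string.

    Hypothetical helper: a sketch, not a complete feed parser.
    """
    root = ET.fromstring(feed_xml)
    tags = []
    # RSS 2.0 style: <category>label</category>
    for el in root.iter("category"):
        if el.text and el.text.strip():
            tags.append(el.text.strip())
    # Atom style: <category term="label"/>
    for el in root.iter(ATOM_NS + "category"):
        term = el.get("term")
        if term:
            tags.append(term)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(tags))


# Made-up sample feed, for illustration only.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example</title>
<item><title>Post</title><category>python</category><category>scraping</category></item>
<item><title>Other</title><category>python</category></item>
</channel></rss>"""

print(feed_tags(SAMPLE_RSS))
```

This only helps when the site actually publishes a feed and the feed carries categories, so it would be one signal among several rather than a complete answer.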
Another option is to parse each web page and look for tags formatted according to the rel=tag microformat.
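The link-based heuristics (rel=tag, plus the "/tag/" and "/tagged/" URL patterns mentioned for WordPress and Tumblr) could be combined along these lines. This is a minimal sketch using Python's standard `html.parser`; the regular expression, the class name `TagLinkParser`, and the sample markup are assumptions made for illustration, and `extract_tags` just reuses the function name from the question.

```python
import re
from html.parser import HTMLParser

# Assumed URL patterns, based on the Tumblr ("/tagged/") and
# WordPress ("/tag/") examples; other engines will differ.
TAG_URL = re.compile(r"/(?:tag|tagged)/([^/?#]+)")


class TagLinkParser(HTMLParser):
    """Collect the text of <a> elements that look like tag links."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self._in_tag_link = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel_values = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href") or ""
        # A link counts if it carries rel="tag" or its URL matches the pattern.
        if "tag" in rel_values or TAG_URL.search(href):
            self._in_tag_link = True
            self._text = []

    def handle_data(self, data):
        if self._in_tag_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_tag_link:
            # Collapse runs of whitespace inside the link text.
            label = " ".join("".join(self._text).split())
            if label and label not in self.tags:
                self.tags.append(label)
            self._in_tag_link = False


def extract_tags(html):
    parser = TagLinkParser()
    parser.feed(html)
    parser.close()
    return parser.tags


# Made-up sample page fragment.
SAMPLE = """
<nav><a href="/about">About</a></nav>
<p>Filed under <a rel="tag" href="/topics/python">python</a>,
<a href="/blog/tag/web-scraping/">web scraping</a> and
<a href="https://example.tumblr.com/tagged/html">html</a>.</p>
"""

print(extract_tags(SAMPLE))
```

As noted above, this would cover many blogs regardless of layout, but not sites that use neither the microformat nor these URL shapes.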
Damn, was just going to suggest Open Calais. There's going to be no "great" way to do this. If you have some target platforms in mind, you could sniff for Wordpress, then see their link structure, and again for Flickr...
I think your only option is to write custom scripts for each site. To make things easier, though, you could look at AlchemyApi. They have similar entity extraction capabilities to OpenCalais, but they also have a "Structured Content Scraping" product which makes it a lot easier than writing xpaths by using simple visual constraints to identify pieces of a web page.
This is impossible because there isn't a well-known, widely followed specification. Even different versions of the same engine could create different outputs - hey, using Wordpress a user can create his own markup.
If you're really interested in doing something like this, you should know it's going to be a really time-consuming and ongoing project: you're going to create a lib that detects which "engine" is being used in a page, and parses it. If you can't detect the engine for a page for some reason, you create new rules to parse it and move on.
I know this isn't the answer you're looking for, but I really can't see another option. I'm into Python, so I would use Scrapy for this since it's a complete framework for scraping: it's complete, well documented and really extensible.
Try making a Yahoo Pipe and running the source pages through the Term Extractor module. It may or may not give great results, but it's worth a try. Note - enable the V2 engine.
Looking at arc90, it seems they are also asking publishers to use semantically meaningful mark-up [see https://www.readability.com/publishers/guidelines/#view-exampleGuidelines] so they can parse it rather easily. Presumably, though, they must either have developed generic rules such as the tag/text ratios #dunelmtech suggested, which can work for article detection, or they might be using a combination of text-segmentation algorithms (from the Natural Language Processing field) such as TextTiler and C99, which could be quite useful for article detection. See http://morphadorner.northwestern.edu/morphadorner/textsegmenter/ and search Google Scholar for more on both; they are published in the academic literature.
However, detecting "tags" as you require seems to be a difficult problem (for the reasons already mentioned in the comments above). One approach I would try is to use one of the text-segmentation algorithms (C99 or TextTiler) to detect the article start/end and then look for DIVs / SPANs / ULs with CLASS and ID attributes containing ..tag.. in them. Since, in terms of page layout, tags tend to sit underneath the article and just above the comment feed, this might work surprisingly well.
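The second half of that idea (looking inside DIV/SPAN/UL elements whose class or id mentions "tag") might look roughly like the Python sketch below. It uses only the standard `html.parser` and skips the text-segmentation step entirely; the names `TagBlockParser` and `tags_from_marked_blocks` and the sample markup are invented for this example.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBlockParser(HTMLParser):
    """Collect link text inside div/span/ul elements whose class or id
    contains the substring 'tag'.

    Caveats: the substring test also matches names like 'stage', and the
    depth counter assumes reasonably well-formed, explicitly closed markup.
    """

    def __init__(self):
        super().__init__()
        self.tags = []
        self._depth = 0        # 0 means "outside any matching container"
        self._in_link = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self._depth:
            # Already inside a matching container: track nesting and links.
            self._depth += 1
            if tag == "a":
                self._in_link = True
                self._text = []
            return
        attrs = dict(attrs)
        marker = ((attrs.get("class") or "") + " " + (attrs.get("id") or "")).lower()
        if tag in ("div", "span", "ul") and "tag" in marker:
            self._depth = 1

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self._depth:
            return
        if tag == "a" and self._in_link:
            label = " ".join("".join(self._text).split())
            if label:
                self.tags.append(label)
            self._in_link = False
        self._depth -= 1


def tags_from_marked_blocks(html):
    parser = TagBlockParser()
    parser.feed(html)
    parser.close()
    return parser.tags


# Made-up sample page fragment.
SAMPLE = """
<ul id="menu"><li><a href="/">Home</a></li></ul>
<div class="article"><p>Body text with a <a href="/x">link</a>.</p></div>
<ul class="post-tags"><li><a href="/t/elixir">elixir</a></li><li><a href="/t/html">html</a></li></ul>
<div id="comments"><a href="/c/1">first comment</a></div>
"""

print(tags_from_marked_blocks(SAMPLE))
```

Restricting the search to the region between the detected article end and the comment feed, as suggested above, would presumably cut down on false matches from sidebars and menus.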
Anyway, would be interesting to see whether you got somewhere with the tag detection.
Martin
EDIT: I just found something that might be really helpful. The algorithm is called VIPS [see: http://www.zjucadcg.cn/dengcai/VIPS/VIPS.html] and stands for Vision Based Page Segmentation. It is based on the idea that page content can be visually split into sections. Compared with DOM based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information, such as navigation, advertisement, and decoration can be easily removed because they are often placed in certain positions of a page. This could help you detect the tag block quite accurately!
There is a term extractor module in Drupal (http://drupal.org/project/extractor), but it's only for Drupal 6.
Out of curiosity, I wonder what people can do with parsers, how they are applied, and what people usually create with them?
I know it's widely used in programming language industry, however I think this is just a tiny portion of it, right?
Besides special-purpose languages, my most ambitious use of a parser generator yet (with good old yacc back in C, and again later with pyparsing in Python) was to extract, validate and possibly alter certain meta-info from SQL queries -- parsing SQL properly is a real challenge (especially if you hope to support more than one dialect!-), and a parser generator (and the lexer it sits on top of) at least removes THAT part of the job!-)
They are used to parse text....
To give a more concrete example, where I work we use lex/yacc to parse strings coming over sockets.
Also from the name it should give you an idea what javacc is used for (java compiler compiler!)
Generally to parse Domain Specific Languages or scripting languages, or to provide similar support for code snippets.
Previously I have seen it used to parse the command-line output of another software tool. This way the outer tool (VPN software) could re-use the base router IPSec code without modification, since a lot of what was being parsed was IP route tables and other structured, repeated text.
Using a parser allowed simple changes when the formatting changed, instead of trying to find and tweak a hand-written parser. And the output did change a few times over the life of the product.
I used parsers to help process +/- 800 Clipper source files into similar PRGs that could be compiled with Alaska Xbase 32.
You can use it to extend your favorite language by getting its language definition from their repository and then adding what you've always wanted to have. You can pass the regular syntax to your application and handle the extension in your own program.
I have a question about parsing HTML pages, specifically forums.
I want to parse a forum or thread for posts matching certain criteria. I haven't defined the
algorithm yet, since I have only parsed structured text formats before.
A use case might be to copy and paste each thread into the program by hand, or to enter a URL like
http://www.forums.com/forum/showthread.php?t=46875&page=3 and let the program parse the pages.
Given all this i would like to know:
Is it possible to parse a forum thread on a HTML page?
What would be the best/fastest/easiest language for doing this?
If I prefer Java, what tools/libraries do I need for this?
Any other thing i should consider?
1 / yes
2 / Use some compact language like python or ruby for prototyping.
For python there is a neat library for HTML/XML parsing called beautifulsoup
For ruby, you could try: nokogiri or hpricot
3 / A Java tool to consider: htmlparser
4 / If you are interested only in some particular text or some special classes, a regular expression might be sufficient. But as soon as you want to dig deeper into the structure of the content, you'll need some kind of model to hold your data, and hence a parser, which, in the best case, can cope with the inconsistencies that occur in real-world HTML.
You might want to look into some sort of HTML parsing library, rather than using regular expressions to do this. There are some really good HTML parsers for Ruby and Python, but a quick Google search shows there are a number of parsers for Java as well. The benefit of these libraries is that you don't have to handle every edge case with regular expressions and that they handle malformed HTML (both of which can be impossible with regexes, depending on what you want to do). They also give you a much better way of dealing with the data (for example, Beautiful Soup lets you grab all elements which belong to a specific class, or use some other CSS selector to limit which page elements you want to deal with).
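To make the "grab all elements of a specific class" idea concrete without depending on a third-party library, here is a minimal Python sketch built on the standard `html.parser` rather than Beautiful Soup. The class name `post`, the helper `forum_posts`, and the sample markup are assumptions; a real forum will use its own class names, and the nesting counter assumes elements are explicitly closed.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class PostParser(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, post_class="post"):  # "post" is a hypothetical class name
        super().__init__()
        self.post_class = post_class
        self.posts = []
        self._depth = 0     # nesting depth inside the current post; 0 = outside
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self._depth:
            self._depth += 1
        elif self.post_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # The post element just closed: store its whitespace-normalized text.
            self.posts.append(" ".join("".join(self._chunks).split()))


def forum_posts(html, post_class="post"):
    parser = PostParser(post_class)
    parser.feed(html)
    parser.close()
    return parser.posts


# Made-up sample thread fragment.
SAMPLE = """
<div class="header">Forum index</div>
<div class="post first"><b>alice</b>: How do I parse <br>this thread?</div>
<div class="post"><b>bob</b>: Use an HTML parser.</div>
"""

for post in forum_posts(SAMPLE):
    print(post)
```

A dedicated library such as Beautiful Soup would be more forgiving of malformed markup than this hand-rolled depth counter, which is the point made above.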
Personally, I would, at least for the beginning, start in ruby or python, as the libraries are known and there is a lot of info about using them for this purpose. Also, I find it easier to quickly prototype these types of things in ruby or python than in the jvm. You could even later bring that code onto the jvm with jruby or jython, if it becomes necessary.
yes
regular expressions, any flavor.
probably the ones w/regex
there are tools out there that will do this for you.
Can anyone (maybe an XSL-fan?) help me find any advantages with handling presentation of data on a web-page with XSL over ASP.NET MVC?
The two alternatives are:
ASP.NET (MVC/WebForms) with XSL
Getting the data from the database and transforming it to XML which is then displayed on the different pages with XSL-templates.
ASP.NET MVC
Getting the data from the database as C# objects (or LinqToSql/EF-objects) and displaying it with inline-code on MVC-pages.
The main benefit of XSL has been consistent display of data on many different pages, like WebControls. So, correct me if I'm wrong, ASP.NET MVC can be used the same way, but with strongly typed objects. Please help me see if there are any benefits to XSL.
I can see the main benefit of employing XSLT to transform your data and display it to the user would be the following:
The data is already in an XML format
The data follows a well defined schema (this makes using tools like XMLSpy much easier).
The data needs to be transformed into a number of different output formats, e.g. PDF, WMP and HTML
If this is to be the only output for your data, and it is not in XML format, then XSLT might not be the best solution.
Likewise, if user interaction is required (such as editing of the data), you will end up employing back-end code anyway to handle updates, so XSLT might prove one technology too far...
I've always found two main issues when working with XML transformations:
Firstly they tend to be quite slow, the whole XML file must be parsed and validated before you can do anything with it. Being XML it's also excessively verbose, and therefore larger than it needs to be.
Secondly the way transformations work is a bit of a pain to code - custom tools like XmlSpy help, but it's still a different model to what most developers are used to.
At the moment MVC is very quick and looking very promising, but does suffer from the traditional web-development blight of <% and %> bee-stings all over your code. Using XML transformations avoids that, but is much harder to read and maintain.
I've used that technique in the past, and there are applications where we use it at my current place of employment. (I will admit, I am not totally a fan of it, but I'll play devil's advocate.) Really, that is one of the main advantages, and pushing this idea can be kinda neat. You're able to dynamically create the XSL on the fly and change the look and feel of the page on a whim. Is it possible to do this through the other methods? Yes, but it's really easy to build a program to modify an XML/XSL document on the fly.
If you think of using XSL to transform one xml document to another and displaying it as html (which is really what you're doing), you're opening up your system to allow other programs to access the data on the page via XML. You can do this through the other methods, but using an xsl transformation forces it to output xml every time.
I would tread lightly with creating a system this way. You'll find a lot of pitfalls you aren't expecting, and if you don't know XSL really, really well, there is going to be a learning curve too.
Check this out if you want to use XSLT and ASP.NET MVC
http://www.bleevo.com/2009/06/aspnet-mvc-xslt-iviewengine/
Jafar Husain offers a few advantages in his proposal for Pretty XSL, primarily caching of the stylesheet to speed up page loads and reduce the size of your data. Steve Sanderson proposed a slightly different approach using JavaScript as the controller.
Another, similar approach would be to use XForms, though the best support for it is through a JavaScript library.
If you are only going to display data from the DB, XSL templates may be a convenient solution, but if you are going to handle user interaction, hm... I don't think it will be maintainable at all.