There is a website that displays a lot of data in HTML tables. They have paged the data, so there are around 500 pages.
What is the most convenient (easy) way of getting the data in those tables and downloading it as a CSV, on Windows?
Basically I need to write a script that does something like this, but it's overkill to write it in C# and I am looking for other solutions that people with web experience use:
for (i = 1 to 500)
    load the page from http://x/page_i.html
    parse the source and get the data in the table with id='data'
    save the results to a CSV
Thanks!
I was doing a screen-scraping application once and found BeautifulSoup to be very useful. You could easily plop that into a Python script and parse across all the tags with the specific id you're looking for.
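Here's a minimal sketch of that approach, assuming the URL pattern and the table id from your question (you'd adjust both, and add error handling, for the real site):

import csv
import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4

with open('data.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    for i in range(1, 501):
        # hypothetical URL pattern, taken from the question
        html = urllib.request.urlopen('http://x/page_%d.html' % i).read()
        table = BeautifulSoup(html, 'html.parser').find('table', id='data')
        for row in table.find_all('tr'):
            writer.writerow(cell.get_text(strip=True)
                            for cell in row.find_all(['td', 'th']))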
The easiest non-C# way I can think of is to use Wget to download the page, then run HTML Tidy to convert it to XML/XHTML, and then transform the resulting XML to CSV with an XSLT stylesheet (run with MSXSL.exe).
You will have to write some simple batch files and an XSLT with a basic XPath selector.
If you feel it would be easier to just do it in C#, you can use SgmlReader to read the HTML DOM and do an XPath query to extract the data. It should not take more than about 20 lines of code.
I was able to save the following in the database:
==heading==
Some __useful contents__ and **bold texts** here
I want to output ==heading== as <h1>heading</h1>,
__useful contents__ as <i>useful contents</i>,
and **bold texts** as <b>bold texts</b>.
Thanks
I'd recommend using Markdown for this purpose (the same is used here, on SO).
You can get a package that will convert Markdown to HTML for C# here. Or, if you want the client browser to render it (to save CPU on the server), you can use this JS plugin.
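If you'd rather handle just these three markers yourself instead of adopting full Markdown, three regex substitutions cover it. A minimal sketch in Python; the same patterns port directly to C#'s Regex.Replace:

import re

def to_html(text):
    # the three markers from the question, each mapped to its tag
    text = re.sub(r'==(.+?)==', r'<h1>\1</h1>', text)
    text = re.sub(r'__(.+?)__', r'<i>\1</i>', text)
    text = re.sub(r'\*\*(.+?)\*\*', r'<b>\1</b>', text)
    return text

print(to_html('==heading==\nSome __useful contents__ and **bold texts** here'))
# <h1>heading</h1>
# Some <i>useful contents</i> and <b>bold texts</b> here

In production you'd also want to HTML-escape the stored text first, so users can't inject raw tags.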
I am knocking together a quick debugging view of a backend, as a small set of admin HTML pages (driven by AngularDart, but I'm not sure that is critical).
I get back from my XHR call a complex JSON object. I want to see that on the HTML page, formatted nicely. It doesn't have to be a great implementation, as it's just a debug UI, but the goal is to format the object instead of having it be one long string with no newlines.
I looked at trying to pretty-print the JSON in Dart and putting that inside <pre></pre> tags, as well as just dumping the Dart Map object to a string (again, with or without <pre></pre> tags), but I'm not getting where I want.
I even searched pub for something similar, such as a syntax highlighter that would output HTML, but didn't find anything obvious.
Any recommendations?
I think what you're looking for is:
1. Format your JSON so it's readable
2. Have syntax highlighting
For 1 - this can be done with JsonEncoder and its withIndent constructor (in dart:convert).
For 2 - you can use the JS lib called highlight.js pretty easily, by appending your formatted JSON into a marked-up div (see the highlight.js docs to see what I mean).
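To make the two steps concrete, here is the same idea sketched in Python for illustration (in Dart, JsonEncoder.withIndent plays the role of json.dumps; the sample data is made up):

import json
from html import escape

data = {'name': 'debug', 'items': [1, 2, 3], 'nested': {'ok': True}}

# Step 1: indent the JSON so it's readable
pretty = json.dumps(data, indent=2)

# Step 2: wrap it in the markup highlight.js expects, then call
# hljs.highlightAll() on the page once the element is in the DOM
snippet = '<pre><code class="language-json">%s</code></pre>' % escape(pretty)
print(snippet)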
I'm wondering if there's an easy way to parse an XML document in Rails without loading it all into RAM.
I've been using (depending on the XML) a combination of Nokogiri and the standard Hash.from_xml to pull the contents out of the XML.
That is all well and good when I'm dealing with (attempting to import) 100 or even 1,000 products. When, however, the XML doc has 16,000 or 40,000 products in it... well, my Dino starts to really, really feel it.
So I'm wondering if there's a way to walk the XML without pulling it all into memory.
Sorry I don't have code.... I'm attempting to avoid writing anything new. I mean who wants to write their own XML parser eh?
I came up with this...
reader = Nokogiri::XML::Reader(File.open('test.xml'))
reader.each do |node|
  # only element-start nodes, so the closing </Product> tags don't match
  if node.name == 'Product' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    hash = Hash.from_xml(node.outer_xml).values.first
    # work with the product hash here
  end
end
I watched my memory load while I ran this across a 60 meg file. It accomplished my goal. I'd love to see other answers. Perhaps something even lighter.
Because XML is hierarchical, the parser needs to know the whole structure to parse it correctly. You could feed well-formed fragments to Nokogiri::HTML::Document.parse, but you'd need to get those fragments out some other way.
Let's say you have a huge XML document:
<products>
  <product>stuff</product>
  <product>...</product>
  ... and so on
</products>
The actual products are enveloped within <products>; strip out the envelope part and then use string splitting to get an array of each <product> and its contents. Then parse each of these as an XML fragment. Just a thought.
This might help, although I've never used it: https://github.com/soulcutter/saxerator
I have generated an HTML table from my web application and saved the table in .xls format (in a word, I am generating an .xls sheet from my web application).
What other settings do I need for it to show up in table form?
You are not producing an XLS file; you are producing a malformed HTML file with a name that ends in .xls.
Indeed, you aren't even doing that, since there aren't files on the web (there are streams that may or may not end up in files).
Different versions of Open Office, with different settings, will differ in how they deal with stuff that is wrong. The version on one of your machines is saying "eh, this isn't XLS... oh! it's HTML with a table, I know what to do", while the other is getting as far as "eh, this isn't XLS, it's a bunch of text with strange less-than and greater-than characters all over the place, what do I do?".
What you want to do is produce an actual stream that Open Office and other spreadsheets can deal with. XLS is possible, but pretty hard. Go for CSV instead.
If your table was going to be:
<table>
  <tr>
    <th>1 heading</th><th>2 &amp; last heading</th>
  </tr>
  <tr>
    <td>1st cell</td><td>This is the "ultimate" cell</td>
  </tr>
</table>
Then it should become:
"1 heading","2 & last heading"
"1st cell","This is the ""ultimate"" cell"
In other words: newlines to indicate rows, commas to indicate cells, no HTML encoding, quotes around everything, and quotes in your actual content doubled up. (You don't always need quotes around your content, but it's never wrong, so that's simpler than working out when you do need them.)
Now, make your content type "text/csv".
You are now outputting a CSV stream that can be saved as a CSV file. Your spreadsheet software will have a much better idea of what to do with it (it may still ask about character encodings on opening, but the preview will show you a spreadsheet of data, not a bunch of HTML source all over the place).
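Most languages ship a CSV library that applies those quoting rules for you. A minimal sketch in Python, reproducing the example above:

import csv
import io

rows = [['1 heading', '2 & last heading'],
        ['1st cell', 'This is the "ultimate" cell']]

buf = io.StringIO()
# QUOTE_ALL quotes every cell and doubles any embedded quotes
csv.writer(buf, quoting=csv.QUOTE_ALL).writerows(rows)
print(buf.getvalue())
# "1 heading","2 & last heading"
# "1st cell","This is the ""ultimate"" cell"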
It's not really saving as a .xls file -- it appears to be saving as HTML, but with a .xls extension. How are you generating the .xls? On the server side you can provide a button to generate .xls directly (different methods depending on your server platform: in Perl there is the Spreadsheet::WriteExcel module that writes .xls directly, in Java there are JExcel (http://jexcelapi.sourceforge.net/) and POI (http://poi.apache.org/); other platforms will have their own methods).
Okay Subodh, if you want to generate .xls or .csv files, you can't just change the extension of the file and have it open up correctly in that program.
You have two options at this point; both involve creating the file with the data on the server and then sending it to the user to download.
.csv
CSV files are easier to generate on the server side. In a very basic way, you can think of them as regular text files with commas (not necessarily only commas) separating individual cells, which can be read by spreadsheet programs. For PHP there is an article Here that explains how to generate CSV files.
.xls
xls files are not as simple to generate as CSV files. On the server side you will need a solution to generate these. For PHP there is a resource Here.
Using xls over CSV has the obvious advantage that you can specify formatting and control the visual representation of your data.
Edit:
Upon looking closely at the image you posted, I can see what you are trying to do. If you just want to get that file to open correctly in a spreadsheet program, then don't save it as either CSV or xls.
hello.html
<table>
  <tr><td>Hi</td><td>Hi</td><td>Hi</td><td>Hi</td></tr>
  <tr><td>2</td><td>2</td><td>131</td><td>11312</td></tr>
</table>
Saved as an HTML file, it will open up correctly (as a proper table) in any spreadsheet program.
To narrow down the problem:
1) Are you opening the same .xls file on both machines?
- what version of OpenOffice is on Machine 1?
- what version of OpenOffice is on Machine 2?
2) How are you creating your .xls file?
- are you just using the response object to change the content-type, or some proprietary software?
- can you include a code sample?
3) Have you tried a pure HTML format?
I want to use the database of URLs present in the DMOZ ODP for my application (an array of URL strings, or a file containing the same). Is there any way of obtaining it, other than manual copy-paste?
EDIT:
Is there any script/code to parse the RDF file?
Take a look at http://rdf.dmoz.org/; you'll need to find a way to parse the RDF into your database.
I did this the other day using the odp2db scripts from Steve's Software. They're old, but the format hasn't changed significantly so they work fine.
I found I didn't need to do the iconv and xmlclean.pl steps suggested in the readme, just uncompressed the dumps and ran the structure2db.pl and content2db.pl scripts. You'll need to create the database tables manually (see the SQL at top of script for that) and modify the connection details in the scripts before you start.
With the mid-January 2009 dump I used, there are 756,962 categories and 4,436,796 websites. It took a while to run through them all, but not excessively long, though I did dispense with the site descriptions as I didn't need them. It may also be worth adding database indexes after creating the tables, to speed up access later. The raw structure and content files were 75MB and 300MB compressed, and 848MB and 2GB uncompressed, respectively.
I've actually done this in Java. I just used the SAX API to read through the RDF files. It was pretty straightforward. In my case I wanted to pull out every URL that was in a topic with "Weblogs" in the topic name.
Basically, what I did was implement an org.xml.sax.helpers.DefaultHandler.
Then, to set up the code, you do:
import java.io.FileInputStream;
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
InputSource is = new InputSource(new FileInputStream("filename.rdf"));
XMLReader r = XMLReaderFactory.createXMLReader();
r.setContentHandler(new MyHandlerClass()); // your DefaultHandler subclass
r.parse(is);
and that's pretty much it. In my handler class I had to implement:
startElement(String uri, String localName, String qName, Attributes attributes), where I had an if statement to see if it was an "ExternalPage" tag, in which case I went into another state to look for "topic", "Title" and "Description". I also implemented
characters(char[] ch, int start, int length), where I read in the topic, title, and description text, depending on which one had most recently been sent to startElement, and
endElement(String uri, String localName, String qName), where I checked which element was ending; if it was ExternalPage, that meant the end of the current page's record.
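For anyone who'd rather see it than read it, here is the same handler shape sketched with Python's xml.sax (the element names follow the ODP content dump; treat them as an assumption and check them against your copy of the file):

import xml.sax

class OdpHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.current = None  # tag whose text we're collecting
        self.url = None
        self.buf = []

    def startElement(self, name, attrs):
        if name == 'ExternalPage':
            self.url = attrs.get('about')  # the page URL
        elif name in ('topic', 'd:Title', 'd:Description'):
            self.current, self.buf = name, []

    def characters(self, content):
        if self.current:
            self.buf.append(content)

    def endElement(self, name):
        if name == self.current:
            text = ''.join(self.buf)
            # stash `text` against self.url here (print it, insert into a DB, ...)
            self.current = None
        elif name == 'ExternalPage':
            self.url = None  # this page's record is complete

xml.sax.parse('filename.rdf', OdpHandler())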
The whole thing was 80-90 lines of code for the basic parsing, so pretty easy to write. It was able to chew through the multi-gigabyte files in... I don't remember, maybe a minute or two? If you just want to query out some specific data, it might be easier to write the code to do that in your handler rather than trying to load it into a DB.
If you find a tool that works well, that's obviously better than writing your own code. But writing your own code isn't hard! RDF is just an XML format, and it's not nested or anything. A simple SAX parser is easily doable in a day or so.
You could always pay one of the corrupt editors there and they will help you out :)