Advanced Excel / Visual Basics for Website Parsing

Advanced Excel / Visual Basics for Website Parsing - parsing

I have links to 500 Wikipedia / Wikimedia Wikis, Talk Pages and history pages in an excel document that I'd like to parse to determine things like how many of the Wikis mention "advert" or "promotional" in the Talk page, how long the average Wiki is, how frequent edits are, etc.
I've figured out how to write a Visual Basics User Defined Function that will get the full HTML. Is there a plugin or some other way to get the text - as it appears on-screen - between two tags or identifiers, so I can pull out the information I need?
I am a business professional with very limited coding experience in comparison to a professional developer. But if you can point me in the right direction and to some good tutorials, I can learn. I'd also be interested in just paying someone a bit of money on the side if someone can help.

You can use XML Parser and Regex to search for text in an HTML document.
To get text as seen on in the browser, write a function to delete all tags. Although, it may not always be accurate as CSS and Javascript can alter what is visible on the screen.

Related

Automatic text / HTML annotation / highlighting

Nowadays there are softwares which, when provided a text or a html document page, will output a summary.
I wonder if there exist anything to automatically annotate (or at least highlight) the same documents.
The idea is to be able to keep the full text, but highlight the most meaningful parts (somehow like a summarisation tool would do I guess). And maybe provide additional inferred insights (?)
Also I would like to know how it works if it exists :) Would it really be very different of summarization, or is it just the same principles with a different "output format"?
I'm looking for something to annotate HTML documents, like AnnotatorJS is designed for, looking like this:

This is not a complete answer, but it can lead to what you want. The first suggestion is looking at GATE. It provides a great annotation framework and as long as you don't want to program anything for it, it is easy to use. The second thing is to search for summarization plug-ins for GATE. GATE has been around for such a long time that I am sure someone has already implemented a summarization plug-in for it.

How to store math equation/symbol and display them on the web?

I want to build a website where people can create tests with questions and answers . I want people can type in math equation/symbol and equations in a textbox or something like that, and they will be store in database, it'also displayed on the web like image.
My idea is i will store the text user input in latex syntax and store it, then display it using MathJax, i don't know it's possible or will have better way to do this.
And a problem is in user input will have normal text with "math text" (latex), so how can i separate them and only save the latex text? Please give me some idea or suggest the way to solve it, thanks.
p/s: i'm building this site in ruby on rails, i found the gem mathjax-rails but it seem not working.

Consider building off Gollum. It is the backend for the wiki system Github uses and works fairly well with LaTex equations (currently their is a very irritating bug with less/greater than symbols, but is documented and likely will be fixed in the next release). I start using it this summer to take notes in a math classes, an example of a full page of rendered LaTex equations notes is here here.
Note: You must be logged into Github in order for the equation to render.

TeX: Add blank page after every content page

I'm currently writing my bachelor thesis and my university wants a one sided print. The printing and binding will be done by a professional print company. They only accept two sided manuscripts.
Because of that I need to add a blank page after every page of content. I don't want to do this manually using \newpage or \clearpage because there are too many pages. Is there any, maybe low level, TeX command or package to do this? Or can you suggest another tool that does this without breaking the PDF?
Thanks for your help!

One option you might look into is to use a double sided layout that allows separate formatting for the even vs. odd pages: e.g. the book class allows this. Then you will need to define the even pages to be blank (presumably you don't want headers printed, or the page count to increment).
An alternative (if you can't get this to look correct for what you need) would be to do the layout in single sided (so that page numbering, etc. is all taken care of), then have a separate latex document which includes the pages, one at a time (pdfpages may be a good package to do this properly), and then insert blank pages (with no headers/etc.) in-between. This may end up being more work, but if you have trouble with formatting, it may be the easier way to go.

I suspect that you'd be better off doing this by manipulating the output PDF, rather than changing the LaTeX.
For example, if you're able to print to a file on your platform, there might be options in the print dialogue to tweak this. Your PDF viewer may be able to arrange this, if only by inserting blanks every second page. Or there may be a GUI or command-line tool to do the reshuffling for you.
Having said that, I've no specific recommendations for what tool you could use. A quick look around suggests strongly that the pstops tool might be able to do something along these lines, but that only helps if you're generating your PDF from postscript.
So no recipe, I'm afraid, but this'll probably be a better direction to look.
(or, meta answer: find a different print shop, or phone again and hope you get someone who gives you a different answer!)

Printing from web pages (reports especially) with greater precision

I am re-engineering a windows application to be ported to web. One area that has been worrying is 'printing'.
The application is data intensive and complex reports need to be generated. The erstwhile windows application takes advantage of printer APIs and extends sophisticated control to the users. It supports functions like page break, avoiding printing on printed parts of the sheet (like letterhead), choice of layouts and orientation, etc. Please note that these setting are not done only while printing, they are part of report definition sometimes.
From what I know, we cannot have this kind of control while printing web pages. I am in a process of identifying options at my disposal. While I prefer to first look into something that will help me print from raw web pages, following are other thoughts:
Since reports can also be exported to .xls & .pdf versions, let user download one and print directly. This however limits my solution to the area of application that have export feature.
Use Silverlight (4.0) for report layout definition and print. I think Silverlight 4.0 (in beta right now) provides adequate control over the printer. I have so far been avoiding the need of any RIA plugin.
Meticulously generate reports on web with fixed dimensions. I am not sure how far this will go.
Please share practices that can be applied easily in my scenario.

For reporting in the past on the web, using .NET, I like to generate PDF, Excel, Word or CSV files. I really like iTextSharp which allows for creating of PDF's.
Word can accept HTML, so that is usually quote easy. For more control you can get into the Word interops http://nishantrana.wordpress.com/2007/11/03/creating-word-document-using-c/, but they left me frustrated. Not for implementation, but I felt the clean up was poor.
CSV are great for raw data dumps and that is it.
For HTML, you can get nice control using a style sheet targeted to print media. There are just certain things you cannot control, like browser header and footer.

Flash also has better print controls than plain HTML, though you might not know it since these features are rarely used by flash developers. Almost everyone should have Flash installed these days, so it's not like Silverlight where there's a good chance of someone needing to install a plugin (doubly so for a beta version). I am not sure how the Flash printer APIs compare to Silverlight's printer APIs and if they give you the level of control you need, but their documentation is public so you can look into it.
Also I think exporting to PDF is a good idea. I don't see why you can't extend this to cover all places that would need to print a report. Basically instead of printing directly from the windows app running on their desktop, the same exact code runs on your server and generates a PDF that they can then print themselves.

I don't think you're going to have much luck trying to do it with raw HTML unfortunately. For one of our clients, we went with the "generate PDF" route and it worked out quite well. PDFs have the additional advantage that you don't have to print them out: you can just email them to the boss/accountant/whatever saving a bit of paper.

PDF is the way to go, if you want absolute control over printed output. As bonus, you can also provide the option to download PDFs in your application.
With HTML, you are at the mercy of user's browser settings for page size, margin and how page breaks will be handled.

Setting up help for a Delphi app

What's the best way to set up help (specifically HTML Help) for a Delphi application? I can see several options, all of which has disadvantages. Specifically:
I could set HelpContext in the forms designer wherever appropriate, but then I'm stuck having to track numbers instead of symbolic constants.
I could set HelpContext programmatically. Then I can use symbolic constants, but I'd have more code to keep up with, and I couldn't easily check the text DFMs to see which forms still need help.
I could set HelpKeyword, but since that does a keyword lookup (like Application.HelpKeyword) rather than a topic jump (like Application.HelpJump), I'd have to make sure that each of my help pages has a unique, non-changing, top-level keyword; this seems like extra work. (And there are HelpKeyword-related VCL bugs like this and this.)
I could set HelpKeyword, set an Application.OnHelp handler to convert HelpKeyword requests to HelpJump requests so that I can assign help by topic ID instead of keyword lookup, and add code such as my own help viewer (based on HelpScribble's code) that fixes the VCL bugs and lets HelpJump work with anchors. By this point, though, I feel like I'm working against the VCL rather than with it.
Which approach did you choose for your app?

When I first started researching how to do this several years ago, I first got the "All About help files in Borland Delphi" tutorial from: http://www.ec-software.com/support_tutorials.html
In that document, the section "Preparing a help file for context sensitive help" (which in my version of the document starts on page 28). It describes a nice numbering scheme you can use to organize your numbers into sections, e.g. Starting with 100000 for your main form and continuing with 101000 or 110000 for each secondary form, etc.
But then I wanted to use descriptive string IDs instead of numbers for my Help topics. I started using THelpRouter, which is part of EC Software's free Help Suite at: http://www.ec-software.com/downloads_delphi.html
But then I settled on a Help tool that supported string ID's directly for topics (I use Dr. Explain: http://www.drexplain.com/) so now I simply use HelpJump, e.g.:
Application.HelpJump('UGQuickStart');
I hope that helps.

We use symbolic constants. Yes, it is a bit more work, but it pays off. Especially because some of our dialogs are dynamically built and sometimes require different help IDs.

I create the help file, which gets the help topic ID, and then go around the forms and set their HelpContext values to them. Since the level of maintenance needed is very low - the form is unlikely to change help file context unless something major happens - this works just fine.

We use Help&Manual - its a wonderful tool, outputting almost any format of stuff you could want, doc, rtf, html, pdf - all from the same source. It will even read in (or paste from rtf (eg MSWord). It uses topic ID's (strings) which I just keep a list of and I manually put each one into a form (or class) as it suits me. Sounds difficult but trust me you'll spend far longer hating the wrong authouring tool. I spent years finding it!
Brian

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Advanced Excel / Visual Basics for Website Parsing - parsing

You can use XML Parser and Regex to search for text in an HTML document. To get text as seen on in the browser, write a function to delete all tags. Although, it may not always be accurate as CSS and Javascript can alter what is visible on the screen.

Related

Automatic text / HTML annotation / highlighting

How to store math equation/symbol and display them on the web?

TeX: Add blank page after every content page

Printing from web pages (reports especially) with greater precision

Setting up help for a Delphi app

Categories

Resources