I have experimented with Google services, Apps Script, and Integromat, and done some research, but cannot find the automated solution I am looking for.
Basically, via an Integromat (now called Make) workflow, once a client has completed an online survey, a unique Google Sheet is generated for them by duplicating a template G-Sheet and inserting their data, based on their answers to the survey.
The template, and each subsequent duplicate made for clients, contains graphs that are populated according to the user's input from the survey.
One of the "tabs" or "worksheets" in these G-Sheets contains a summary of the client's graphs, along with some text. This tab is neatly formatted to look like a report; all it requires is conversion into a PDF, so that it may be uploaded to our G-Drive and emailed to the client.
Hope that all makes sense and hope someone has a solution!
I'm looking to add [RentalCarReservation](https://schema.org/RentalCarReservation) to the consumer-side confirmation emails sent for a large, multinational rental agency but am running into two key questions:
Is there a corresponding Google Now tag that will correctly handle and parse vehicle rentals in particular at this time, or should we use a more generic order markup scheme until such time as there is support for this? It should be noted that none of our competitors seem to be using microdata at all, so there's no industry trend.
As asked earlier on this tag, what is the state of JSON-LD adoption for Google Now tags? By its nature the RentalCarReservation schema requires JSON rather than RDFa or similar.
RentalCarReservation is currently supported by Gmail markup; you can find more info and JSON/microdata examples here:
https://developers.google.com/gmail/markup/reference/rental-car#basic_reservation_confirmation
JSON is supported, as stated at the following link:
https://developers.google.com/gmail/markup/reference/formats/json-ld
To be able to send RentalCarReservation markup to your Gmail customers, you will have to register with Google and provide a sample email; more info here:
https://developers.google.com/gmail/markup/registering-with-google?hl=en
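For reference, here is a minimal JSON-LD sketch of such a payload (all values are placeholders, and the reference above has the authoritative field list); it goes inside a script tag of type application/ld+json in the email HTML:

{
  "@context": "http://schema.org",
  "@type": "RentalCarReservation",
  "reservationNumber": "ABC123",
  "reservationStatus": "http://schema.org/Confirmed",
  "underName": {
    "@type": "Person",
    "name": "John Smith"
  },
  "reservationFor": {
    "@type": "RentalCar",
    "name": "Economy Class Car",
    "model": "Honda Civic",
    "rentalCompany": {
      "@type": "Organization",
      "name": "Example Rentals"
    }
  },
  "pickupLocation": {
    "@type": "Place",
    "name": "Toronto Airport"
  },
  "pickupTime": "2027-06-01T10:00:00-04:00",
  "dropoffLocation": {
    "@type": "Place",
    "name": "Toronto Airport"
  },
  "dropoffTime": "2027-06-05T10:00:00-04:00"
}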
I just got a new task; I need to:
"Team have an Excel spreadsheet which contains basic data about subjects in a trial. A custom VB script in the Excel sheet allows them to generate BO reports. At a very high level, they are able to select a particular cell in the Pivot table and then execute a VB script via a custom button in the Excel sheet. The script generates an XML file containing metadata about those subjects and some additional parameters such as report title and saves the XML file to a watched directory. An external process picks it up and generates a fully-formatted BO report with complete details about the identified subjects from the drug safety database, and sends it to the user as a PDF in a new browser window.
So really this is not about doing data analysis within Spotfire (including R), but it is more to do with building an interface from Spotfire to the BO reporting environment. I believe this can be done with the Spotfire SDK, which uses IronPython as its scripting language, but I cannot say for certain because I have very little experience in that area."
Is there any chance of getting some high-level suggestions on what approach I should take to attain the requested functionality?
I also don't have any experience with TIBCO's technical support for Spotfire. Do you think I can ask them questions like this?
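Roughly, I imagine the Spotfire side would be an IronPython script attached to a button, something like this untested sketch (the table name, column name, and watched-directory path are all made up; Document is provided by the Spotfire scripting environment):

from Spotfire.Dxp.Data import DataValueCursor

# Resolve the subjects table and the rows the user has marked.
table = Document.Data.Tables["Subjects"]
marking = Document.ActiveMarkingSelectionReference
rows = marking.GetSelection(table).AsIndexSet()

# Collect the subject IDs from the marked rows.
cursor = DataValueCursor.CreateFormatted(table.Columns["SubjectID"])
ids = []
for row in table.GetRows(rows, cursor):
    ids.append(cursor.CurrentValue)

# Build the request XML and drop it into the watched directory,
# where the external process picks it up and renders the BO report.
parts = ['<report><title>Subject Listing</title>']
for subject_id in ids:
    parts.append('<subject id="%s"/>' % subject_id)
parts.append('</report>')

f = open(r'\\server\watched\bo_request.xml', 'w')
f.write(''.join(parts))
f.close()

Does that look like a sane starting point?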
Regards,
Jacek
Hi and thanks in advance for helping me with my question.
Is it possible to write a script that would extract the following information when provided with a Craigslist or Kijiji post, e.g. http://toronto.en.craigslist.ca/tor/atq/3346994296.html:
1. email address (default one provided by craigslist)
2. items in the post
3. address of poster
Items 1-3 above can be obtained manually, but I would like to just input a posting or ad ID and be able to extract this info automatically.
The short answer to this question is...
Yes, automatically extracting the info listed from web pages similar to the one provided as an example can be done by a relatively simple script.
In general, this activity [of automatically extracting info from web pages] is known as Web Scraping, a particular form of Data Scraping.
There are both off-the-shelf products that can be used with little or no programming (the desired pages, and the desired fields within them, are specified by way of configuration), as well as software libraries for scripting languages such as Python or Java which facilitate the parsing of HTML pages and, more generally, provide support for the various tasks associated with this activity.
Aside from technical considerations, you need to assess the etiquette and legality of performing this kind of scraping. Some data and sites may be explicitly copyright-protected, and it is always a good idea to perform big scraping jobs at low-traffic hours and to throttle the requests so as not to burden the host site unduly. Also, many sites provide an API or data dumps that supply the same info in a simpler and more controlled fashion.
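As an illustration, here is a minimal Python sketch using the requests and BeautifulSoup libraries (the CSS selectors are assumptions about the page structure and will need adjusting if the markup changes; note that the anonymized reply address is usually loaded behind the reply link rather than embedded in the static page):

import time
import requests
from bs4 import BeautifulSoup

def scrape_post(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("#titletextonly")   # assumed selector
    body = soup.select_one("#postingbody")      # assumed selector
    return {
        "title": title.get_text(strip=True) if title else None,
        "body": body.get_text(strip=True) if body else None,
    }

print(scrape_post("http://toronto.en.craigslist.ca/tor/atq/3346994296.html"))
time.sleep(2)  # throttle between requests, per the etiquette note above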
I want to know if there is a better way of extracting info from a web page than parsing the HTML for what I'm searching for, e.g. extracting the movie rating from imdb.com.
I'm currently using the Indy HTTP components to get the page, and I'm using StrUtils to parse the text, but this approach is limited.
I have found plain simple regexes to be highly intuitive and simple when dealing with well-built websites, and IMDB is a well-built website.
For example the movie rating on the IMDB's movie HTML page is in a <DIV> with class="star-box-giga-star". That's VERY easy to extract using a regular expression. The following regular expression will extract the movie rating from the raw HTML into capture group 1:
star-box-giga-star[^>]*>([^<]*)<
It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class name, then it looks for the > that terminates the DIV, and then captures everything until the following <. To create a new regex like this, you should use a web browser that allows inspecting elements (for example Chrome or Opera). With Chrome you can simply look at the web page, right-click on the element you want to capture, choose Inspect element, and then look around for easily identifiable elements that can be used to create a good regex. In this case the "star-box-giga-star" class is obviously easy to identify! You'll usually have no problem finding such identifiable elements on well-built websites, because good websites use CSS, and CSS requires IDs or classes to style elements properly.
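To illustrate, here is the same regex applied in a short Python sketch (working from a locally saved copy of the page, since the site's markup, and its tolerance for scripted requests, has likely changed since this class name was current):

import re

# movie.html is a saved copy of the film's IMDB page.
html = open("movie.html", encoding="utf-8").read()
m = re.search(r'star-box-giga-star[^>]*>([^<]*)<', html)
print(m.group(1).strip() if m else "rating not found")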
Processing an RSS feed is more convenient.
As of the time of posting, the only RSS feeds available on the site are:
Born on this Date
Died on this Date
Daily Poll
However, you can request that a new one be added by getting in touch with the help desk.
Resources on RSS feed processing:
Relevant post here on SO.
Super Object
Wikipedia.
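For illustration, consuming such a feed takes only a few lines; here is a Python sketch (the feed URL is a placeholder; substitute one of the feeds listed above):

import urllib.request
import xml.etree.ElementTree as ET

# Placeholder feed URL; substitute a real one from the list above.
feed = urllib.request.urlopen("http://example.com/daily-poll.rss").read()
root = ET.fromstring(feed)
for item in root.iter("item"):
    print(item.findtext("title"), "-", item.findtext("link"))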
When scraping websites, you cannot rely on the information staying available. IMDB may detect your scraping and attempt to block you, or they may change the page format frequently to make scraping more difficult.
Therefore, you should always try to use a supported API or RSS feed, or at least get permission from the website to aggregate their data, and ensure that you're abiding by their terms. Often, you will have to pay for this type of access. Scraping a website without permission may open you up to liability on a couple of legal fronts (denial of service and intellectual property).
Here's IMDB's statement:
You may not use data mining, robots, screen scraping, or similar online data gathering and extraction tools on our website.
To answer your question, the better way is to use the method provided by the website. For non-commercial use, and provided you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your copy frequently, and it is a better solution than scraping the site. You could even wrap your own web API around it. Ratings are available as a standalone table.
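For instance, with the dataset distribution current at the time of writing, the ratings table can be read in a few lines of Python (verify the URL and file layout against IMDB's dataset page, and note the file is large):

import csv
import gzip
import io
import urllib.request

# Ratings table from IMDb's downloadable datasets (non-commercial
# terms apply). Columns: tconst, averageRating, numVotes.
URL = "https://datasets.imdbws.com/title.ratings.tsv.gz"
raw = urllib.request.urlopen(URL).read()
with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["tconst"] == "tt0468569":  # one title ID, as an example
            print(row["averageRating"], row["numVotes"])
            break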
Use HTML Tidy to convert any HTML to valid XML, and then use an XML parser, perhaps with XPath, or develop your own code (which is what I do).
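A sketch of that pipeline in Python, assuming the tidy command-line tool is installed (the class name queried is just an example):

import subprocess
import xml.etree.ElementTree as ET

# Convert possibly malformed HTML into well-formed XHTML with tidy,
# then query the result like any other XML document.
html = open("page.html", "rb").read()
xhtml = subprocess.run(
    ["tidy", "-asxml", "-numeric", "-quiet", "--show-warnings", "no"],
    input=html, capture_output=True).stdout

root = ET.fromstring(xhtml)
ns = {"x": "http://www.w3.org/1999/xhtml"}
for div in root.iterfind(".//x:div[@class='star-box-giga-star']", ns):
    print((div.text or "").strip())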
All the answers posted cover your general question well. I usually follow a strategy similar to the one detailed by Cosmin, using WinInet and regexes for most of my web-extraction needs.
But let me add my two cents on the specific subquestion of extracting the IMDB rating. IMDBAPI.COM provides a query interface that returns JSON, which is very handy for this type of search.
So a very simple command-line program for getting an IMDB rating would be...
program imdbrating;

{$APPTYPE CONSOLE}

uses
  htmlutils; // provides HttpGet and UrlEncode

// Extracts the value of a top-level string field from the raw JSON.
// Quick and dirty: assumes the value is quoted and followed by a comma.
function ExtractJsonParm(Parm, h: string): string;
var
  r: Integer;
begin
  r := Pos('"' + Parm + '":', h);
  if r <> 0 then
    // Skip past '"Parm":"' and copy up to the closing quote.
    Result := Copy(h, r + Length(Parm) + 4,
      Pos(',', Copy(h, r + Length(Parm) + 4, Length(h))) - 2)
  else
    Result := 'N/A';
end;

var
  h: string;
begin
  // Query the API for the title given on the command line and
  // print the Rating field from the JSON response.
  h := HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
  Writeln(ExtractJsonParm('Rating', h));
end.
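For example, running imdbrating "True Grit" would print the value of the Rating field from the JSON response (or N/A if it cannot be found); this assumes the htmlutils unit providing HttpGet and UrlEncode is on your unit path.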
If the page you are crawling is valid XML, I use SimpleXML to extract info. It works pretty well.
Resource:
Download link.