I am building something that can more or less extract key information from an arbitrary web site. For example, if I crawled a McDonalds page and wanted to figure out programmatically the opening and closing times of McDonalds, what is an intelligent way to do it?
In a general case, maybe I also want to find out whether McDonalds sells chicken wings, or the address of McDonalds.
What I am thinking is that I will have a specific case for time, wings, and address and have code that is unique for each of those 3 cases.
But I am not sure how to approach this. I already have the sites crawled, with the HTML and related information parsed into JSON. My current approach is to find the title tag and check whether it contains keywords such as address or location. If it does, I look through the current page and identify chunks of content that resemble an address, such as content that names cities or countries, or content that contains the word St or Street.
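Roughly, the current approach looks like the sketch below. The JSON shape (a "title" string plus a list of text "chunks") and the names TITLE_KEYWORDS, STREET_PATTERN and find_address_chunks are made up for illustration, and the street pattern is a deliberately crude heuristic, not a real address parser.

```python
import re

# Hypothetical keyword list and page shape; adjust to the real crawl output.
TITLE_KEYWORDS = ("address", "location")

# Crude heuristic: a number, one to four words, then "St" or "Street".
STREET_PATTERN = re.compile(r"\b\d+\s+(?:\w+\s+){1,4}(?:St|Street)\b")


def find_address_chunks(page):
    """Return chunks that look like an address, if the title suggests one."""
    title = page.get("title", "").lower()
    if not any(keyword in title for keyword in TITLE_KEYWORDS):
        return []
    return [chunk for chunk in page.get("chunks", []) if STREET_PATTERN.search(chunk)]


# Made-up sample page, only to show the call.
sample_page = {
    "title": "Our Location",
    "chunks": ["Open daily from 6am to 11pm", "123 Main Street, Springfield"],
}
matches = find_address_chunks(sample_page)
```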
I am wondering whether there is a better approach to finding key data. I am looking for a good starting point, or a chance to bounce some ideas around. Pointers to good articles on the subject would be great as well.
Let me know if this is unclear.
Thanks for the help.
To parse such HTML pages you need knowledge of their structure. There is no general solution to this problem; each web page needs its own. However, a good approach would be to ensure the HTML is also valid XML and then use XPath to access elements at known positions. Maybe there is even an XPath-like solution for standard HTML (which is not always valid XML). This way you can define a set of XPaths for each page that give you the specific elements if they exist.
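As a minimal sketch of the "set of XPaths per page" idea, assuming the page is already well-formed XML: the site key, the paths, and the sample markup below are all made up, and Python's standard xml.etree.ElementTree only supports a limited subset of XPath, so real pages may need a fuller XPath library.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site XPath sets; every site gets its own mapping.
SITE_XPATHS = {
    "example-restaurant": {
        "hours": ".//div[@id='hours']/span",
        "address": ".//div[@id='address']/p",
        "wings": ".//div[@id='menu']/span",
    },
}


def extract_fields(xhtml, xpaths):
    """Look up each field's XPath; a field is None when its element is missing."""
    root = ET.fromstring(xhtml)
    result = {}
    for field, path in xpaths.items():
        node = root.find(path)
        result[field] = node.text.strip() if node is not None and node.text else None
    return result


# Made-up, well-formed sample page (note: it has no "menu" element).
PAGE = """<html><body>
<div id="hours"><span>6am - 11pm</span></div>
<div id="address"><p>123 Main St, Springfield</p></div>
</body></html>"""

fields = extract_fields(PAGE, SITE_XPATHS["example-restaurant"])
```

Fields whose element is absent come back as None rather than raising, which matches the "if they exist" part.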
I am trying to scrape a website for financials of Indian companies as a side project & put it in Google Sheets using XPATH
Link: https://ticker.finology.in/company/AFFLE
I am able to extract data from elements that have a specific id, like cash, net debt, etc. However, I am stuck on extracting data for labels like Sales Growth.
What I tried:
Copying the full XPath from the console, //*[@id="mainContent_updAddRatios"]/div[13]/p/span. This works, but I am reliant on the index of the div (13), which may change for different companies, so I am unable to automate it.
Please assist with a scalable solution
PS: I am a Product Manager with basic coding expertise, as I was a developer a few years ago.
At some point you need to "hardcode" something unless you have some other means of mapping the content of the page to your spreadsheet. In your example you appear to be targeting "Sales Growth" percentage. If you are not comfortable hardcoding the index of the div (13), you could identify it by the id of the "Sales Growth" label which is mainContent_lblSalesGrowthorCasa.
For example, change your
//*[@id="mainContent_updAddRatios"]/div[13]/p/span
to:
//*[@id = "mainContent_updAddRatios"]/div[.//span/@id = "mainContent_lblSalesGrowthorCasa"]/p/span
which selects the div based on it containing a span with id="mainContent_lblSalesGrowthorCasa". Ultimately, whether you "hardcode" the exact index of the div or "hardcode" the ids of the nodes, you are still embedding assumptions regarding the structure of the page.
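The same selection logic can be sketched outside of Sheets. The markup below is invented to mimic the assumed structure (it is not copied from the real page), and because Python's standard ElementTree does not support a nested-path predicate like the one above, the "div that contains the labelled span" test is done in a loop instead.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; the real page structure may differ.
SAMPLE = """<body>
<div id="mainContent_updAddRatios">
  <div><small>Cash</small><p><span id="mainContent_lblCash">12.5</span></p></div>
  <div><small>Sales Growth</small><p><span id="mainContent_lblSalesGrowthorCasa">45.2</span></p></div>
</div>
</body>"""


def value_by_span_id(xml_text, container_id, span_id):
    """Find the child div holding a span with span_id, then return its p/span text."""
    root = ET.fromstring(xml_text)
    container = root.find(f".//*[@id='{container_id}']")
    if container is None:
        return None
    for div in container.findall("div"):
        # Select the div by what it contains, not by its position.
        if div.find(f".//span[@id='{span_id}']") is not None:
            target = div.find("p/span")
            return target.text if target is not None else None
    return None


sales_growth = value_by_span_id(
    SAMPLE, "mainContent_updAddRatios", "mainContent_lblSalesGrowthorCasa"
)
```

Reordering the inner divs in the sample does not change the result, which is the point of avoiding the index.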
Thanks @david, that helped.
Two questions
What if the structure of the page changes? For example, if the website decided to remove the p tag, would my sheet fail? How do we avoid failure in such cases?
Also, since every id is unique, the probability of an id changing is lower than that of the index changing. Correct me if I am wrong.
What do we do when the elements don't have an id, like Profit Growth, RoE, RoCE, etc.?
I use the JSON data from a Google spreadsheet for 2 mobile applications (iOS and Android). The same information can be output as HTML or XML; in this case I am using HTML so that the information shown (from the spreadsheet) can be understood by everyone. The only logical way to do this without authentication (OAuth) is through public URL injects. Information about what I’m talking about can be found here. In order to understand what I’m asking, you have to actually click the links and see for yourself. I do not know what to “call” some of the things I’m asking about, as Google’s documentation is poor, through no fault of my own.
In my app I have a search feature that queries the spreadsheet (USING A URL REQUEST) along the lines of this,
https://docs.google.com/spreadsheets/d/1yyHaR2wihF8gLf40k1jrPfzTZ9uKWJKRmFSB519X8Bc/gviz/tq?tqx=out:html&tq=select+A,B,C,D,E+where+(B+contains"Cat")&gid=0
I select my columns (A, B, C, D, and E) and ask Google to return only the rows where column B contains the word Cat. Again, I’m stressing that this is done via a URL (inject being the proper term). I CANNOT use almost any of the functions/formulas that would normally work within a spreadsheet, like ArrayFormula or ImportRange. In fact, I only have access to 10 language clauses (read the link from before). I have a rather good knowledge of spreadsheets and databases, and although the URL method of getting information from them is similar, they are in NO way the same thing.
Now, I would like to point out this part within the URL
tq?tqx=out:html&tq=select+A,B,C,D,E+where+(B+contains"Cat")&gid=0
Type of output, HTML in this case
tqx=out:html
The start of query
&tq=
Select columns A-E
select+A,B,C,D,E
For returning specific information about Cat
where+(B+contains"Cat")
This is probably the most important part of my question. This is used for specifying what table (Tab) is being queried.
&gid=0
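For reference, the pieces above can be assembled in code. This is only a sketch of the current approach, which needs one URL (and so one request) per tab; the helper name build_query_url is made up for illustration, and the spreadsheet key and gid values are the ones from this question.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE = "https://docs.google.com/spreadsheets/d/{key}/gviz/tq"


def build_query_url(key, query, gid, out="html"):
    """Assemble tqx (output type), tq (the query) and gid (the tab) into one URL."""
    params = {"tqx": f"out:{out}", "tq": query, "gid": str(gid)}
    # quote_via=quote encodes spaces as %20; ':' and ',' are left readable.
    return BASE.format(key=key) + "?" + urlencode(params, safe=":,", quote_via=quote)


KEY = "1yyHaR2wihF8gLf40k1jrPfzTZ9uKWJKRmFSB519X8Bc"
QUERY = 'select A,B,C,D,E where (B contains "Cat")'

# One URL per tab: the gid is the only part that differs.
urls = [build_query_url(KEY, QUERY, gid) for gid in (0, 181437435)]
decoded = [parse_qs(urlsplit(url).query) for url in urls]
```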
If the gid is changed from gid=0 to gid=181437435, the data returned is from the spreadsheet’s second table. Instead of having to make 2 requests to search both tables, is there a way to do both in one request (like combining the 2)? <-- THIS IS WHAT I’M ASKING.
There is an AND clause that I have tried all over the URL:
select+A,B,C,D,E+where+(B+contains%20"Cat")&gid=181437435+AND+select+A,B,C,D,E+where+(B+contains%20"Cat")&gid=0
I have even flipped the gid around and put in other places but it seems to only go by the last one (gid) in the url, and no matter what is done only 1 table is returned. Grouping is allowed by the way. If that doesn’t clear my question up then let me know where you’re lost. Also I would have posted more URLs for easy access but I am kind of on this 2 URL maximum program.
If I understand your requirement, indeed it is, with syntax like this for example:
=ArrayFormula(QUERY({Sheet1!A1:C4;Sheet2!B1:D4},"select * order by Col1 desc"))
The ; stacks one array above the other (, for side by side).
My confusion is with "URL Query Language", as what is called Google Query Language here (there is even a tag for it, though IMO almost all those Qs belong on Web Applications, including this one, by my understanding!) is not confined to use with URLs.
In the example above the sheet references might be replaced with data import functions.
My app has thousands (maybe millions?) of models, let's call them Paragraphs, that contain text. The primary use of that text is to display it on a webpage. Sometimes that text is searched over for various other reasons too.
Some of the words in some of these paragraphs have associated metadata, like formatting, hyperlinks or other data-attributes that have meaning for my javascript in the front end.
Right now, I'm just sticking the ultimate html tags straight into the text, so it ends up being stored like this:
<strong>Jimmy</strong> is walking his dog which is <span class="something" data-metadata_id="2343">brown</span>.
This works well for the primary purpose of displaying the text, but is very ugly when I want to search over my text, or do other processing on it. Is there a better way? Is there a gem that handles this sort of thing?
It makes sense to put both versions in your database: a display one and an index one. Disk is cheap. Especially if you're using Solr or similar (highly recommended if you're doing string search), you can store (but not index) the HTML, and index (but not store) the plain text version, in two different fields of the same record.
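A minimal sketch of deriving the index version from the display version, using only Python's standard library. The field names display_html and index_text are placeholders, and how the two fields are then stored or indexed depends on your search engine's schema.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes and drops all tags and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


display_html = (
    '<strong>Jimmy</strong> is walking his dog which is '
    '<span class="something" data-metadata_id="2343">brown</span>.'
)

# Two fields on the same record: one for display, one for search.
record = {"display_html": display_html, "index_text": to_plain_text(display_html)}
```

Here index_text becomes "Jimmy is walking his dog which is brown." while display_html is kept untouched for rendering.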
Is there any penalty on Google rankings for using two pages with the same title and/or meta-description? If so, what is the penalty?
Both pages are on the same domain. One page URL is example.com/abcd and the other page URL is example.com/uvwxyz. The H1 header for both pages is the same, and both have the same meta-description.
I don't think Google would punish this.
Think of YouTube (which is owned by Google): The content of the title element follows this schema: [user-contributed video title] - YouTube. The meta-description consists of the user-contributed video description.
Now, there are probably thousands of videos with the very same title ("My cute cat") and some of them could even have the same description ("See my cute cat").
However, if a website consists of many (or even only) pages with the same title and meta-description, it gambles away the possibility of a better ranking. But when all these pages really have different content, it won't be punished.
The title and meta description are among the signals that search engines use to identify the topic of a page and rank it in search results. The title carries a lot of weight in search rankings, and both the title and the description are displayed in search results along with the URL.
As you have mentioned that the content of both pages is different, by having a duplicate title/description you are losing an opportunity to target different keywords for search rankings.
Having the same title/description makes it difficult for both users and search engines to identify and differentiate between the pages.
Even though there is no negative influence, you are losing out on an important signal (the title) that can help improve search ranking.
Some ref reading on title: http://www.searchenabler.com/blog/title-tag-optimization-tips-for-seo/
& duplicate content: http://www.searchenabler.com/blog/learn-seo-duplicate-content/
There is no punishment per se; it just isn't best practice. Why would you have duplicate meta information? Is the information the same on each page? Does it need to be?
I'm really stuck on this and don't even know where to start. I've got this .pdf, which has 2 columns: the first is, let's say, the member ID, and the second is the number of purchases they have made. Is it possible to match each ID to the correct number, graph this data, and afterwards make calculations with the matched data (calculate the top 5% of buyers, etc.)? Some numbers are not filled in, so that might be a problem. However, the PDFs are selectable, and when copied and pasted the text has the following structure: userid number userid number userid number userid number userid number.
EDIT: Making calculations with the data (calculating the top x%, ranks, etc.) will be the most important part.
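To make the matching and ranking concrete, here is a rough sketch that works on the copied text. It rests on a big assumption: that IDs can be told apart from purchase counts by their format (here, hypothetically, a letter U followed by digits). If IDs and counts are both plain numbers, a missing count cannot be detected this way. The sample text and the names ID_PATTERN, parse_pairs and top_fraction are invented.

```python
import math
import re

# Assumption: IDs look like "U123"; change this to match the real ID format.
ID_PATTERN = re.compile(r"U\d+")


def parse_pairs(text):
    """Map each user ID to its purchase count, or None when the count is missing."""
    purchases = {}
    current = None
    for token in text.split():
        if ID_PATTERN.fullmatch(token):
            current = token
            purchases[current] = None
        elif token.isdigit() and current is not None and purchases[current] is None:
            purchases[current] = int(token)
    return purchases


def top_fraction(purchases, fraction):
    """Return the top share of buyers by count, ignoring IDs with no count."""
    known = [(user, count) for user, count in purchases.items() if count is not None]
    ranked = sorted(known, key=lambda item: item[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


# Invented sample in the pasted "userid number userid number" layout; U2 has no count.
pasted = "U1 5 U2 U3 12 U4 7"
purchases = parse_pairs(pasted)
top_buyers = top_fraction(purchases, 0.05)
```

The ranked list is also what a plotting library would take as input for the graph.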
Any help, tips or links to tutorials that even might help me are appreciated!
Use prawn.
Here are some links to get you started:
Prawn github page
Using-Prawn-in-Rails
and, look for Prawn Templates.
EDIT:
Check out these links:
pdfescape
pdfedit
Also look out for a templating solution, if there is one.
Also look here, you might find something useful:
whats-the-best-way-to-programmatically-edit-a-pdf-in-ruby
As I have not dealt with such a problem myself, I can only help you this much. You have to do the hard work yourself.