Calculating and graphing data from a .pdf with ruby (+ rails) - ruby-on-rails

I'm really stuck on this and don't even know where to start. I have a .pdf with 2 columns: the first is, let's say, the member ID, and the second is the number of purchases that member has made. Is it possible to match each ID to the correct number, graph this data, and afterwards make calculations with the acquired and matched data (calculate the top 5% of buyers etc.)? Some numbers are not filled in, so that might be a problem. However, the PDFs are selectable, and if copy-pasted the text has the following structure: userid number userid number userid number userid number userid number.
EDIT: Making calculations with the data (calculating the top x%, ranks etc.) will be the most important part.
Any help, tips, or links to tutorials that might help are appreciated!

Use Prawn.
Here are some links to get you started:
Prawn github page
Using-Prawn-in-Rails
and look for Prawn Templates.
EDIT:
Check out these links:
pdfescape
pdfedit
and do look out for a templating solution, if there is one.
Also look here, you might find something useful:
whats-the-best-way-to-programmatically-edit-a-pdf-in-ruby
As I have not dealt with such a problem myself, I can only help you this much. You have to do the hard work yourself.
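One way to get going once the text is out of the PDF: since the copy-pasted text has the flat "userid number userid number ..." shape you describe, plain Ruby can pair the tokens up and do the ranking. A minimal sketch, assuming the text has already been extracted (by copy-paste, or a gem such as pdf-reader) and that IDs and counts strictly alternate; the sample string is just a stand-in:

# Stand-in for the text extracted from the PDF.
text = "A123 17 B456 3 C789 42 D012 9"

# Pair tokens into { id => purchases }. If some numbers are missing,
# the alternation breaks, so real input may need a sanity check here.
purchases = text.split.each_slice(2).map { |id, n| [id, n.to_i] }.to_h

# Rank buyers by purchase count, highest first.
ranked = purchases.sort_by { |_, n| -n }

# Top 5% of buyers (at least one).
top_count = [(ranked.size * 0.05).ceil, 1].max
top_buyers = ranked.first(top_count)
puts top_buyers.inspect

For the graphing part, the matched hash can then be fed to a charting gem (e.g. gruff, or chartkick in Rails).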

Related

How do I trace multiple XML elements with the same name & without any id?

I am trying to scrape a website for financials of Indian companies as a side project & put them in Google Sheets using XPath.
Link: https://ticker.finology.in/company/AFFLE
I am able to extract data from elements that have a specific id, like cash, net debt, etc.; however, I am stuck on extracting data for labels like Sales Growth.
What I tried:
Copying the full XPath from the console: //*[@id="mainContent_updAddRatios"]/div[13]/p/span. This works; however, I am reliant on the index of the div (13), and that may change for different companies, so I am unable to automate it.
Please assist with a scalable solution.
PS: I am a Product Manager with basic coding expertise, as I was a developer a few years ago.
At some point you need to "hardcode" something, unless you have some other means of mapping the content of the page to your spreadsheet. In your example you appear to be targeting the "Sales Growth" percentage. If you are not comfortable hardcoding the index of the div (13), you could identify it by the id of the "Sales Growth" label, which is mainContent_lblSalesGrowthorCasa.
For example, change your
//*[@id="mainContent_updAddRatios"]/div[13]/p/span
to:
//*[@id = "mainContent_updAddRatios"]/div[.//span/@id = "mainContent_lblSalesGrowthorCasa"]/p/span
which selects the div based on it containing a span with id="mainContent_lblSalesGrowthorCasa". Ultimately, whether you "hardcode" the exact index of the div or "hardcode" the ids of the nodes, you are still embedding assumptions about the structure of the page.
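Since this is going into Google Sheets, the whole expression would plug straight into IMPORTXML. A sketch, assuming the ratios are present in the HTML the server returns (IMPORTXML cannot run JavaScript):
=IMPORTXML("https://ticker.finology.in/company/AFFLE", "//*[@id='mainContent_updAddRatios']/div[.//span/@id='mainContent_lblSalesGrowthorCasa']/p/span")
Note the single quotes inside the XPath, so they do not collide with the double quotes delimiting the formula's string argument.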
Thanks @david, that helped.
Two questions
What if the structure of the page changes? For example, if the website decided to remove the p tag, would my sheet fail? How do we avoid failure in such cases?
Also, since every id is unique, the probability of it changing is lower than that of the index changing. Correct me if I am wrong?
What do we do when the elements don't have an id, like Profit Growth, RoE, RoCE etc.?

Google Sheets import multiple HTML table images

Summary
I'm looking to import a data table from a website that does not appear to have an API. The table is broken down into various images and text. The goal is to have all of the content available in a table to then reference from other sheets.
Issue
When I pull in the data, I get some of the text, none of the other images, and a reference to another table. I looked up some options, but none of them yielded anything but blank cells.
I also tried to use the =IMAGE() formula with a direct link to the images' URLs, but there is a portion of the URL that is specific to the unit's release date and, as such, too dynamic to account for.
Google Sheets formula
=IMPORTHTML("https://gamepress.gg/pokemonmasters/database/sync-pair-list","table",3)
Unfortunately, without an API it is going to be difficult to achieve what you are aiming for here. These are the main reasons why:
PROBLEMS AND WORKAROUNDS
This table has nested tables, which therefore need to be accessed separately. If you take a look at:
=IMPORTHTML("https://gamepress.gg/pokemonmasters/database/sync-pair-list","table",4)
you will see that table 4 of this HTML page is the stats of one character from the main table. If you go for 5 or 6, you will realise that the nested tables are not even numerically ordered, and that you cannot access them through the main table (i.e. mainTable[0].nestedTable). A laborious approach is to go one by one, finding each character's corresponding stat table and placing it next to them. For this I recommend extracting only the name field of the main table, so you can align each stat table with its character. You can do this using:
=INDEX(IMPORTHTML("https://gamepress.gg/pokemonmasters/database/sync-pair-list","table",3),0,1)
You can find out more about INDEX here.
IMPORTHTML cannot access images or links, so it will be very difficult to get the images in the last columns. A way to solve this is, as you mentioned, using IMAGE with the image's URL, like this:
=IMAGE("https://gamepress.gg/pokemonmasters/sites/pokemonmasters/files/styles/30x30/public/2019-07/Electric.png?itok=fkRfkrFX")
You can find more info about inserting images here.
CONCLUSION
To sum up, there is no easy way to solve this problem. The closest you can get is by:
Importing the name column.
Figuring out which stat tables belong to which character and placing them next to their name.
Getting the image URL of each weakness and type and adding it to each character.
I am sorry this site does not have an API to make things smooth. Good luck with your project, and let me know if you need anything else or if anything was unclear.
Here you can find more information about IMPORTHTML

The data in my Google Sheet is not appearing in the correct cell

I am trying to create a spreadsheet to simplify our account returns. I am using a variety of named ranges to make life easier. I have created a test sheet which automatically copies the inputted cost to its appropriate category.
I am having a strange issue where the cell I expect to see the data in is incorrect. I am wondering if the 2 data validation lists I have created could be causing the issue. I had originally copy/pasted from an old sheet, but in case some strange formatting had been carried over and was causing the issue, I have since manually entered all the data to rule this out.
https://docs.google.com/spreadsheets/d/1KC8FsVNQZfWtey5TvPDxCxvDhRrFVHdsbDJWmZu73wg/edit#gid=0
This is the test sheet in question. The costs for entries Test 9 & Test 10 should be in the Info Books and Stationary sections respectively, but they are ending up in the wrong places.
I am not a spreadsheet expert, so I apologise if I am missing something blatantly obvious. A friend advised me to ask on Stack Overflow after many hours lost to this problem.
Thanks in advance for any help you may be able to give.
Use the fourth parameter in VLOOKUP (is_sorted) set to false (or zero) to force an exact match; with the default approximate match, VLOOKUP assumes the lookup column is sorted and can return the wrong row:
=IF(VLOOKUP(companyOfPurchase,suppliersAndCategories,2, 0) = typeOfPurchase,totalOfReceipt,"")
and see if that works?

Wikipedia pageviews analysis

I've been challenged with Wikipedia pageviews analysis. For me this is the first project with such an amount of data, and I'm a bit lost. When I download the file from the link and unpack it, I can see that it has a table-like structure, with rows looking like this:
1 | 2 | 3 | 4
en.m The_Beatles_in_the_United_States 2 0
I'm struggling to find out what exactly can be found in each column. My guesses:
language version and additional info (.m = mobile?)
name of the article
My biggest concern is with the last two columns. The last one has only "0" values in it, and I have no idea what it represents. I'd assume then that the third one shows the number of views, but I'm not sure.
I'd be grateful if someone could help me to understand what exactly can be found in each column or recommend some reading on this subject. Thanks!
After more time spent on this, I've finally found the solution. I'm posting it in case someone has the same problem in the future. Wikipedia explains what can be found in the database. These explanations were painful to find, but you can access them here and here.
Based on that you can see that rows have following structure:
domain code
page_title
count_views
total_response_size (no longer maintained)
Some explanations for each column:
Column 1:
Domain name of the request, abbreviated. (...) Domain_code now can also be an abbreviation for mobile and zero domain names, in which case .m or .zero is inserted as the second part of the domain name (just like with the full domain name). E.g. 'en.m.v' stands for "en.m.wikiversity.org".
Column 2:
For page-level files, it holds the unnormalized title part after /wiki/ in the request URL (e.g. Main_Page, Berlin). For project-level files, it is -.
Column 3:
The number of times this page has been viewed in the respective hour.
Column 4:
The total response size caused by the requests for this page in the respective hour. If I understand it correctly, response size was discontinued due to low accuracy; that's why there are only 0s. The pagecounts and projectcounts files also include total response byte sizes at their respective aggregation level, but this was dropped from the pageviews and projectviews files because it wasn't very accurate.
Hope someone finds it useful.
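If you end up processing these files with code, the row format above is easy to split. A minimal Ruby sketch, using the example row from the question and the field names from the documentation quoted above:

line = "en.m The_Beatles_in_the_United_States 2 0"

# Split into the four documented fields.
domain_code, page_title, count_views, total_response_size = line.split(" ", 4)

puts "#{domain_code} #{page_title}: #{count_views.to_i} views"
# total_response_size is no longer maintained, hence always 0.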
Line format:
wiki code (subproject.project)
article title
monthly total (with interpolation when data is missing)
hourly counts
(From pagecounts-ez, which is the same dataset, just with less filtering.)
Apparently it is buggy, though: it takes the first two parts of the domain name as the wiki code, which does not work for mobile domains (which are of the form <language>.m.<project>.org).

How can I smartly extract information from an HTML page?

I am building something that can more or less extract key information from an arbitrary web site. For example, if I crawled a McDonald's page and wanted to figure out programmatically the opening and closing times of McDonald's, what is an intelligent way to do it?
In a general case, maybe I also want to find out whether McDonald's sells chicken wings, or the address of McDonald's.
What I am thinking is that I will have a specific case for time, wings, and address and have code that is unique for each of those 3 cases.
But I am not sure how I can approach this. I have the sites crawled, and the HTML and related information parsed into JSON already. My current approach is something like finding the title tag and checking whether it contains key words like address or location. If it does, I look through the current page and identify chunks of content that resemble an address, such as content that mentions cities or countries, or content that has the word St or Street in it.
I am wondering if there is a better approach for finding key data; I am looking for a nicer starting point, and to bounce some ideas around. Pointers to good articles on this would be great as well.
Let me know if this is unclear.
Thanks for the help.
In order to parse such HTML pages you have to have knowledge of their structure. There is no general solution to this problem; each webpage needs its own solution. However, a good approach is to ensure the HTML is also valid XML, and then use XPath to access elements at known positions. Maybe there is even an XPath-like solution for standard HTML (which is not always valid XML). This way you can define a set of XPaths for each page which give you the specific elements, if they exist.
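As an illustration, here is a minimal sketch in Ruby with the Nokogiri gem, which runs XPath queries against real-world HTML without requiring it to be valid XML. The URL and the XPath expressions are hypothetical placeholders, not a general solution:

require 'nokogiri'
require 'open-uri'

# Fetch and parse a (hypothetical) page.
html = URI.open("https://example.com/some-restaurant-page").read
doc = Nokogiri::HTML(html)

# Heuristic search: elements whose text mentions opening hours.
doc.xpath("//*[contains(text(), 'Opening hours')]").each do |node|
  puts node.text.strip
end

# Or a known position, once you have studied the page's structure.
address = doc.at_xpath("//address")
puts address.text.strip if address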
