Wikipedia pageviews analysis

I've been given the task of analyzing Wikipedia pageviews. This is my first project with this amount of data and I'm a bit lost. When I download the file from the link and unpack it, I can see that it has a table-like structure with rows looking like this:
1 | 2 | 3 | 4
en.m The_Beatles_in_the_United_States 2 0
I am struggling to work out what exactly each column contains. My guesses:
Column 1: language version and additional info (.m = mobile?)
Column 2: name of the article
My biggest concern is the last two columns. The last one contains only "0" values and I have no idea what it represents. I'd assume that the third one shows the number of views, but I'm not sure.
I'd be grateful if someone could help me to understand what exactly can be found in each column or recommend some reading on this subject. Thanks!

After more time spent on this, I've finally found a solution. I'm posting it in case someone has the same problem in the future. Wikipedia explains what can be found in the database. These explanations were painful to find, but you can access them here and here.
Based on that, you can see that rows have the following structure:
domain_code
page_title
count_views
total_response_size (no longer maintained)
Some explanations for each column:
Column 1:
Domain name of the request, abbreviated. (...) Domain_code now can
also be an abbreviation for mobile and zero domain names, in which
case .m or .zero is inserted as second part of the domain name (just
like with full domain name). E.g. 'en.m.v' stands for
"en.m.wikiversity.org".
Column 2:
For page-level files, it holds the title of the unnormalized part
after /wiki/ in the request Url (E.g.: Main_Page Berlin). For
project-level files, it is - .
Column 3:
The number of times this page has been viewed in the respective hour.
Column 4:
The total response size caused by the requests for this page in the
respective hour.
If I understand it correctly, response size was discontinued due to low accuracy. That's why there are only 0s:
The pagecounts and projectcounts files also include total response byte
sizes at their respective aggregation level, but this was dropped from
the pageviews and projectviews files because it wasn't very accurate.
Hope someone finds it useful.
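For anyone who wants to read these rows programmatically, here is a minimal Python sketch of the four-column layout described above. It assumes the fields are separated by single spaces, as in the sample row from the question; the names PageviewRow and parse_pageview_line are made up for this illustration and are not part of any Wikimedia tooling.

```python
from typing import NamedTuple


class PageviewRow(NamedTuple):
    domain_code: str          # abbreviated domain name, e.g. "en.m"
    page_title: str           # unnormalized title taken from the request URL
    count_views: int          # views of the page in the respective hour
    total_response_size: int  # no longer maintained, so 0 in these files


def parse_pageview_line(line: str) -> PageviewRow:
    """Split one line of an hourly pageviews file into its four columns.

    Assumes exactly four space-separated fields; a line that does not
    fit raises ValueError.
    """
    domain_code, page_title, count_views, response_size = line.rstrip("\n").split(" ")
    return PageviewRow(domain_code, page_title, int(count_views), int(response_size))


row = parse_pageview_line("en.m The_Beatles_in_the_United_States 2 0")
print(row.page_title, row.count_views)
```

With a full file you would probably want to catch ValueError per line rather than let one malformed row stop the whole run.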

Line format:
wiki code (subproject.project)
article title
monthly total (with interpolation when data is missing)
hourly counts
(From pagecounts-ez which is the same dataset just with less filtering.)
Apparently buggy though; it takes the first two parts of the domain name for wiki code, which does not work for mobile domains (which are in the form <language>.m.<project>.org).


How do I trace multiple XML elements with the same name and without any id?

I am trying to scrape a website for financials of Indian companies as a side project & put it in Google Sheets using XPATH
Link: https://ticker.finology.in/company/AFFLE
I am able to extract data from elements that have a specific id, like cash, net debt, etc. However, I am stuck on extracting data for labels like Sales Growth.
What I tried:
Copying the full XPath from the console, //*[@id="mainContent_updAddRatios"]/div[13]/p/span. This works, but I am reliant on the index of the div (13), which may change for different companies, so I am unable to automate it.
Please assist with a scalable solution.
PS: I am a Product Manager with basic coding expertise, as I was a developer a few years ago.
At some point you need to "hardcode" something unless you have some other means of mapping the content of the page to your spreadsheet. In your example you appear to be targeting "Sales Growth" percentage. If you are not comfortable hardcoding the index of the div (13), you could identify it by the id of the "Sales Growth" label which is mainContent_lblSalesGrowthorCasa.
For example, change your
//*[@id="mainContent_updAddRatios"]/div[13]/p/span
to:
//*[@id = "mainContent_updAddRatios"]/div[.//span/@id = "mainContent_lblSalesGrowthorCasa"]/p/span
which selects the div based on it containing a span with id="mainContent_lblSalesGrowthorCasa". Ultimately, whether you "hardcode" the exact index of the div or "hardcode" the ids of the nodes, you are still embedding assumptions regarding the structure of the page.
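To make the selection logic concrete outside of Sheets, here is a rough Python sketch of the same idea: pick the div by the label id it contains, then read p/span inside it. The markup is a simplified, made-up stand-in (the real page is almost certainly structured differently), mainContent_lblOther is a hypothetical id, and value_for_label is a name invented for this example. Python's built-in ElementTree only supports a subset of XPath, so the nested predicate is written as a loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; only the two ids quoted in this thread
# (mainContent_updAddRatios, mainContent_lblSalesGrowthorCasa) are real.
SAMPLE = """
<div id="mainContent_updAddRatios">
  <div><small><span id="mainContent_lblOther">Other</span></small><p><span>1.0</span></p></div>
  <div><small><span id="mainContent_lblSalesGrowthorCasa">Sales Growth</span></small><p><span>12.5</span></p></div>
</div>
"""


def value_for_label(container, label_id):
    """Return the text of p/span in the child div that holds the labelled span."""
    for div in container.findall("div"):
        # Same test as the predicate div[.//span/@id = "..."]
        if div.find(f".//span[@id='{label_id}']") is not None:
            value = div.find("p/span")
            # Return None instead of raising if the p/span layout changes.
            return value.text if value is not None else None
    return None


container = ET.fromstring(SAMPLE)
print(value_for_label(container, "mainContent_lblSalesGrowthorCasa"))
```

The lookup no longer depends on the div being the 13th child, but it still assumes the label id and the p/span layout, which is the point made above about embedded assumptions.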
Thanks @david, that helped.
Two questions:
What if the structure of the page changes? Example: if the website decided to remove the p tag, would my sheet fail? How do we avoid failure in such cases?
Also, since every id is unique, the probability of an id changing is lower than that of the index changing. Correct me if I am wrong.
What do we do when the elements don't have an id, like Profit Growth, RoE, RoCE, etc.?

Data seems to be missing in BigQuery SEC Filing Dataset

I was pleased recently to discover that BigQuery hosts a dataset of SEC filings. I am unable to find the actual text of the filings in the dataset, however! This seems so obvious I must be missing something.
As an example, the 2018 Microsoft 10-K filing on the SEC website itself can be seen to have the 10-K text in which Item 7 includes the phrase "Management’s Discussion and Analysis of Financial Condition and Results". I searched for this phrase in the Dataset.
First, the following query should pull all the text from this filing:
SELECT *
FROM `bigquery-public-data.sec_quarterly_financials.txt`
WHERE submission_number="0001564590-18-019062"
Searching the results of this query for the above phrase finds nothing, however.
A second attempt based on another StackOverflow answer gave me this, in which I try to search the entire dataset for that phrase in case it's stored in a different table:
SELECT *
FROM `bigquery-public-data.sec_quarterly_financials.*` t
WHERE REGEXP_CONTAINS(LOWER(TO_JSON_STRING(t)), r'/^discussion and analysis of financial condition$/')
No result!
I can clearly find the same SEC filing, and yet content within it seems to be missing. I've searched other phrases and sections too, the text seems not to be there. Yet, based on all the Google documentation I think it should be. What am I missing?
Alternatively, does anyone know of another source for parsing sections of SEC 10-K filings or the like? That would be useful too, and you could also answer this question with it.
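Whether or not the filing text is in the dataset at all, the pattern in the second query is worth ruling out first. The slashes are treated as literal characters rather than delimiters, and ^...$ anchors the phrase to the whole string, so it cannot match a phrase sitting inside a longer serialized row. A small sketch using Python's re module as a stand-in for BigQuery's RE2 syntax (these particular constructs behave the same way in both, as far as I know); the sample string is invented for illustration and not taken from the dataset.

```python
import re

# Illustrative stand-in for one lower-cased, serialized row.
sample = "item 7. management's discussion and analysis of financial condition and results"

# Pattern as written in the query: literal slashes plus whole-string anchors.
anchored = r'/^discussion and analysis of financial condition$/'
# The same phrase as a plain substring pattern.
plain = r'discussion and analysis of financial condition'

print(bool(re.search(anchored, sample)))  # False
print(bool(re.search(plain, sample)))     # True
```

If the plain pattern still returns nothing in BigQuery, that would point to the text not being stored in these tables rather than to the regex.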

Creating Dynamic Sheet Cell Reference List for pulling numbers to SUM

I've been working on building a data analysis sheet, which is quite verbose at the moment and a bit more complicated than it should be as I've been trying to figure this out. Please note that I work with student data in a school.
Basically, I have two sets of input data:
Data imported from a CSV file that includes test data and codes for Common Core Standards and the questions tied to those standards as a whole class summary
Data imported from a CSV file that includes individual scores by question
I am looking to construct 2 views:
A view that collates and displays data of individual standards per student that includes a dropdown to change the standard allowing a teacher to see class performance by standard in a broad view. The drop-down is populated dynamically from the input data (so staff could eventually dump data and go directly to reports)
A view that collates and displays data of individual students broken down by performance on each standard, allowing a teacher to see the broader spectrum for each student. The student drop-down is populated from source list 2.
I have been able to build the first view, but am struggling with the second. I've been able to separate the question codes and develop strings of cell references to the scoring data, including a dynamic reference to the row the selected student's score data appears on in the second source set from above.
I tried to pass through an indirect() formula into a sum() so as to process for a mean evaluation, and have encountered errors. I think SUM() doesn't process comma-separated cell reference lists from Indirect() [or in general] or there is something that I am missing to help parse it. Here is the formula I have tried:
=Sum(vlookup(D7,CCCodeManip!$A:$C,3,false))
CCCodeManip!C:C includes the created text (based on the dynamic standards and question codes, etc), here's an example of what would be found there:
'M-ADI'!M17, 'M-ADI'!N17, 'M-ADI'!O17, 'M-ADI'!P17, 'M-ADI'!Q17, 'M-ADI'!R17, 'M-ADI'!J17
I need these to be dynamic so that teachers can input different sets of standards, question, and student data and the sheet automatically collates and reports it in uniform ways (with an upward bound of 20 standards as I currently have it built)
Here is a link to the sheet I built, with names and ID anonymized. There's a CRAP TON of sub-tabs, and that's really just being able to split apart and re-combine data neatly without things error-ing out due to data overlapping, aside from a few different attempts and different approaches to parse the cell reference strings.
The first two tabs are the current status of the data views. I plan to hide a bunch of the functional stuff that is there to help pull data accurately.
The 3rd and 4th tab are the source data sets. 5th is a modified version of source data that allows me to reference things better, and I've tried to arrange the sheets most relevant towards the front of the set.
https://docs.google.com/spreadsheets/d/1fR_2n60lenxkvjZSzp2VDGyTUO6l-3wzwaV4P-IQ_5Y/edit?usp=sharing
Does someone have a different approach? I am aware that I might be as far as I can go with this and perhaps should consider scripts. My coding experience is a bit out of date and my strength is more with formulas, but I can dig into things with some direction, if anyone can help.
Ok so I noticed something.
It seems the failure is in the indirect reference:
=indirect(CCCodeManip!C3)
The string I am trying to parse via indirect is going to be generated into something like this, dynamic from reference to other data:
'M-ADI'!M17, 'M-ADI'!N17, 'M-ADI'!O17, 'M-ADI'!P17, 'M-ADI'!Q17, 'M-ADI'!R17, 'M-ADI'!J17
INDIRECT returns a #REF error saying that the above string is not a cell reference.
Can someone give me a clue as to what is causing this? I am going to dig into the docs on Indirect() from google and will post anything that I find.
Perhaps it is that indirect() can't handle lists, but only specific references and arrays, which may require me to build a sheet to do the SUM formula on for each question set (?)
So I think I figured it out, but I ended up parsing the data differently: I do the sum based on individual cell references and a separate sum formula, bypassing the need to do it all at once. It just makes my sheets a lot dirtier! I am eventually going to see if code could do it better if I need to, but this is closed for now.
Basically, I did individual cell references to recall scores in a row, then used a separate SUM formula, and created references / structures to be able to pull those sum() results. Achieves the same end, but with extra crap on the sheet.
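In case someone does want to try the code route later, the underlying idea (split the comma-separated list, resolve each reference on its own, then add the values) can be sketched in Python as below. This is only an illustration of the logic, not Google Sheets or Apps Script code; the CELLS dictionary and both function names are made up, and the cell values are placeholders.

```python
# Hypothetical cell values standing in for row 17 of the 'M-ADI' sheet.
CELLS = {
    ("M-ADI", "M17"): 1,
    ("M-ADI", "N17"): 0,
    ("M-ADI", "O17"): 1,
}


def parse_reference(ref):
    """Split a reference like "'M-ADI'!M17" into ("M-ADI", "M17")."""
    sheet, cell = ref.strip().rsplit("!", 1)
    return sheet.strip("'"), cell


def sum_reference_list(ref_list, cells):
    """Resolve each reference in a comma-separated list and add the values."""
    return sum(cells[parse_reference(ref)] for ref in ref_list.split(","))


print(sum_reference_list("'M-ADI'!M17, 'M-ADI'!N17, 'M-ADI'!O17", CELLS))
```

The point is that each reference has to be resolved on its own, which matches the workaround of individual cell references followed by a separate SUM.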

SPSS Frequency Plot Complication

I am having a hard time generating precisely the frequency table I am looking for using SPSS.
The data in question: cases (n = ~800) with categorical variables DX_n (n = 1-15), each containing ICD9 codes, many of which are the same code. I would like to create a frequency table that groups the DX_n variables such that I can view frequency of every diagnosis in this sample of cases.
The next step is to test the hypothesis that the clustering of diagnoses in this sample is different than that of another. If you have any advice as to how to test this, that would be really appreciated as well!
Thanks!
Edit: My attempts:
1) Analyze -> Descriptive Statistics -> Frequencies; then add variables DX_n (1-15) and display frequency charts. The output is frequencies of each ICD9 code per DX_n variable (so 15 tables are generated - I'm hoping to just have one grouped table).
2) I tried adjusting the output format to organize by variable and also to compare variables but neither option gives the output I'm looking for.
I think what you are looking for is CTABLES. It can do parallel columns of frequencies, and it includes a column proportions test that can see whether the distributions differ.
Thank you, JKP! You set me on exactly the right track. I'm not sure how I overlooked that menu. Just to clarify in case anyone else comes along needing to figure this out:
Group diagnosis variables into a multiple response set using Analyze > Custom Tables > Multiple Response Sets. Code the variables as categories.
http:// i.imgur.com/ipE9suf.png
Create a custom table with your new multiple response set as a row and the subsets to compare as columns. I set summary statistics to compute from rows and added the column n% column (sorted descending).
http:// i.imgur.com/hptIkfh.png
Under test statistics, include a column proportions z-test as JKP suggested.
http:// i.imgur.com/LYI6ZRl.png
Behold, your results:
http:// i.imgur.com/LgkBA8X.png
Thanks again, and best of luck to anyone else who runs across this.
-GCH
p.s. Sorry everyone, I was going to post images but don't have enough reputation points yet. Images detailing the steps in the GUI can be found at the obfuscated links above.
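For readers without the Custom Tables module, the pooling step (treating DX_1 to DX_15 as one multiple response set and counting codes per comparison group) can be sketched outside SPSS. The Python below is a rough illustration only: the cases and codes are placeholders, pooled_frequencies is a made-up name, and counting each code at most once per case is an assumption you may want to change. It does not reproduce the column proportions z-test.

```python
from collections import Counter

# Hypothetical cases: a comparison group plus the DX_n slots
# (None where a slot is empty). Codes are placeholders.
cases = [
    {"group": "A", "dx": ["250.00", "401.9", None]},
    {"group": "A", "dx": ["401.9", None, None]},
    {"group": "B", "dx": ["250.00", "250.00", "486"]},
]


def pooled_frequencies(cases):
    """Count codes per group, pooled over all DX_n slots.

    Each code is counted at most once per case (an assumption).
    """
    counts = {}
    for case in cases:
        codes = {code for code in case["dx"] if code is not None}
        counts.setdefault(case["group"], Counter()).update(codes)
    return counts


freq = pooled_frequencies(cases)
for group, counter in sorted(freq.items()):
    print(group, dict(counter))
```

Dividing each count by the number of cases in its group would give a figure comparable to the column n% used in the custom table.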

Calculating and graphing data from a .pdf with ruby (+ rails)

I'm really stuck on this. I don't even know where to start. I've got this .pdf, which has 2 columns: the first one is, let's say, the member ID. The second one is the number of purchases they have made. Is it possible to match the ID to the correct number and graph this data, and afterwards make calculations with the acquired and matched data (calculate the top 5% of buyers, etc.)? Some numbers are not filled in, so that might be a problem. However, the PDFs are selectable and, if copied and pasted, will have the following structure: userid number userid number userid number userid number userid number.
EDIT: Making calculations with the data (calculating the top x%, ranks, etc.) will be the most important part.
Any help, tips or links to tutorials that even might help me are appreciated!
Use prawn.
Here are some links to get you started:
Prawn github page
Using-Prawn-in-Rails
and, look for Prawn Templates.
EDIT:
Check out these links:
pdfescape
pdfedit
And do look out for a templating solution, if there is one.
Also look here, you might find something useful:
whats-the-best-way-to-programmatically-edit-a-pdf-in-ruby
As I have not dealt with such a problem myself, I can only help you this much. You have to do the hard work yourself.
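For the calculation part of the question, once the text has been copied out of the PDF, the matching and ranking is fairly simple. A minimal sketch, shown in Python for illustration (the same logic ports directly to Ruby): it assumes member IDs contain at least one non-digit character, so that a missing purchase count can be detected, and that counts are plain integers. The sample text and the function names are made up.

```python
import math

# Hypothetical pasted text; "m002" has no purchase count filled in.
pasted = "m001 12 m002 m003 7 m004 30 m005 1"


def parse_pairs(text):
    """Map each member ID to its purchase count (None when not filled in)."""
    purchases = {}
    current = None
    for token in text.split():
        if token.isdigit():
            if current is not None:
                purchases[current] = int(token)
                current = None
            # A number with no preceding ID is ignored.
        else:
            current = token
            purchases[current] = None
    return purchases


def top_fraction(purchases, fraction):
    """Return the IDs in the top `fraction` of buyers with a known count."""
    known = {k: v for k, v in purchases.items() if v is not None}
    ranked = sorted(known, key=known.get, reverse=True)
    return ranked[:max(1, math.ceil(len(ranked) * fraction))]


data = parse_pairs(pasted)
print(top_fraction(data, 0.05))
```

Members without a count are kept in the parsed data as None but left out of the ranking; whether they should count as zero purchases instead is a decision for you.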
