Data seems to be missing in Bigquery SEC Filing Dataset

I was pleased to discover recently that BigQuery hosts a dataset of SEC filings. However, I am unable to find the actual text of the filings in the dataset. This seems so obvious that I must be missing something.
As an example, the 2018 Microsoft 10-K filing on the SEC website itself can be seen to have the 10-K text in which Item 7 includes the phrase "Management’s Discussion and Analysis of Financial Condition and Results". I searched for this phrase in the Dataset.
First, the following query should pull all the text from this filing:
SELECT *
FROM `bigquery-public-data.sec_quarterly_financials.txt`
WHERE submission_number="0001564590-18-019062"
However, searching the results of this query for the above phrase finds nothing.
A second attempt based on another StackOverflow answer gave me this, in which I try to search the entire dataset for that phrase in case it's stored in a different table:
SELECT *
FROM `bigquery-public-data.sec_quarterly_financials.*` t
WHERE REGEXP_CONTAINS(LOWER(TO_JSON_STRING(t)), r'/^discussion and analysis of financial condition$/')
No result!
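A side note on the pattern in the second query, independent of what the tables actually contain: REGEXP_CONTAINS takes a bare RE2 pattern, so the surrounding slashes are matched as literal characters, and ^ and $ anchor the match to the start and end of the whole string. The sketch below uses Python's re module as a stand-in for RE2 (an assumption; the two agree on these basic features), and row_json is a made-up string standing in for one serialized row.

```python
import re

# Hypothetical stand-in for LOWER(TO_JSON_STRING(t)) on a single row.
row_json = (
    '{"value":"item 7. management\'s discussion and analysis '
    'of financial condition and results"}'
)

# Pattern exactly as written in the query: literal slashes plus anchors.
as_written = r'/^discussion and analysis of financial condition$/'
# Plain substring pattern: no delimiters, no anchors.
substring = r'discussion and analysis of financial condition'

matches_as_written = re.search(as_written, row_json) is not None
matches_substring = re.search(substring, row_json) is not None

print(matches_as_written, matches_substring)
```

Under these assumptions the pattern as written cannot match even a row that contains the phrase, so an empty result from that query does not by itself show the text is absent.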
I can clearly find the same SEC filing, and yet the content within it seems to be missing. I've searched for other phrases and sections too, and the text does not seem to be there. Yet, based on all the Google documentation, I think it should be. What am I missing?
Alternatively, does anyone know of another source for parsing sections of SEC 10-K filings or the like? That would be useful too, and you could also answer this question with it.

Google data studio - Use multiple datasheet with same data keys/headers

I've been stuck on this for some days. I've tried a lot of search terms, but all of them seem to bring up the same answers, and I really need this:
I need to join the data of two different companies that belong to the same owner. Both use the same kind of data source (Excel sheets exported from FB Ads).
So they all share the same (keys/headers), like this:
COMPANY(1)'S ADS DATA
COMPANY(2)'S ADS DATA
So I need to put them together without having to join both of them in Excel every time, and also give him some nice data manipulation power.
The results should be something like this
So far I have been trying to join the data from the two companies, but I couldn't figure out how to do this properly. I've run some tests and tried reading a couple of articles and Google Data Studio's help files. The data merging function seems to mess everything up.
As a result of this merge, GDS gives me these fields:
Shouldn't I see only one field each for cnt and cmp? I've noticed that GDS creates not one but two data fields. If I try adding all the data I need as keys, the left sheet turns to all 0s. What am I doing wrong here?

I have read your descriptions. It seems that you are looking for a solution to append both tables instead of merging the tables.
Do note that the data blending in GDS is a left outer join.
Hence, instead of doing the blending in GDS, I'd suggest appending both datasets in Google Sheets in a separate tab before importing into GDS for visualisation (assuming you don't mind copy-pasting the data into the Google Sheet).
Here is the formula to append both datasets in Google Sheets:
={QUERY(A!A1:D1000,"SELECT A,B,C,D WHERE A <> ''",1);QUERY(B!A2:D,"SELECT A,B,C,D WHERE A <> ''",0)}
I've created some dummy data in this Google Sheet and appended the data using the formula provided; you may take a look to understand further.
If you are unclear on the difference between merge and append, you may take a look in the Google Sheet documentation as well.
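To make the difference concrete, here is a minimal Python sketch with made-up rows. The field names cmp and cnt are borrowed from the question; the values, and the reading of cmp as a campaign key and cnt as a count, are assumptions for illustration only.

```python
# Hypothetical rows for the two companies, sharing the same headers.
company_1 = [
    {"cmp": "campaign_a", "cnt": 10},
    {"cmp": "campaign_b", "cnt": 5},
]
company_2 = [
    {"cmp": "campaign_b", "cnt": 7},
    {"cmp": "campaign_c", "cnt": 3},
]

# Append (union): rows are stacked, so there is still one cmp and one cnt field.
appended = company_1 + company_2

# Left outer join on cmp: every left row is kept, and the right-hand cnt
# arrives as a second field that is empty when the key has no match.
right_by_key = {row["cmp"]: row for row in company_2}
joined = [
    {
        "cmp": left["cmp"],
        "cnt_left": left["cnt"],
        "cnt_right": right_by_key.get(left["cmp"], {}).get("cnt"),
    }
    for left in company_1
]

total_cnt = sum(row["cnt"] for row in appended)
print(appended)
print(joined)
```

The join yields two cnt fields and drops campaign_c entirely, which mirrors the duplicated fields seen after blending, while the appended list keeps all four rows under a single cnt.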
On a side note, I've screencast the process of answering this question and posted on my youtube channel. You may take a look if needed. (Thanks for the question and inspiration you provided for the video)

using sum and countifs to get a percentage of 'yes'es across multiple columns by month and team - is there a simpler way?

I've been asked to create a summary for some google form responses, and though I have a working solution, I can't help but feel there must be a more elegant one.
The form collects data related to case checking - every month each team (there's 100+ teams) has to check a certain number of cases based on how many staff are in their team, and enter the results for each case they've checked in the google form. The team that have set this up want me to summarise the data by team, month, and section of the form (preliminary questions, case recording, outcomes, etc). There are 8 sections on the live form, ranging from 1-13 questions, all with Yes/No/NA/blank answers.
(honestly, it's not how I'd have approached setting all this up, but that is out of my hands!)
So they're essentially looking for a live monthly summary with team names down the side, section names along the top, and a %age completed that will keep up with entries as they come in (where we can also use importrange and query to pull the relevant bits into other google sheet summaries, as and when needed).
What I've currently got is this:
=iferror(sum(countifs('Form Responses'!$B:$B,$A3,'Form
Responses'!$F:$F,"Yes",'Form Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1)),countifs('Form
Responses'!$B:$B,$A3,'Form Responses'!$G:$G,"Yes",'Form
Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1)),countifs('Form
Responses'!$B:$B,$A3,'Form Responses'!$H:$H,"Yes",'Form
Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1)),countifs('Form
Responses'!$B:$B,$A3,'Form Responses'!$I:$I,"Yes",'Form
Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1)),countifs('Form
Responses'!$B:$B,$A3,'Form Responses'!$J:$J,"Yes",'Form
Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1)),countifs('Form
Responses'!$B:$B,$A3,'Form Responses'!$K:$K,"Yes",'Form
Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1)))/(countifs('Form
Responses'!$B:$B,$A3,'Form Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1))*6),0)
It works, but it feels like a bit of a brute-force-and-ignorance solution. I've tried countifs with array formulas, I've looked at a pivot table but I can't get the section groups, and I've had a play with query but I can't figure out how to ask it to count all Yeses in multiple columns at once.
Is there a more elegant solution, or do I have to resign myself to setting up the next financial year's summaries like this?
Edit:

You can use plain array boolean multiplication to achieve the count, as TRUEs are converted to 1s and FALSEs are converted to 0s:
=TO_PERCENT(ARRAYFORMULA(
SUM((f!F1:K="Yes")*(f!E1:E>=B1)*(f!E1:E<EDATE(B1,1))*(f!B:B=A3))/
SUM(6*(f!E1:E>=B1)*(f!E1:E<EDATE(B1,1))*(f!B:B=A3))
)
)
Here, Form Responses has been renamed to f.
Numerator: SUM of the product of
  Question filter (f!F:K = "Yes") and
  Month filter (f!E:E is within the month of B1) and
  Team filter (f!B:B = A3)
Denominator: 6 times the SUM of the product of
  Month filter (f!E:E is within the month of B1) and
  Team filter (f!B:B = A3)
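The same masking idea can be sketched outside Sheets. The Python below is only an illustration: the responses rows, the team names and the helper names next_month and yes_rate are hypothetical, and month_start is assumed to be the first day of a month, like B1.

```python
from datetime import date

# Hypothetical form responses: (team, entry date, answers to six questions).
responses = [
    ("Team A", date(2021, 4, 3), ["Yes", "Yes", "No", "Yes", "NA", "Yes"]),
    ("Team A", date(2021, 4, 20), ["Yes", "No", "No", "Yes", "Yes", ""]),
    ("Team A", date(2021, 5, 2), ["Yes"] * 6),
    ("Team B", date(2021, 4, 9), ["No"] * 6),
]


def next_month(d):
    """First day of the following month, playing the role of EDATE(B1, 1)."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


def yes_rate(team, month_start):
    """Share of 'Yes' answers for one team in one month, via 0/1 masks."""
    month_end = next_month(month_start)
    numerator = 0
    denominator = 0
    for row_team, entry_date, answers in responses:
        # Booleans multiply as 1 and 0, so this is 1 only when both filters hold.
        row_mask = (row_team == team) * (month_start <= entry_date < month_end)
        numerator += sum((answer == "Yes") * row_mask for answer in answers)
        denominator += len(answers) * row_mask
    return numerator / denominator if denominator else 0.0


april_rate = yes_rate("Team A", date(2021, 4, 1))
print(april_rate)
```

Each filter contributes a 0 or 1 factor, so a row counts only when every filter is true, which is what the multiplication inside SUM does in the formula.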

On this sample sheet that you provided you'll notice two new tabs. MK.Retab and MK.Summary.
On MK.Retab is a single formula in A2 that "re-tabulates" all of your survey data into a format that is much easier to analyze going forward. That tab can be "hidden" on your real project. It will continue to build the 6 column dataset forever. It would be a sort of "back end" sheet, only used to supply data to any further downstream analysis.
On MK.Summary is a single formula in cell A1 that queries that dataset from MK.Retab and shows the percentage of Yeses by month, by section and by team, in a format similar to what you proposed. I coded it to display the most recent month at the left, immediately to the right of the team names, and to push historical data off to the right. Even though people are often used to seeing time go from left to right, I find the opposite method nice because it keeps you from having to scroll sideways to see the most recent data. It is very simple to change, should you want to, by getting rid of the "desc" that you find in the "order by" clause of the query string.
I find this kind of two-step solution to problems like yours useful because, while the summary might not be exactly what you want, it's always easier to build formulas and analyses off the data as laid out in the MK.Retab sheet.
As for the formula in MK.Retab, it is based on a method that I came up with a while back that constructs a large vlookup where the [search key] is actually a sequence of decimal numbers that is built by counting the number of rows in your real data set and multiplying by the number of columns of data that need to be repeated for each row. I built a demo some time ago that I'm happy to share with folks if you want to understand better how it works.
You said that your goal was to understand the formulas so that you could modify them going forward as needed. I'm not sure how easy that will be to do, but I can try my best to answer any questions you might have about the method or the solution generally.
What I can tell you is that some of the formulas are more complicated than they need to be because you just used Q1, Q2, Q3, etc. instead of the actual questions. If you had a list of the questions asked somewhere (on some other tab, say), along with what you wanted to call their corresponding "sections", it would make the formula significantly less complicated. As it stands, I had to use the appearance of the word "Comments" in row 1 to distinguish where one section ended and another began. The upside to that decision, though, is that the formula I wrote is infinitely expandable to the right. That is, if you were to add another 100 columns' worth of questions and answers to the sample set here, the formula would be able to handle that and break it out, so long as the word "Comments" appeared between each section.
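For readers who want the idea without the spreadsheet formula, the re-tabulation can be sketched in Python. This is not the formula on MK.Retab: the function name retabulate, the two identifying columns, the "Section N" labels and the sample header are all assumptions, and the output has five columns rather than the six on that tab. Only the rule that a "Comments" column closes a section comes from the description above.

```python
def retabulate(header, rows, id_columns=2):
    """Unpivot wide survey rows into (ids..., section, question, answer) rows.

    A header containing 'Comments' is treated as the end of the current
    section and is not emitted as a question.
    """
    long_rows = []
    for row in rows:
        ids = tuple(row[:id_columns])
        section = 1
        for name, value in zip(header[id_columns:], row[id_columns:]):
            if "Comments" in name:
                section += 1  # the next question belongs to a new section
                continue
            long_rows.append(ids + (f"Section {section}", name, value))
    return long_rows


# Hypothetical wide layout: two id columns, then questions split by Comments.
header = ["Team", "Date", "Q1", "Q2", "Comments", "Q3", "Comments"]
rows = [["Team A", "2021-04-03", "Yes", "No", "fine", "Yes", ""]]

long_rows = retabulate(header, rows)
for long_row in long_rows:
    print(long_row)
```

Because the section boundary is read from the header row, adding more question columns to the right needs no change to the function, as long as each section still ends with a Comments column.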
Hope all this helps.

Creating Dynamic Sheet Cell Reference List for pulling numbers to SUM

I've been working on building a data analysis sheet, which is quite verbose at the moment and a bit more complicated than it should be as I've been trying to figure this out. Please note, I work doing student data in a school.
Basically, I have two sets of input data:
1. Data imported from a CSV file that includes test data and codes for Common Core Standards, and the questions tied to those standards, as a whole-class summary
2. Data imported from a CSV file that includes individual scores by question
I am looking to construct 2 views:
1. A view that collates and displays data on individual standards per student. It includes a drop-down to change the standard, allowing a teacher to see class performance by standard in a broad view. The drop-down is populated dynamically from the input data (so staff could eventually dump data and go directly to reports).
2. A view that collates and displays data on individual students, broken down by performance on each standard, allowing a teacher to see the broader spectrum for each student. The student drop-down is populated from source list 2.
I have been able to build the first view, but am struggling with the second. I've been able to separate the question codes and develop strings of cell references to the scoring data, including a dynamic reference to the row the selected student's score data appears on in the second source set from above.
I tried to pass through an indirect() formula into a sum() so as to process for a mean evaluation, and have encountered errors. I think SUM() doesn't process comma-separated cell reference lists from Indirect() [or in general] or there is something that I am missing to help parse it. Here is the formula I have tried:
=Sum(vlookup(D7,CCCodeManip!$A:$C,3,false))
CCCodeManip!C:C includes the created text (based on the dynamic standards and question codes, etc), here's an example of what would be found there:
'M-ADI'!M17, 'M-ADI'!N17, 'M-ADI'!O17, 'M-ADI'!P17, 'M-ADI'!Q17, 'M-ADI'!R17, 'M-ADI'!J17
I need these to be dynamic so that teachers can input different sets of standards, question, and student data and the sheet automatically collates and reports it in uniform ways (with an upward bound of 20 standards as I currently have it built)
Here is a link to the sheet I built, with names and ID anonymized. There's a CRAP TON of sub-tabs, and that's really just being able to split apart and re-combine data neatly without things error-ing out due to data overlapping, aside from a few different attempts and different approaches to parse the cell reference strings.
The first two tabs are the current status of the data views. I plan to hide a bunch of the functional stuff that is there to help pull data accurately.
The 3rd and 4th tab are the source data sets. 5th is a modified version of source data that allows me to reference things better, and I've tried to arrange the sheets most relevant towards the front of the set.
https://docs.google.com/spreadsheets/d/1fR_2n60lenxkvjZSzp2VDGyTUO6l-3wzwaV4P-IQ_5Y/edit?usp=sharing
Does anyone have a different approach? I am aware that I might be as far as I can go with this and perhaps should consider scripts. My coding experience is a bit out of date and my strength is more with formulas, but I can dig into things with some direction, if anyone can help.

Ok so I noticed something.
It seems the failure is in the indirect reference:
=indirect(CCCodeManip!C3)
The string I am trying to parse via indirect is going to be generated into something like this, dynamic from reference to other data:
'M-ADI'!M17, 'M-ADI'!N17, 'M-ADI'!O17, 'M-ADI'!P17, 'M-ADI'!Q17, 'M-ADI'!R17, 'M-ADI'!J17
INDIRECT returns a #REF! error, saying that the above string is not a valid cell reference.
Can someone give me a clue as to what is causing this? I am going to dig into Google's docs on INDIRECT() and will post anything that I find.
Perhaps INDIRECT() can't handle lists, only single references and ranges, which may require me to build a sheet that does the SUM formula for each question set (?)
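That guess matches how INDIRECT is documented: it resolves one reference string at a time. The Python sketch below models that behaviour under stated assumptions. The cells dictionary and its scores are made up, and indirect and sum_reference_list are hypothetical helpers, not Sheets functions.

```python
# Hypothetical cell store standing in for row 17 of the 'M-ADI' sheet.
cells = {
    "'M-ADI'!M17": 1,
    "'M-ADI'!N17": 0,
    "'M-ADI'!O17": 1,
    "'M-ADI'!P17": 1,
    "'M-ADI'!Q17": 0,
    "'M-ADI'!R17": 1,
    "'M-ADI'!J17": 1,
}


def indirect(reference):
    """Resolve exactly one reference; anything else fails, like #REF!."""
    if reference not in cells:
        raise KeyError(f"#REF!: {reference!r} is not a single cell reference")
    return cells[reference]


def sum_reference_list(reference_list):
    """Split a comma-separated reference string and resolve each part."""
    references = [part.strip() for part in reference_list.split(",")]
    return sum(indirect(reference) for reference in references if reference)


reference_list = (
    "'M-ADI'!M17, 'M-ADI'!N17, 'M-ADI'!O17, 'M-ADI'!P17, "
    "'M-ADI'!Q17, 'M-ADI'!R17, 'M-ADI'!J17"
)
total = sum_reference_list(reference_list)
print(total)
```

Passing the whole comma-separated string to indirect fails in this model, while splitting it first and resolving each piece works, which suggests the list has to be broken up before any lookup happens.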

So I think I figured it out, but I ended up parsing the data differently: I do the sum based on individual cell references and a separate sum formula, bypassing the need to do it all at once. It just makes my sheets a lot dirtier! I will eventually see whether code could do it better if I need to, but this is closed for now.
Basically, I used individual cell references to recall the scores in a row, then used a separate SUM formula, and created references and structures to pull those SUM() results. It achieves the same end, but with extra crap on the sheet.

The data in my Google Sheet is not appearing in the correct cell

I am trying to create a spreadsheet to simplify our account returns. I am using a variety of named ranges to make life easier. I have created a test sheet which automatically copies the inputted cost to its appropriate category.
I am having a strange issue where the data does not appear in the cell I expect. I am wondering if the 2 data validation lists I have created could be causing the issue. I had originally copied and pasted from an old sheet, but I wondered whether some strange formatting had been carried over and was causing the issue, so I have since entered all the data manually to rule this out.
https://docs.google.com/spreadsheets/d/1KC8FsVNQZfWtey5TvPDxCxvDhRrFVHdsbDJWmZu73wg/edit#gid=0
This is the test sheet in question. The costs for entries Test 9 and Test 10 should be in the Info Books and Stationary sections respectively, but they are ending up in the wrong places.
I am not a spreadsheet expert, so I apologise if I am missing something blatantly obvious. A friend advised me to ask on Stack Overflow after many hours lost to this problem.
Thanks in advance for any help you may be able to give.

Set the fourth parameter of VLOOKUP (is_sorted) to FALSE (or zero) so that it looks for an exact match:
=IF(VLOOKUP(companyOfPurchase,suppliersAndCategories,2, 0) = typeOfPurchase,totalOfReceipt,"")
and see if that works?
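Why this can matter: when is_sorted is left at its default of TRUE, VLOOKUP assumes the first column is sorted and returns an approximate match, so an unsorted supplier list can land on the wrong row. The Python sketch below imitates that with a binary search. The supplier names and categories are made up, and bisect_right only approximates the idea; it is not claimed to be the algorithm Sheets uses.

```python
from bisect import bisect_right

# Hypothetical, unsorted lookup table: supplier -> category.
suppliers = ["Waterstones", "Amazon", "Staples"]
categories = ["Info Books", "Misc", "Stationary"]


def approximate_lookup(key):
    """Binary search that assumes sorted keys, like is_sorted = TRUE."""
    index = bisect_right(suppliers, key) - 1
    return categories[index] if index >= 0 else None


def exact_lookup(key):
    """Linear scan for an identical key, like is_sorted = FALSE."""
    for supplier, category in zip(suppliers, categories):
        if supplier == key:
            return category
    return None


approximate_result = approximate_lookup("Waterstones")
exact_result = exact_lookup("Waterstones")
print(approximate_result, exact_result)
```

In this made-up table the approximate search ends on the last row and returns the wrong category, while the exact search finds the intended one.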

Wikipedia pageviews analysis

I've been challenged with a Wikipedia pageviews analysis. For me this is the first project with such an amount of data and I'm a bit lost. When I download the file from the link and unpack it, I can see that it has a table-like structure with rows looking like this:
1    | 2                                | 3 | 4
en.m | The_Beatles_in_the_United_States | 2 | 0
I struggle with finding out what exactly can be found in each column. My guesses:
1. language version and additional info (.m = mobile?)
2. name of the article
My biggest concern is with the last two columns. The last one has only "0" values in it and I have no idea what it represents. I'd assume, then, that the third one shows the number of views, but I'm not sure.
I'd be grateful if someone could help me to understand what exactly can be found in each column or recommend some reading on this subject. Thanks!

After more time spent on this, I've finally found a solution. I'm posting it in case someone has the same problem in the future. Wikipedia explains what can be found in the database. These explanations were painful to find, but you can access them here and here.
Based on that, you can see that rows have the following structure:
1. domain_code
2. page_title
3. count_views
4. total_response_size (no longer maintained)
Some explanations for each column:
Column 1:
Domain name of the request, abbreviated. (...) Domain_code now can
also be an abbreviation for mobile and zero domain names, in which
case .m or .zero is inserted as second part of the domain name (just
like with full domain name). E.g. 'en.m.v' stands for
"en.m.wikiversity.org".
Column 2:
For page-level files, it holds the title of the unnormalized part
after /wiki/ -in the request Url (E.g.: Main_Page Berlin). For
project-level files, it is - .
Column 3:
The number of times this page has been viewed in the respective hour.
Column 4:
The total response size caused by the requests for this page in the
respective hour.
If I understand it correctly, response size is discontinued due to low accuracy. That's why there are only 0s:
The pagecounts and projectcounts files also include total response byte
sizes at their respective aggregation level, but this was dropped from
the pageviews and projectviews files because it wasn't very accurate.
Hope someone finds it useful.
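As a minimal sketch of reading one such row in Python: it assumes the four fields are separated by single spaces, as in the sample row from the question. PageviewRow, parse_line and is_mobile are hypothetical helper names, and the mobile check simply looks for an "m" part after the language code, following the 'en.m.v' example quoted above.

```python
from typing import NamedTuple


class PageviewRow(NamedTuple):
    domain_code: str
    page_title: str
    count_views: int
    total_response_size: int


def parse_line(line):
    """Split one pageviews line into its four documented fields."""
    domain_code, page_title, views, size = line.rstrip("\n").split(" ")
    return PageviewRow(domain_code, page_title, int(views), int(size))


def is_mobile(domain_code):
    """True when '.m' appears after the language part, e.g. 'en.m'."""
    return "m" in domain_code.split(".")[1:]


row = parse_line("en.m The_Beatles_in_the_United_States 2 0")
print(row)
print(is_mobile(row.domain_code))
```

For the sample row this gives 2 views from the mobile English Wikipedia, with the fourth field left at 0 because response sizes are no longer maintained.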

Line format:
wiki code (subproject.project)
article title
monthly total (with interpolation when data is missing)
hourly counts
(From pagecounts-ez which is the same dataset just with less filtering.)
Apparently buggy though; it takes the first two parts of the domain name for wiki code, which does not work for mobile domains (which are in the form <language>.m.<project>.org).
