Using a Google query to download parts of a published sheet - google-sheets

This works:
curl 'https://docs.google.com/spreadsheets/d/e/2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc/pub?gid=911257845&single=true&output=csv'
However, I want to pick up only the rows where count > 300.
The query before encoding would be
select * where F > 300
After encoding
select%20*%20where%20F%3E300
So the url becomes
https://docs.google.com/spreadsheets/d/e/2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc/pub?gid=911257845&output=csv&tq=select%20*%20where%20F%3E300
The URL above retrieves a file, but it returns the whole sheet and doesn't filter.
Note that a published web sheet has the form
https://docs.google.com/spreadsheets/d/e/KEY/pub?gid=GID
https://docs.google.com/spreadsheets/d/e/2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc/pub?gid=911257845
This works. Adding &output=csv to it (no space before the &) also works and downloads a CSV file, which opens in Excel and shows the data in the table.
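For reference, the published CSV endpoint can also be read directly from Python. A minimal sketch with pandas (it still returns the whole, unfiltered sheet, which is exactly the problem):

import pandas as pd

pub_url = (
    "https://docs.google.com/spreadsheets/d/e/"
    "2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc"
    "/pub?gid=911257845&single=true&output=csv"
)
df = pd.read_csv(pub_url)  # downloads the whole published sheet, unfiltered
print(df.head())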
I tried this:
https://docs.google.com/spreadsheets/d/e/2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc/pub?gid=911257845&output=csv&tq=select%20*%20where%20F%3E%20300
and
https://docs.google.com/spreadsheets/d/e/2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc/gviz/tq?gid=911257845&output=csv&tq=select%20*%20where%20F%3E300
and get errors -- resource not available.
The page above should be public for people who want to try.
This may be an issue between publishing a sheet and sharing a whole spreadsheet with anyone who has the link.
I've created a new page that uses importrange() to pull in the page from the main sheet, and that one is public.
https://docs.google.com/spreadsheets/d/1-lqLuYJyHAKix-T8NR8wV8ZUUbVOJrZTysccid2-ycs/edit?usp=sharing

How about this modification?
Modification points :
When you use a query, please use a URL like https://docs.google.com/spreadsheets/d/### file ID ###/gviz/tq?gid=###&tq=### query ###.
When select%20*%20where%20%F%3E300 is decoded, it is select * where %F>300, which is not the intended query.
The correct encoding of select * where F > 300 is select%20%2a%20where%20F%20%3e%20300.
In order to output CSV, please use tqx=out:csv.
Please share the Spreadsheet.
On Google Drive
On the Spreadsheet file
right-click -> Share -> Advanced -> Click "change" at "Private - Only you can access"
Check "On Anyone with the link"
Click "Save"
At "Link to share", copy URL.
Retrieve file ID from https://docs.google.com/spreadsheets/d/### file ID ###/edit?usp=sharing
Modified curl command :
curl 'https://docs.google.com/spreadsheets/d/### file ID ###/gviz/tq?gid=911257845&tq=select%20%2a%20where%20F%20%3e%20300&tqx=out:csv'
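For anyone scripting this, here is a rough Python equivalent of the curl command above (### file ID ### is still a placeholder, and urllib.parse.quote produces the encoded query shown earlier):

import urllib.parse
import urllib.request

file_id = "### file ID ###"        # placeholder for the real spreadsheet file ID
query = "select * where F > 300"   # Query Language syntax

url = (
    "https://docs.google.com/spreadsheets/d/" + file_id + "/gviz/tq"
    "?gid=911257845&tqx=out:csv&tq=" + urllib.parse.quote(query)
)
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8"))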
Reference :
Query Language Reference
If I misunderstand your question, I'm sorry.
Edit :
The following two URLs compare your URL with the one from my answer; I have aligned my answer's URL with yours.
1. Your URL
https://docs.google.com/spreadsheets/d/e/2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc/gviz/tq?gid=911257845&output=csv&tq=select%20*%20where%20F%3E300
When the above URL is separated:
https://docs.google.com/spreadsheets/d/e/
The e/ part is not required.
2PACX-1vS3iBtVf4i_won5zAN9NGPqhcd6CcTb-4QHxpisSjCmlgV95B6mFmZvtMaC9GPvD7m8kD-6XLkVAhfc
This is not the file ID of the spreadsheet.
/gviz/tq
gid=911257845
output=csv
tq=select%20*%20where%20F%3E300
2. The URL from my answer, matched to yours
https://docs.google.com/spreadsheets/d/### file ID ###/gviz/tq?gid=###&tqx=out:csv&tq=### query ###
When the above URL is separated:
https://docs.google.com/spreadsheets/d/
### file ID ###
You can find details about the spreadsheet's file ID here.
/gviz/tq
gid=###
You can use gid=911257845.
tqx=out:csv
This has to be used instead of output=csv.
tq=### query ###
You can use tq=select%20*%20where%20F%3E300.
Note :
The numbered components above correspond to each other.
And please share the Spreadsheet as described in the steps above (this is different from "Publish to the web" on the Spreadsheet), then retrieve the file ID from the "Link to share" URL.

Related

Unable to import text using IMPORTXML and XPath inside div

I'm using Google Sheets with IMPORTXML to scrape download-count information from a Japanese website via XPath. I want to save the number/text inside this red box.
Here's the link:
https://www.photo-ac.com/main/detail/4465781?title=%E3%82%A2%E3%82%B2%E3%83%8F%E8%9D%B6%E3%81%A8%E3%83%92%E3%83%A3%E3%82%AF%E3%83%8B%E3%83%81%E3%82%BD%E3%82%A6
Here's my function:
=IMPORTXML("https://www.photo-ac.com/main/detail/4465781?title=アゲハ蝶とヒャクニチソウ", "/html/body/div[17]/div/div/div/div[2]/div[7]/div[1]/div[1]/div/div[3]/div[2]/div[1]//text()")
The function doesn't work. Why?
Thank you.
When I tested your formula, I confirmed that a Could not fetch url: error occurred. Fortunately, when Google Apps Script is used, the URL can be requested with UrlFetchApp. So, in this answer, I would like to propose using Google Apps Script. The sample script is as follows.
Sample script:
Please copy and paste the following script into the script editor of the Google Spreadsheet, save it, and put the formula =SAMPLE("URL") in a cell. If the function name is not found, please reopen the Google Spreadsheet and test again. This script is used as a custom function.
function SAMPLE(url) {
  // Fetch the page and extract the "ダウンロード: ..." (download count) text.
  const value = UrlFetchApp.fetch(url).getContentText().match(/ダウンロード:.+/);
  if (!value) throw new Error("Value was not retrieved.");
  return value;
}
Result:
When the above script is used, the following result is obtained.
Note:
This sample script is written for the current HTML of https://www.photo-ac.com/main/detail/4465781?title=アゲハ蝶とヒャクニチソウ. If the structure of that HTML changes, the script above might stop working. Please be careful about this.
References:
Custom Functions in Google Sheets
fetch(url)

SEC company filings: Is the <SEC-HEADER> tag valid SGML? If so, how to parse it?

I tried to parse SEC company filings from sec.gov. Starting from the fb 10-Q index.htm, let's look at a complete submission text filing. It has a structure like:
<SEC-DOCUMENT>
<SEC-HEADER>
<ACCEPTANCE-DATETIME>"some content" This tag is not closed.
"some lines resembling yaml markup"
These are indented lines with a
"key": "value" structure.
</SEC-HEADER>
<DOCUMENT>
.
.
some content
.
.
</DOCUMENT>
"several DOCUMENT tags" ...
</SEC-DOCUMENT>
I tried to figure out the structure of the <SEC-HEADER> tag and found some information in the Public Dissemination Service (PDS) Technical Specification (pdf), and concluded that the content of the header should be SGML.
Nevertheless, I am clueless about the formatting, since there are no angle brackets, and the key-value pairs are separated by colons, like key: value instead of <key>value</key>. In the pdf I could not find anything about colons.
Question: Is the <SEC-HEADER> tag valid SGML? If it is, how to parse it?
I'd be glad at any help.
The short answer is no: the <SEC-HEADER> tag in the raw filing is not valid SGML.
However, it is my understanding that this section in the raw filing is parsed automatically from the header file <accession_num>.hdr.sgml, which does follow SGML. This header file can be found in the same directory as the raw filing (i.e., the <accession_num>.txt file).
I use a REGEX of the form ^<(.+?)>(.+?)$ (with the re.MULTILINE option) to capture each (tag, value) tuple and get the results directly into a dict().
I believe the only tag in that file that has a closing tag is <FILER>, and there can be multiple filers in each filing. You can first extract those using a REGEX of the form <FILER>(.+?)</FILER> and then apply the same REGEX as above to get the inner tags for each filer.
Note that other than 'FILER', there could be other tags, representing different relations of the entities to the filing. Those are 'ISSUER', 'SUBJECT COMPANY', 'FILED BY', 'FILED FOR', 'SERIAL COMPANY', 'REPORTING OWNER'.
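As a rough illustration of the regex approach described above (a sketch only; header_text is assumed to already hold the contents of the <accession_num>.hdr.sgml file):

import re

def parse_header(header_text):
    # Pull the repeating <FILER>...</FILER> blocks out first, since their
    # inner tags would otherwise collide in the flat dict below.
    filer_blocks = re.findall(r"<FILER>(.+?)</FILER>", header_text, re.DOTALL)
    filers = [
        dict(re.findall(r"^<(.+?)>(.+?)$", block, re.MULTILINE))
        for block in filer_blocks
    ]

    # Every remaining "<TAG>value" line becomes one (tag, value) pair.
    remainder = re.sub(r"<FILER>.+?</FILER>", "", header_text, flags=re.DOTALL)
    fields = dict(re.findall(r"^<(.+?)>(.+?)$", remainder, re.MULTILINE))
    return fields, filers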

How do some sites download YouTube captions?

This is somewhat of a duplicate of Does YouTube API forbid to download video captions if you are not it's owner? and Get YouTube captions, which both basically say it's not possible to download captions via the YouTube API unless you are the owner or third-party contributions are enabled; however, my question is how do sites like http://downsub.com/ or http://www.lilsubs.com/ have access to all captions?
In other words, when I access the YouTube API myself (even with the youtubepartner and youtube.force-ssl scopes), I can only download the captions of some videos. For the same videos that fail for me with 403: The permissions associated with the request are not sufficient to download the caption track. The request might not be properly authorized, or the video order might not have enabled third-party contributions for this caption., these other sites work fine. I'm assuming they are using the YouTube API to access the captions, but what special sauce are they using? Some special partner key? A different API version? Are they just scraping from the videos themselves, or something?
Send a GET request to:
http://video.google.com/timedtext?lang={LANG}&v={VIDEOID}
Example for your video in comment: http://video.google.com/timedtext?lang=ko&v=0db1_qWZjRA
Let's look at another example of yours, i.e. https://www.youtube.com/watch?v=7068mw-6lmI (and I agree about the differentiation part in your comment).
There are multiple subtitle tracks available for the video:
English
Korean
Spanish
Korean (auto-generated), also called asr (automatic speech recognition)
These names are the values of the subtitle name parameter (e.g., name=English).
lang is the language code.
In your example: https://www.youtube.com/api/timedtext?lang=es-MX&v=7068mw-6lmI&name=Spanish
If a subtitle track is available, it is possible to translate from it using the tlang parameter.
https://www.youtube.com/api/timedtext?lang=en&v=7068mw-6lmI&name=English&tlang=lv
https://www.youtube.com/api/timedtext?lang=ko&v=7068mw-6lmI&name=Korean&tlang=lv
This would be my bet for what these sites are using, i.e. translation of an available subtitle track (you can confirm by feeding one of those sites a video without any subtitle track).
As for asr, a signature seems to always be needed, but as long as one of the subtitle tracks is available, you can use that for translation. E.g., in your OP comment example:
https://www.youtube.com/api/timedtext?lang=en&v=vx6NCUyg1NE&tlang=lv
It looks like the last example is special, with both subtitle tracks being asr (checked with Chrome -> Inspect -> Network), so you need to omit the subtitle name parameter. This difference unfortunately is not visible in the YouTube video's settings wheel.
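For completeness, a minimal Python sketch of the timedtext requests above (the endpoint is unofficial, so parameters and behaviour may change; lang, name and tlang are the same values as in the example URLs):

import requests

params = {
    "lang": "en",         # language code of the existing track
    "v": "7068mw-6lmI",   # video ID
    "name": "English",    # track name, if the video defines one
    "tlang": "lv",        # optional: translate the track into this language
}
resp = requests.get("https://www.youtube.com/api/timedtext", params=params, timeout=30)
print(resp.status_code)
print(resp.text[:500])    # XML captions; the body is empty if no track matches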
A 2022 answer:
Option 1: Send a curl request to the webpage: curl -L "https://youtu.be/YbJOTdZBX1g", search for timedtext in the result, and you will get a URL. Replace \u0026 with & and you have the link to the subtitles.
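A rough Python version of Option 1, assuming the watch page still embeds escaped timedtext URLs in its player config (this can change at any time):

import re
import requests

html = requests.get("https://youtu.be/YbJOTdZBX1g", timeout=30).text

# Look for an escaped caption URL in the page source and unescape \u0026.
match = re.search(r'https://www\.youtube\.com/api/timedtext[^"]*', html)
if match:
    print(match.group(0).replace("\\u0026", "&"))
else:
    print("No timedtext URL found in the page source.")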
Option 2: Use the yt-dlp package:
# For installing see: https://github.com/yt-dlp/yt-dlp#with-pip
from yt_dlp import YoutubeDL
ydl_opts = {
    "skip_download": True,
    "writesubtitles": True,
    "subtitleslangs": ["all", "-live_chat"],
    # Looks like formats available are vtt, ttml, srv3, srv2, srv1, json3
    "subtitlesformat": "json3",
    # You can skip the following option
    "sleep_interval_subtitles": 1,
}
with YoutubeDL(ydl_opts) as ydl:
    ydl.download(["YbJOTdZBX1g"])
There is this unofficial API used by YouTube:
https://www.youtube.com/api/timedtext?lang={LANG}&v={VIDEO_ID}
LANG here is the ISO 639-1 two-letter language code. For your example it would be:
https://www.youtube.com/api/timedtext?lang=ko&v=0db1_qWZjRA
You can check it in the Network tab while toggling the closed-captions button:
I have used youtube-transcript-api successfully to retrieve transcripts. Below is a demo that dumps the transcript into HTML with links back to the timestamps in the video:
import sys
from youtube_transcript_api import YouTubeTranscriptApi

video_id = sys.argv[1]

# Retrieve the available transcripts
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

# Just use the first transcript, let it raise an exception if none exist.
transcript = next(iter(transcript_list))

print("<html><body>")
for line_map in transcript.fetch():
    # Split the start offset (in seconds) into minutes and seconds.
    st_min = int(line_map['start'] / 60)
    st_sec = int(line_map['start'] - st_min * 60)
    link_to_tstmp = f"https://youtu.be/{video_id}?t={st_min * 60}"
    # Pad to a fixed width; &nbsp; keeps the alignment when rendered as HTML.
    tstmp_str = ("%2d:%-2d" % (st_min, st_sec)).replace(" ", "&nbsp;")
    print("""<a href="%s">%s</a> %s<br/>""" % (link_to_tstmp, tstmp_str, line_map['text']))
print("</body></html>")
If there are multiple transcripts, the library provides an API to search by language, etc.
You can further tweak the logic to merge text so you only get one link every so many minutes. I got good results for a lecture by linking every minute and formatting the lines into an HTML table.
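For example, selecting a transcript by language with the same library might look like this (a sketch against the older, dict-based API used in the demo above; the video ID is just the one from the earlier answer):

from youtube_transcript_api import YouTubeTranscriptApi

transcript_list = YouTubeTranscriptApi.list_transcripts("YbJOTdZBX1g")

# find_transcript() prefers manually created tracks over auto-generated ones
# and raises an exception if none of the requested languages exist.
transcript = transcript_list.find_transcript(["en", "ko"])
for line_map in transcript.fetch():
    print(line_map["start"], line_map["text"])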

TCPDF generated PDF to automatically display in Acrobat Reader

I am able to use TCPDF and generate a PDF in the browser using JQuery/JavaScript:
window.open("", "pdfWindow",scrollbars=yes, resizable=yes, top=500, left=500, width=400, height=400");
$("#" + formID).attr('action','tcpdf/example/genReport.pdf').attr('target','pdfWindow');
In genReport.pdf, I am using $pdf->Output('genReport.pdf', 'I');
When genReport.pdf is generated, it appears in a new tab with the standard browser settings. I wanted to know if there is a way to have the generated PDF automatically display in Acrobat Reader?
Any help will be greatly appreciated.
According to the documentation of the Output() function, the second parameter can be one of the following:
I: send the file inline to the browser (default). The plug-in is used if available. The name given by name is used when one selects the "Save as" option on the link generating the PDF.
D: send to the browser and force a file download with the name given by name.
F: save to a local server file with the name given by name.
S: return the document as a string (name is ignored).
FI: equivalent to F + I option
FD: equivalent to F + D option
E: return the document as base64 mime multi-part email attachment (RFC 2045)
So I'd suggest using $pdf->Output('genReport.pdf', 'D'); this will open the download dialog and the user can choose to either open or download the file.

Lucene/Solr unexpected query answer

I'm using Solr 4.4.0 running on Tomcat 7.0.29.
The solrconfig.xml is as delivered (except for the Solr home directory, of course).
I could pass on the schema.xml, though I doubt this would help much, as the following will show.
If I select all documents containing "russia" in the text, which is the default field (i.e. if I execute the query "russia"), I find only 1 document, which is correct.
If I select all documents containing "web" in the text ("web"), the result is 29, which is also correct.
If I search for all documents that do not contain "russia" ("NOT(russia)"), the result is still correct (202).
If I search for all documents that contain "web" and do not contain "russia" ("web AND NOT(russia)"), the result is, once again, correct (28, because the document containing "russia" also contains "web").
But if I search for all documents that contain "web" or do not contain "russia" ("web OR NOT(russia)"), the result is still 28, though I should get 203 matches (the whole set).
Has anyone got an explanation?
For information, AND and OR work correctly if I don't use a NOT anywhere in the query, e.g.:
"web AND russia" --> OK
"web OR russia" --> OK
I got a solution from Yonik Seeley, which is to translate NOT(russia) into (*:* -russia), so that a positive set (*:*, i.e. all documents) is available to subtract -russia from. This solution works very well. I still believe it would be a good idea to modify the parser so that the straightforward request "web OR NOT(russia)" works without this translation.
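A quick way to sanity-check the rewritten query against Solr's select handler (the host, port and core name below are assumptions; adjust them to your installation):

import requests

params = {
    "q": "web OR (*:* -russia)",   # NOT(russia) rewritten as (*:* -russia)
    "wt": "json",
    "rows": 0,                     # only numFound is needed here
}
resp = requests.get("http://localhost:8983/solr/collection1/select", params=params)
print(resp.json()["response"]["numFound"])  # should be 203 for the data described above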
