How to programmatically get holdings of an ETF - YQL

I am looking for a way to get the holdings list of an ETF via a web service such as Yahoo Finance. So far, YQL has not yielded the desired results.
As an example, ZUB.TO is an ETF that has holdings. Querying the yahoo.finance.quotes table returns a result, but it does not contain the holdings.
Is there another table somewhere that would contain the holdings?

Downloading from Yahoo Finance may simply not work any more.
Instead, how about using the download endpoints the ETF providers themselves already offer for exporting holdings as Excel or CSV files?
Import the append_df_to_excel helper (from the Stack Overflow answer linked in the code below), then use the code below to build an Excel workbook with the holdings of all 11 Sector SPDRs provided by SSgA (State Street Global Advisors).
Personally I use this for doing breadth analysis.
import pandas as pd
# append_df_to_excel comes from:
# https://stackoverflow.com/questions/20219254/how-to-write-to-an-existing-excel-file-without-overwriting-data-using-pandas
from append_to_excel import append_df_to_excel
##############################################################################
# Author: Salil Gangal
# Posted on: 08-JUL-2018
# Forum: Stack Overflow
##############################################################################
output_file = r'C:\my_python\SPDR_Holdings.xlsx'  # raw string so the backslashes are not escapes
base_url = "http://www.sectorspdr.com/sectorspdr/IDCO.Client.Spdrs.Holdings/Export/ExportExcel?symbol="
data = {
    'Ticker': ['XLC', 'XLY', 'XLP', 'XLE', 'XLF', 'XLV', 'XLI', 'XLB', 'XLRE', 'XLK', 'XLU'],
    'Name': ['Communication Services', 'Consumer Discretionary', 'Consumer Staples', 'Energy',
             'Financials', 'Health Care', 'Industrials', 'Materials', 'Real Estate',
             'Technology', 'Utilities']
}
spdr_df = pd.DataFrame(data)
print(spdr_df)
for i, row in spdr_df.iterrows():
    url = base_url + row['Ticker']
    df_url = pd.read_excel(url)
    header = df_url.iloc[0]          # the first data row holds the real column names
    holdings_df = df_url[1:].copy()
    holdings_df.columns = header
    print("\n\n", row['Ticker'], "\n")
    print(holdings_df)
    append_df_to_excel(output_file, holdings_df, sheet_name=row['Ticker'], index=False)
[Image: Excel workbook generated for the SPDRs]

Related

How to get Twitter mentions id using academictwitteR package?

I am trying to create several network analyses from Twitter data. To get the data, I used the academictwitteR package and its get_all_tweets function.
get_all_tweets(
  users = c("LegaSalvini"),
  start_tweets = "2007-01-01T00:00:00Z",
  end_tweets = "2022-07-01T00:00:00Z",
  file = "tweets_lega",
  data_path = "tweetslega/",
  bind_tweets = FALSE
)
## Binding JSON files into data.frame objects
tweets_bind_lega <- bind_tweets(data_path = "tweetslega/")
## Tidying
tweets_bind_lega_tidy <- bind_tweets(data_path = "tweetslega/", output_format = "tidy")
With this, I can easily access the IDs needed to build retweet and reply networks. However, the tidy format does not provide a tidy column for the mentions; instead it drops them.
They are still present in my untidy df tweets_bind_lega, stored as a list in tweets_bind_lega$entities$mentions. Now I would like to somehow unnest this list and create a tidy df with a column that contains the mentioned Twitter user IDs.
Has anyone created a mention network with academictwitteR before and can help me out?
Thanks!
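If you are open to flattening the raw JSON outside R, here is a minimal Python sketch. It assumes the data_*.json files that academictwitteR writes to data_path each contain a JSON array of tweet objects with an entities.mentions list; check one of your own files first, since the exact layout may differ.
import glob
import json

import pandas as pd

rows = []
for path in glob.glob("tweetslega/data_*.json"):
    with open(path, encoding="utf-8") as f:
        tweets = json.load(f)
    for tweet in tweets:
        # entities.mentions is absent for tweets that mention nobody
        mentions = (tweet.get("entities") or {}).get("mentions") or []
        for mention in mentions:
            rows.append({
                "tweet_id": tweet["id"],
                "author_id": tweet.get("author_id"),
                "mentioned_username": mention.get("username"),
                "mentioned_user_id": mention.get("id"),
            })

# One row per (tweet, mentioned user) pair - an edge list for a mention network
mentions_df = pd.DataFrame(rows)
print(mentions_df.head())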

Saving SEC 10-K annual report text to files (trouble with decoding)

I am trying to bulk-download the text visible to the "end user" from 10-K SEC EDGAR reports (I don't care about tables) and save it to a text file. I found the code below on YouTube, but I am facing two challenges:
I am not sure if I am capturing all of the text, and when I print the text from the URL below, I get very strange output (special characters, e.g. at the very end of the printout).
I can't seem to save the text to .txt files; I'm not sure if this is due to encoding (I am entirely new to programming).
import re
import requests
import unicodedata
from bs4 import BeautifulSoup

def restore_windows_1252_characters(restore_string):
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, restore_string)

# define the url to the specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt"
# grab the response
response = requests.get(new_html_text)
page_soup = BeautifulSoup(response.content, 'html5lib')
page_text = page_soup.html.body.get_text(' ', strip=True)
# normalize the text and restore missing windows-1252 characters
page_text_norm = restore_windows_1252_characters(unicodedata.normalize('NFKD', page_text))
# print: this works, however it gives me weird special characters (e.g. at the very end)
print(page_text_norm)
# save to file: this only gives me an empty text file
with open('testfile.txt', 'w') as file:
    file.write(page_text_norm)
Try this. If you give the data you expect as an example, it will be easier for people to understand your needs.
from simplified_scrapy import SimplifiedDoc,req,utils
url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
html = req.get(url)
doc = SimplifiedDoc(html)
# text = doc.body.text
text = doc.body.unescape() # Converting HTML entities
utils.saveFile("testfile.txt",text)
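As an aside: the empty testfile.txt in the original code is most likely a write-time encoding failure. On Windows, open() defaults to a legacy codec that cannot represent some of the scraped characters, so the write fails and the truncated file is left empty. Passing an explicit encoding should fix it:
# write with an explicit encoding so exotic characters don't abort the write
with open('testfile.txt', 'w', encoding='utf-8') as file:
    file.write(page_text_norm)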

Web Scraping - BeautifulSoup parsers do not seem to work

I am trying to extract the names of a few items from the URL below. The node and class_ point to the right content, but when I use find_all I do not get back any results. From previous posts it seems that this problem might be connected to using the wrong parser. I have used xml, lxml and others, but nothing seems to work.
Is anyone able to extract the content successfully?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import html5lib
import urllib3
url_pb = 'https://www.pullandbear.com/it/uomo/accessori/zaini-c1030207088.html'
req_pb = requests.get(url_pb)
pars_pb = BeautifulSoup(req_pb.content, 'html.parser')
con_pb = pars_pb.find_all('div', class_ = 'name namorio')
UPDATE
I have managed to find the info I needed, hidden in another section of the same page that is visible on inspection. I extracted it using this code:
url_pb = 'https://www.pullandbear.com/it/uomo/accessori/zaini-c1030207088.html'
req_pb = requests.get(url_pb)
pars_pb = BeautifulSoup(req_pb.content, 'html.parser')
con_pb = pars_pb.find_all('li', class_ = False)
names_pb = [c.select("a > p")[0].text for c in con_pb]
prices_pb = [c.select('a > p')[1].text for c in con_pb]
picts_pb = [c.find('img').get('src') for c in con_pb]
df_pb = pd.DataFrame({'(Pull&Bear) Item_Name': names_pb,
                      'Item_Price_EUR': prices_pb,
                      'Link_to_Pict': picts_pb})
It seems that the website uses JavaScript to render its content, meaning you can't just fetch the page and scrape it (requests doesn't execute JavaScript). That said, all of the data displayed on the website is sent in the form of a JSON string, so in order to get all the names of the items you could use the following code:
import requests
url = "https://www.pullandbear.com:443/itxrest/2/catalog/store/24009405/20309428/category/1030207088/product?languageId=-4&appId=1"
all_products = requests.get(url).json()["products"]
product_names = [item["bundleProductSummaries"][0]["name"] for item in all_products]
print(product_names)
Hope this helps.

How to export a CSV from the Google Sheets API?

I can't find any reference to an API that enables REST clients to export an existing Google Sheet to a CSV file.
https://developers.google.com/sheets/
I believe there should be a way to export them.
The following URL gives you the CSV of a single sheet of a Google spreadsheet. The spreadsheet must be accessible to the public, by anyone with the link (unlisted).
The parameters you need to provide are:
sheet ID (that is simply the ID in the URL of a Google Spreadsheet https://docs.google.com/spreadsheets/d/{{ID}}/edit)
sheet name (that is simply the name of the sheet as given by the user)
https://docs.google.com/spreadsheets/d/{{ID}}/gviz/tq?tqx=out:csv&sheet={{sheet_name}}
With that URL you can run a GET-request to fetch the CSV.
Or paste it in your browser address bar.
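For instance, here is a minimal Python sketch of that GET request; the sheet ID and sheet name are placeholders to substitute with your own values.
import pandas as pd

sheet_id = "YOUR_SHEET_ID"  # placeholder
sheet_name = "Sheet1"       # placeholder; URL-encode it if it contains spaces
url = (f"https://docs.google.com/spreadsheets/d/{sheet_id}"
       f"/gviz/tq?tqx=out:csv&sheet={sheet_name}")
df = pd.read_csv(url)  # the endpoint answers a plain GET with CSV
print(df.head())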
You can use the Drive API to do this today -- see https://developers.google.com/drive/v3/web/manage-downloads#downloading_google_documents, however that will limit you to the first sheet of the document. The Sheets API doesn't expose exporting as CSV today, but may offer it in the future.
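As a sketch of that Drive v3 export call, assuming you already have an OAuth 2.0 access token with a Drive scope (the file ID and token below are placeholders):
import requests

file_id = "YOUR_FILE_ID"            # placeholder
access_token = "YOUR_ACCESS_TOKEN"  # placeholder; obtain via your usual OAuth flow
resp = requests.get(
    f"https://www.googleapis.com/drive/v3/files/{file_id}/export",
    params={"mimeType": "text/csv"},
    headers={"Authorization": f"Bearer {access_token}"},
)
resp.raise_for_status()
with open("sheet.csv", "wb") as f:
    f.write(resp.content)  # note: this exports the first sheet only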
Nobody's mentioned gspread yet, so here's how I did it:
import gspread
import pandas as pd

gc = gspread.service_account()  # any gspread auth flow that yields a client works
sheet_id = "YOUR_SHEET_ID"      # placeholder
# open sheet
sheet = gc.open_by_key(sheet_id)
# select worksheet
worksheet = sheet.get_worksheet(0)
# download values into a dataframe
df = pd.DataFrame(worksheet.get_all_records())
# save dataframe as a csv, using the spreadsheet name
filename = sheet.title + '.csv'
df.to_csv(filename, index=False)
First you should make the document accessible to anyone with the link. Then copy its URL and extract the long ID (composed of upper- and lower-case letters and digits) from it. Then use this script.
#!/bin/bash
long_id="id_assigned_to_your_document"
g_id="number_assigned_to_card_in_google_sheet"
wget --output-document=temp.csv "https://docs.google.com/spreadsheets/d/$long_id/export?gid=$g_id&format=csv&id=$long_id"
If the document has only one sheet (tab), its number is: g_id="0"
The problem you will probably have is strange whitespace in the obtained file. I use this second script to process it:
#!/bin/bash
#Delete all lines beginning with a # from a file
#http://stackoverflow.com/questions/8206280/delete-all-lines-beginning-with-a-from-a-file
sed '/^#/ d' temp.csv |
# remove spaces
# http://stackoverflow.com/questions/9953448/how-to-remove-all-white-spaces-from-a-given-text-file
tr -d "[:blank:]" |
# regexp "1,2" into 1.2
# http://www.funtoo.org/Sed_by_Example,_Part_2
sed 's/\"\([−]\?[0-9]*\),\([0-9]*\)\"/\1.\2/g' > out.csv
Update
As Sam mentioned, the API is a better solution. There is now great documentation at:
https://developers.google.com/sheets/quickstart/php
with an example that generates output having a CSV structure.
If you don't have easy access to or familiarity with PHP, here's a very barebones Google Apps Script web app that, once deployed and the caller permission accepted, should allow clients with an appropriately scoped access token or API key to export an existing Google Sheet to a CSV file. It takes a Google Sheets spreadsheet ID and sheet name (and an optional download filename) as query parameters, and returns the corresponding, theoretically RFC 4180 compliant, CSV file.
Further instructions on deploying an Apps Script project as a web app are here: https://developers.google.com/apps-script/guides/web#deploying_a_script_as_a_web_app.
You can deploy it and test it out easily in the browser just by visiting the "Current web app URL" (as provided when you publish as web app from the script editor), and accepting the consent screen, or even just visit the one that I deployed (configured to execute as the accessing user, and unverified/scary consent) at the example URL.
The tricky part (as usual) is getting the OAuth token or API key set up, but if you're already calling the Google Sheets V4 API, you've probably already got that dialed in. I used CURL to make sure that it behaved as a REST api, but the technique I used to get an OAuth token there is both a distraction and frankly a little scary to include here since it's really easy to mess up. If you don't already have a way to get one, that's probably a good topic for a separate SO question in any case.
One related (and big!) caveat: I'm not 100% sure how the consent and verification interact with a pure Rest client (i.e. how that works if you DON'T visit this in the browser first...), and/or whether this script would need to be in the same GCP project as the other code that uses the Sheets API. If there's interest, and/or it doesn't work right out of the box, please let me know and I'll happily dig deeper and follow up.
// Example URL, assuming:
// "Current web app URL": https://script.google.com/a/tillerhq.com/macros/s/AKfycbyZlWAW6bpCpnFoPjbdjznDomFRbTNluG4siCBMgOy2qU2AGoA/exec
// spreadsheetId: 1xNDWJXOekpBBV2hPseQwCRR8Qs4LcLOcSLDadVqDA0E
// sheet name: Sheet1
// (optional) filename: mycsv.csv
//
// https://script.google.com/a/tillerhq.com/macros/s/AKfycbyZlWAW6bpCpnFoPjbdjznDomFRbTNluG4siCBMgOy2qU2AGoA/exec?spreadsheetid=1xNDWJXOekpBBV2hPseQwCRR8Qs4LcLOcSLDadVqDA0E&sheetname=Sheet1&filename=mycsv.csv
//
var REQUIRED_PARAMS = [
  'spreadsheetid', // example: "1xNDWJXOekpBBV2hPseQwCRR8Qs4LcLOcSLDadVqDA0E"
  'sheetname'      // Case-sensitive; example: "Sheet1"
];

// Returns an RFC 4180 compliant CSV for the specified sheet in the specified spreadsheet
function doGet(e) {
  REQUIRED_PARAMS.forEach(function(requiredParam) {
    if (!e.parameters[requiredParam]) throw new Error('Missing required parameter ' + requiredParam);
  });

  var spreadsheet = SpreadsheetApp.openById(e.parameters.spreadsheetid);
  var sheet = spreadsheet.getSheetByName(e.parameters.sheetname);
  if (!sheet) throw new Error("Could not find sheet " + e.parameters.sheetname + " in spreadsheet " + e.parameters.spreadsheetid);

  var filename = e.parameters.filename || (spreadsheet.getName() + "_" + e.parameters.sheetname + ".csv");
  var numRows = sheet.getLastRow();
  var numColumns = sheet.getLastColumn();
  var values = sheet.getSheetValues(1, 1, numRows, numColumns);

  function quote(s) {
    s = s.toString();
    if ((s.indexOf("\r") == -1)
        && (s.indexOf("\n") == -1)
        && (s.indexOf(",") == -1)
        && (s.indexOf("\"") == -1)) return s;

    // Fields containing line breaks (CRLF)*, double quotes, and commas should be enclosed in double-quotes;
    // anything other than that we already returned, so if we get here -- escape it and quote it.
    // *That's what the text of the RFC says, but the ABNF (...and Excel) treat EITHER CR or LF as requiring quotes.

    // Replace any double quote with a double double quote, and wrap the whole thing in quotes
    return "\"" + s.replace(/"/g, '""') + "\"";
  }

  var csv = values.map(function(row) {
    return row.map(quote).join();
  }).join("\r\n") + "\r\n";

  return ContentService
    .createTextOutput(csv)
    .setMimeType(ContentService.MimeType.CSV)
    .downloadAsFile(filename);
}

New Google Spreadsheets publish limitation

I am testing the new Google Spreadsheets as there is a new feature I really need: the 200 sheets limit has been lifted (more info here: https://support.google.com/drive/answer/3541068).
However, I can't publish a spreadsheet to CSV like you can in the old version. Under 'File > Publish to the web' there are no longer options to publish 'all sheets' or certain sheets, and you can't specify cell ranges to publish to CSV, etc.
This limitation is not mentioned in the published 'Unsupported Features' documentation found at: https://support.google.com/drive/answer/3543688
Is there some other way this gets enabled or has it in fact been left out of the new version?
My use case is: we retrieve BigQuery results into the spreadsheets, publish the sheets as CSV automatically using the "publish automatically on update" feature, and place the resulting CSV URL into charting tools that read it to generate the visuals.
Does anyone know how to do this?
The new Google Sheets use a different URL (just copy your <KEY>):
New sheet: https://docs.google.com/spreadsheets/d/<KEY>/pubhtml
CSV file: https://docs.google.com/spreadsheets/d/<KEY>/export?gid=<GID>&format=csv
The gid identifies the tab; the first sheet has gid 0.
/!\ You have to share your document using the "Anyone with the link" setting.
Here is the solution, just write it like this:
https://docs.google.com/spreadsheets/d/<KEY>/export?format=csv&id=<KEY>
I know it's weird to write the KEY twice, but it works perfectly. A teammate from work discovered this by opening the Excel file in Google Docs, then File -> Download as -> Comma separated values. A link to the CSV file then appears in the downloads section of the browser, like this:
https://docs.google.com/spreadsheets/d/<KEY>/export?format=csv&id=<KEY>&gid=<SOME NUMBER>
But it doesn't work in that format; what my teammate did was remove "&gid=<SOME NUMBER>", and it worked! Hope it helps everyone.
If you enable "Anyone with the link" sharing for the spreadsheet, here is a simple method to get a range of cells or columns (or whatever you feel like) exported as HTML, CSV, XML, or JSON via a query:
https://docs.google.com/spreadsheet/tq?key=YOUR-KEY&gid=1&tq=select%20A,%20B&tqx=reqId:1;out:html;%20responseHandler:webQuery
For tq variable read query language reference.
For tqx variable read request format reference.
The downside is that your doc is still available in full via the public link, but if you want to export/import data to, say, Excel, this is a perfect way.
It's not going to help everyone, but I've made a PHP script to read the HTML into an array.
I've added converting back to a CSV at the end. Hopefully this will help some people who have access to PHP.
$html_link = "https://docs.google.com/spreadsheets/d/XXXXXXXXXX/pubhtml";
$local_html = "sheets.html";

$file_contents = file_get_contents($html_link);
file_put_contents($local_html, $file_contents);

$dom = new DOMDocument();
$html = @$dom->loadHTMLFile($local_html); // '@' suppresses warnings - you might remove it when testing
$dom->preserveWhiteSpace = false;

$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
$cols = $rows->item(0)->getElementsByTagName('td'); // You'll need to edit the (0) to reflect the row that your headers are in.

$row_headers = array();
foreach ($cols as $i => $node) {
    if ($i > 0) $row_headers[] = $node->textContent;
}

$table = array();
foreach ($rows as $i => $row) {
    if ($i == 0) continue;
    $cols = $row->getElementsByTagName('td');
    $row = array();
    foreach ($cols as $j => $node) {
        $row[$row_headers[$j]] = $node->textContent;
    }
    $table[] = $row;
}

// Convert to csv
$csv = "";
foreach ($table as $row_index => $row_details) {
    $comma = false;
    foreach ($row_details as $value) {
        $value_quotes = str_replace('"', '""', $value);
        $csv .= ($comma ? "," : "") . (strpos($value, ",") === false ? $value_quotes : '"' . $value_quotes . '"');
        $comma = true;
    }
    $csv .= "\r\n";
}

// Save to a file and/or output
file_put_contents("result.csv", $csv);
print $csv;
Here is another temporary, non-PHP workaround:
Go to an existing new-style Google Sheet
Go to "File -> New -> Spreadsheet"
Under "File -> Publish to the web..." there is now the option to publish a CSV version
I believe this actually creates an old-style Google Sheet, but for my purposes (importing Google Sheet data from clients or myself into R for statistical analysis) it works until they hopefully update this feature.
I posted this in a Google Groups forum also, please find it here:
https://productforums.google.com/forum/#!topic/docs/An-nZtjaupU
The correct URL for downloading a Google spreadsheet as CSV is:
https://docs.google.com/spreadsheets/export?id=<ID>&exportFormat=csv
The current answers do not work any longer. The following has worked for me:
Go to File -> "Publish to the web", select "Start publishing" and the format. I chose text (which is TSV).
Now just copy the URL there, which will be similar to https://docs.google.com/spreadsheet/pub?key=YOUR_KEY&single=true&gid=0&output=txt
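Swapping output=txt for output=csv in that published URL should give you comma-separated output instead of TSV.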
That new feature appears to have disappeared. I don't see any option to publish a CSV/TSV version. I can download TSV/CSV with the export, but that's not available to other people with merely the link (it redirects them to a Google Docs sign-in form).
I found a fix! I discovered that old spreadsheets from before this change still allowed publishing only certain sheets. So I made a copy of an old spreadsheet, cleared the data out, copied and pasted my current info into it, and now I'm happily publishing just a single sheet of my large spreadsheet. Yay!
I was able to apply a query to the published result; see this table:
https://docs.google.com/spreadsheets/d/1LhGp12rwqosRHl-_N_N8eTjTwfFsHHIBHUFMMyhLaaY/gviz/tq?tq=select+A,B,I,J,K+where+B%3E=4.5&pli=1
The spreadsheet holds earthquake data, and I just want magnitude 4.5+ earthquakes, so the URL makes the query and selects the columns. Just one problem:
I cannot parse the result; I tried to decode it as JSON but was not able to.
I would like to get this as HTML or CSV. How can I parse it, for example to plot it on a Google Map?
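The reason it won't parse as JSON is that the gviz endpoint returns JSONP: the JSON payload is wrapped in a google.visualization.Query.setResponse(...) call. Either append &tqx=out:csv to the URL to get CSV directly, or strip the wrapper, as in this minimal Python sketch (the table layout shown matches what the endpoint returned at the time of writing; verify it against your own response):
import json

import requests

url = ("https://docs.google.com/spreadsheets/d/1LhGp12rwqosRHl-_N_N8eTjTwfFsHHIBHUFMMyhLaaY"
       "/gviz/tq?tq=select+A,B,I,J,K+where+B%3E=4.5")
raw = requests.get(url).text

# Response looks like: /*O_o*/ google.visualization.Query.setResponse({...});
# Strip everything outside the outermost braces to recover plain JSON.
payload = json.loads(raw[raw.find('{'):raw.rfind('}') + 1])

cols = [c['label'] for c in payload['table']['cols']]
rows = [[(cell or {}).get('v') for cell in r['c']] for r in payload['table']['rows']]
print(cols)
print(rows[:5])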
