Dask locality: how to read a local file from each worker?

I'm trying to read a unique local file from each worker, but I get the same result across all the workers instead of a unique result from each one. Can someone please point out what I'm doing wrong?
from dask.distributed import Client, progress
c = Client()
c
import dask.dataframe as dd
filename_1 = '/tmp/1990.csv'
filename_2 = '/tmp/1991.csv'
filename_3 = '/tmp/1992.csv'
future_1 = c.submit(dd.read_csv, filename_1, workers='172.18.0.3')
future_2 = c.submit(dd.read_csv, filename_2, workers='172.18.0.5')
future_3 = c.submit(dd.read_csv, filename_3, workers='172.18.0.6')
future_1.result().head()
future_2.result().head()
future_3.result().head()
Instead of unique data from each worker, I get the same result from all of them.

You probably want to use pandas.read_csv here rather than dask.dataframe.read_csv
https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections
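A minimal sketch of that suggestion, assuming the same file paths and worker addresses as in the question; pandas reads each file on the worker that holds it, and each future then resolves to an ordinary DataFrame rather than a lazy dask collection:

import pandas as pd
from dask.distributed import Client

c = Client()

# Submit pandas.read_csv so each read runs on the worker that has the file locally.
future_1 = c.submit(pd.read_csv, '/tmp/1990.csv', workers='172.18.0.3')
future_2 = c.submit(pd.read_csv, '/tmp/1991.csv', workers='172.18.0.5')
future_3 = c.submit(pd.read_csv, '/tmp/1992.csv', workers='172.18.0.6')

# Each result is the DataFrame read from that worker's local file.
print(future_1.result().head())
print(future_2.result().head())
print(future_3.result().head())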

Related

How to get Twitter mentions id using academictwitteR package?

I am trying to create several network analyses from Twitter data. To get the data, I used the academictwitteR package and its get_all_tweets command.
get_all_tweets(
  users = c("LegaSalvini"),
  start_tweets = "2007-01-01T00:00:00Z",
  end_tweets = "2022-07-01T00:00:00Z",
  file = "tweets_lega",
  data_path = "tweetslega/",
  bind_tweets = FALSE
)
## Binding JSON files into data.frame objects
tweets_bind_lega <- bind_tweets(data_path = "tweetslega/")
## Tidying
tweets_bind_lega_tidy <- bind_tweets(data_path = "tweetslega/", output_format = "tidy")
With this, I can easily access the IDs needed to build a retweet and reply network. However, the tidy format does not provide a tidy column for the mentions; instead it drops them.
They are still in my untidy df tweets_bind_lega, stored as a list in tweets_bind_afd$entities$mentions. Now I would like to somehow unnest this list and create a tidy df with a column that contains the mentioned Twitter user IDs.
Has anyone created a mention network with academictwitteR before and can help me out?
Thanks!

Web Scraping - BeautifulSoup parsers do not seem to work

I am trying to extract the names of a few items from the URL below. The node and class_ point to the right content, but when I use find_all I do not get back any results. From previous posts it seems this problem might be connected to using the wrong parser. I have used xml, lxml and others, but nothing seems to work.
Is anyone able to extract the content successfully?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import html5lib
import urllib3
url_pb = 'https://www.pullandbear.com/it/uomo/accessori/zaini-c1030207088.html'
req_pb = requests.get(url_pb)
pars_pb = BeautifulSoup(req_pb.content, 'html.parser')
con_pb = pars_pb.find_all('div', class_='name namorio')
UPDATE
I have managed to find the info I needed, hidden in another section of the same page source available to inspection. I extracted it using this code:
url_pb = 'https://www.pullandbear.com/it/uomo/accessori/zaini-c1030207088.html'
req_pb = requests.get(url_pb)
pars_pb = BeautifulSoup(req_pb.content, 'html.parser')
con_pb = pars_pb.find_all('li', class_=False)
names_pb = [c.select("a > p")[0].text for c in con_pb]
prices_pb = [c.select('a > p')[1].text for c in con_pb]
picts_pb = [c.find('img').get('src') for c in con_pb]
df_pb = pd.DataFrame({'(Pull&Bear) Item_Name': names_pb,
                      'Item_Price_EUR': prices_pb,
                      'Link_to_Pict': picts_pb})
It seems that the website uses JavaScript to display its content, meaning you can't just fetch the page and scrape the rendered HTML (requests doesn't execute JavaScript). That said, all of the data displayed on the website is sent as a JSON string, so to get all the names of the items you could use the following code:
import requests
url = "https://www.pullandbear.com:443/itxrest/2/catalog/store/24009405/20309428/category/1030207088/product?languageId=-4&appId=1"
all_products = requests.get(url).json()["products"]
product_names = [item["bundleProductSummaries"][0]["name"] for item in all_products]
print(product_names)
Hope this helps!
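If you also want the prices and image links from the original attempt, the payload's key names beyond "products" and "bundleProductSummaries" aren't shown here, so a reasonable first step is to dump one product entry and see which fields it actually contains (a sketch, not a confirmed schema):

import json
import requests

url = "https://www.pullandbear.com:443/itxrest/2/catalog/store/24009405/20309428/category/1030207088/product?languageId=-4&appId=1"
all_products = requests.get(url).json()["products"]

# Inspect the structure of the first product to discover the available fields;
# the exact keys for price and images are assumptions until verified this way.
print(json.dumps(all_products[0], indent=2)[:2000])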

How to retrieve an XMLTYPE data that contains special characters?

I want to retrieve XMLTYPE data from an Oracle table using cx_oracle.
The data looks like this:
<infos>
<Comment/>
<Observation>àéèç</Observation>
<Level>L3</Level>
<Duration/>
<Cause/>
<Depot> Haren </Depot>
<Resolution/>
</infos>
Here's my code:
#!/usr/bin/python
from __future__ import print_function
import cx_Oracle
# Connection to RTDIAG
try:
    dsn_test = cx_Oracle.makedsn(host='xxxxx', port='1521', service_name='xxxxx')
    con_test = cx_Oracle.connect(user='xxxx', password='xxxxx', dsn=xxxx)
except cx_Oracle.InterfaceError:
    print("Impossible to connect to the DB!")
    print("***exit script***")
    quit()
ID_record = 1729
cursor = con_test.cursor()
query = """select a.content.getClobVal() from emb_log a where ID = :id and uncompleted_record=1
"""
cursor.execute(query,id=1729)
xml_retrieved = cursor.fetchone()[0].read() #string
print (xml_retrieved)
Here's what I get
<infos>
<Comment/>
<Observation>aeec</Observation>
<Level>L3</Level>
<Duration/>
<Cause/>
<Depot> Haren </Depot>
<Resolution/>
</infos>
The special characters contained within the XML children are not being retrieved properly; they are converted into ASCII-like characters.
Why, and how can I fetch the XML exactly the way it appears in the DB?
Thank you.
Set your NLS environment. You will probably find it easiest to use the encoding option when you connect. For performance, you will want to fetch the CLOB via an OutputTypeHandler.
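A sketch of both suggestions, assuming cx_Oracle 7+ and the same (redacted) connection details as in the question; the encoding/nencoding arguments make character data come back as UTF-8, and the output type handler fetches the CLOB directly as a string instead of a LOB locator:

import cx_Oracle

def output_type_handler(cursor, name, default_type, size, precision, scale):
    # Fetch CLOB columns as Python strings instead of cx_Oracle.LOB objects.
    if default_type == cx_Oracle.CLOB:
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)

dsn_test = cx_Oracle.makedsn(host='xxxxx', port='1521', service_name='xxxxx')
con_test = cx_Oracle.connect(user='xxxx', password='xxxxx', dsn=dsn_test,
                             encoding="UTF-8", nencoding="UTF-8")
con_test.outputtypehandler = output_type_handler

cursor = con_test.cursor()
cursor.execute("select a.content.getClobVal() from emb_log a "
               "where ID = :id and uncompleted_record = 1", id=1729)
xml_retrieved = cursor.fetchone()[0]  # already a str, no .read() needed
print(xml_retrieved)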

How broadcast variables are used in dask parallelization

I have some code that applies a map function to a dask bag. I need a lookup dictionary to apply that function, and it doesn't work with client.scatter.
I don't know if I am doing the right thing, because the workers start but they don't do anything. I have tried different configurations based on different examples, but I can't get it to work. Any support would be appreciated.
I know that in Spark you define a broadcast variable and you access its content with variable.value inside the function you want to apply. I don't see the same thing in dask.
# Function to map
def transform_contacts_add_to_historic_sin(data, historic_dict):
    raw_buffer = ''
    line = json.loads(data)
    if line['timestamp'] > historic_dict['timestamp']:
        raw_buffer = raw_buffer + line['vid']
    return raw_buffer
# main program
# historic_dict is a dictionary previously filled, which is the lookup variable for map function
# file_records will be a list of json.dump getting from a S3 file
from distributed import Client
import dask.bag as db
import json
from io import TextIOWrapper
client = Client()
historic_dict_scattered = client.scatter(historic_dict, broadcast=True)
file_records = []
raw_data = s3_procedure.read_raw_file(... S3 file.......)
data = TextIOWrapper(raw_data)
for line in data:
    file_records.append(line)
bag_chunk = db.from_sequence(file_records, npartitions=16)
bag_transform = bag_chunk.map(lambda x: transform_contacts_add_to_historic(x), args=[historic_dict_scattered])
bag_transform.compute()
If your dictionary is small you can just include it directly
def func(partition, d):
    return ...
my_dict = {...}
b = b.map(func, d=my_dict)
If it's large then you might want to wrap it up in Dask delayed first
my_dict = dask.delayed(my_dict)
b = b.map(func, d=my_dict)
If it's very large then yes, you might want to scatter it first (though I would avoid this if things work out with either of the approaches above).
[my_dict] = client.scatter([my_dict])
b = b.map(func, d=my_dict)
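Applied back to the original code, the lookup dictionary is passed to Bag.map as an extra argument rather than through args= (a sketch that reuses the transform_contacts_add_to_historic_sin function, historic_dict and file_records from the question):

import dask
import dask.bag as db

# Wrap the lookup dictionary once so it is embedded in the graph a single time
# instead of once per task.
historic_dict_delayed = dask.delayed(historic_dict)

bag_chunk = db.from_sequence(file_records, npartitions=16)

# Extra arguments to Bag.map are forwarded to the mapped function,
# so each element is called as func(element, historic_dict).
bag_transform = bag_chunk.map(transform_contacts_add_to_historic_sin,
                              historic_dict_delayed)
result = bag_transform.compute()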

TypeError when attempting to parse pubmed EFetch

I'm new to this Python/Biopython stuff, so I am struggling to work out why the following code, pretty much lifted straight out of the Biopython Cookbook, isn't doing what I'd expect.
I'd have thought it would end up with the interpreter displaying two lists containing the same numbers, but all I get is one list and then a message saying TypeError: 'generator' object is not subscriptable.
I'm guessing something is going wrong with the Medline.parse step and the result of the efetch isn't being processed in a way that allows subsequent iteration to extract the PMID values. Or the efetch isn't returning anything.
Any pointers as to what I'm doing wrong?
Thanks
from Bio import Medline
from Bio import Entrez
Entrez.email = 'A.N.Other@example.com'
handle = Entrez.esearch(db="pubmed", term="biopython")
record = Entrez.read(handle)
print(record['IdList'])
items = record['IdList']
handle2 = Entrez.efetch(db="pubmed", id=items, rettype="medline", retmode="text")
records = Medline.parse(handle2)
for r in records:
    print(records['PMID'])
You're trying to print records['PMID'] which is a generator. I think you meant to do print(r['PMID']) which will print the 'PMID' entry in the current record dictionary object for each iteration. This is confirmed by the example given in the Bio.Medline.parse() documentation.
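For completeness, here is the corrected loop in context (same search as in the question):

from Bio import Entrez, Medline

Entrez.email = 'A.N.Other@example.com'

handle = Entrez.esearch(db="pubmed", term="biopython")
record = Entrez.read(handle)
items = record['IdList']

handle2 = Entrez.efetch(db="pubmed", id=items, rettype="medline", retmode="text")
records = Medline.parse(handle2)

# Index each record dictionary, not the generator itself.
for r in records:
    print(r['PMID'])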
