I was wondering if anyone knew the proper way to write out a group of files based on the value of a column in Dask. In other words, I want to group rows by the value of a column and write each group out to its own CSV. I've been trying to use the groupby-apply paradigm with Dask, but the problem is that the applied function does not receive a dask.dataframe object, so it has to use the Pandas API.
Is there a better way to approach what I'm trying to do? A scalable solution would be much appreciated because some of the data that I'm dealing with is very large.
Thanks!
If you were saving to Parquet, then the partition_on kwarg would be useful. If you are saving to CSV, then it's possible to do something similar with a rough sketch like this:
def save_partition(df, partition_info=None):
    # partition_info is None while Dask infers metadata on an empty frame
    if partition_info is None:
        return 0
    for group_label, group_df in df.groupby('some_col'):
        csv_name = f"{group_label}_partition_{partition_info['number']}.csv"
        group_df.to_csv(csv_name)
    return 0

delayed_save = ddf.map_partitions(save_partition, meta=(None, 'int64'))
The delayed_save can then be computed when convenient.
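For the Parquet route mentioned first, a minimal sketch (assuming a dask.dataframe named ddf and a grouping column named some_col):

# writes one 'some_col=value/' subdirectory per distinct value (Hive-style partitioning)
ddf.to_parquet("output_dir/", partition_on=["some_col"])

Each group can then be read back selectively with dd.read_parquet and a filter on some_col.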
I have created a dataframe with a non-sorted index in pandas and saved it to Parquet. Later, when I load it with Dask, how do I sort the index? Do I have to do something like
pdf.reset_index().set_index(idx)?
As far as I am aware, the answer is yes, your approach is correct. For example, searching for "sort_index" in Dask issues does not really yield any relevant results.
Keep in mind that sorting out-of-core is quite a difficult operation. It's possible you might get more stable results (or even better performance) in Pandas if your dataset fits in memory.
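If you do go the Dask route, a minimal sketch of that pattern (assuming the saved index column is called idx and the data lives in data.parquet):

import dask.dataframe as dd

ddf = dd.read_parquet("data.parquet")
# set_index shuffles and sorts the data by the chosen column, which can be expensive out-of-core
ddf = ddf.reset_index().set_index("idx")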
I have a data structure in the form of a tree, so it has vertices within vertices. Neo4j would be a perfect match, but alas, someone has made the decision that a property cannot be a dictionary/map.
I find this strange. Neo4j is all about vertices, so why not accept tree-shaped data?
It would seem so intuitive.
I guess it must be for a good reason. Can it be difficult to manage updates? Or handling memory?
Does anyone know?
And does anyone know an alternative to Neo4j that can store a tree structure? Or maybe an add-on or something that handles that?
The presence of a map in your properties implies that the data structure is not fully converted to a graph. The node (:N {p: map}) implies the structure (:N)-->(:P {map}). With the former structure you'd need to query items in the map using something like match (n:N) where n.p.k = v, which I imagine would be a nightmare for indexing, etc. With the latter you can simply match (:N)-->(p:P) where p.k = v.
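As a rough illustration of that modelling with the official Neo4j Python driver (the HAS_PROPS relationship type, the URI, and the credentials are placeholders):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # store what would have been a map property as a separate :P node
    session.run(
        "CREATE (n:N {name: $name})-[:HAS_PROPS]->(p:P) SET p += $map",
        name="root", map={"k": "v"},
    )
    # query items of the former map through the child node
    result = session.run("MATCH (:N)-->(p:P) WHERE p.k = $v RETURN p", v="v")
    print([record["p"] for record in result])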
I just want your help with an issue: how do I find out whether there are missing values, especially in big data sets, i.e. which columns have missing values and which do not?
This depends entirely on how the dataset is stored (if it's at rest as a disk file) or what interface it is accessible through (SQL, graph query, etc.).
If it's a "plain file" like CSV, HDF, or an Octave/Matlab matrix, then use whatever scripting tool you're comfortable with to iterate over the rows and check for missing values. If it's an SQL dump, you can load it into SQLite or SQL Server and select for missing values. You could even use an SQL parser to report missing values directly from the SQL dump, since there's really no need to persist it into a database.
If it's live data behind an API, you can use the API to query the data for missing values – if the API supports such queries. Otherwise, use the API to export (dump) the entire data set and query it at rest as in the preceding paragraph. If the dataset doesn't have indices that allow finding missing data, expect the query to take a long time and possibly affect the performance of the service that provides the data – act with care and understand the exact consequences of what you're about to do.
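For the "plain file" case, a minimal sketch that streams a CSV and counts empty cells per column (the file name is a placeholder):

import csv

missing = {}
with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        for col, value in row.items():
            # short rows yield None for trailing fields; treat those and empty strings as missing
            if value is None or value.strip() == "":
                missing[col] = missing.get(col, 0) + 1
print(missing)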
This gives the number of missing values in each column. Use your own pandas DataFrame in place of train:
train.isnull().sum()
Otherwise you can use train.info() or train.describe() for a fuller summary of the data; the non-null counts they report also reveal the missing values in each column.
Number of missing values for the entire DataFrame: train.isnull().sum().sum()
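To list only the columns that actually contain missing values, one option (with a small illustrative DataFrame standing in for train):

import numpy as np
import pandas as pd

# small illustrative DataFrame standing in for `train`
train = pd.DataFrame({"a": [1, np.nan, 3], "b": ["x", "y", None], "c": [1, 2, 3]})

missing_per_column = train.isnull().sum()
# keep only the columns with at least one missing value
print(missing_per_column[missing_per_column > 0])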
I have data on candidates who are looking for a job. The original data I got was a complete mess, but I managed to enhance it. Now I am facing an issue which I am not able to resolve.
One candidate record looks like this:
https://i.imgur.com/LAPAIbX.png
Since most ML algorithms cannot work directly with categorical data, I want to encode it. My goal is to have a candidate record looking like this:
https://i.imgur.com/zzsiDzy.png
What I need is to add a new column for each possible value that exists in Knowledge1, Knowledge2, Knowledge3, Knowledge4, Tag1, and Tag2 of the original data, but without repetition. The way I tried gives me far more attributes than I need, which results in an inaccurate model: it creates attributes such as Jscript_Knowledge1, Jscript_Knowledge2, Jscript_Knowledge3 and so on, for each possible option.
If the explanation is not clear enough please let me know so that I could explain it further.
Thanks and any help is highly appreciated.
Cheers!
I have some understanding of your problem based on your explanation, and I will try to elaborate how I would approach it. If that does not solve your problem, I may need more explanation to understand it. Let's get started.
1. For all the candidate data that you have, collect a master skill/knowledge list.
2. This list becomes your columns.
3. For each candidate, if they have a given skill, that column becomes 1 for their record; otherwise it stays 0.
This is the essence of one-hot encoding; however, since the same skill is scattered across multiple columns, you are struggling to encode it automatically.
An alternative approach could be:
1. For each candidate, collect all the knowledge skills as a list and assign it to one column for knowledge, and the tags as another list in a second column, instead of the current 4 (Knowledge) + 2 (Tag) columns.
2. Sort the knowledge (and tag) list alphabetically within this column.
3. Automatic one-hot encoding after this may yield fewer columns than before.
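A rough sketch of that alternative in pandas (the column names and values are hypothetical, mirroring the question; a real dataset would have more columns and rows):

import pandas as pd

# hypothetical candidate records with the same layout as in the question
df = pd.DataFrame({
    "Knowledge1": ["Jscript", "Python"],
    "Knowledge2": ["SQL", "Jscript"],
    "Tag1": ["junior", "senior"],
})

# collect every skill/tag into one de-duplicated, sorted list per candidate
skill_cols = [c for c in df.columns if c.startswith(("Knowledge", "Tag"))]
skills = df[skill_cols].apply(lambda row: sorted(set(row.dropna())), axis=1)

# one indicator column per distinct skill/tag, i.e. one-hot encoding without repetition
encoded = pd.get_dummies(skills.explode()).groupby(level=0).max()
print(df.join(encoded))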
Hope this helps!
How does one get records -- by value -- in a more efficient way?
Currently I’m doing this:
Coupon = [P || P <- kvs:all(company_coupon), P#company_coupon.company_id == C#company.id],
My question is geared at kvs:all(...). In databases it is usually pretty expensive to get all entries first and then match them.
Is there a better way?
PS: "lists:keyfind" also needs to be given ALL the records first, in order to run them through the loop.
How are you guys doing it?
Cheers!
One can use kvs:index(table, field, value) if one has set the field as a key before:
#table{name = user, fields = record_info(fields, user), keys = [field]}
When you are using a functional language like Erlang or Lisp, traversing the data is unavoidable in most cases, whereas SQL does not need it. So it is better to do it with SQL if you are storing the data in a database like PostgreSQL that supports it, but if you don't need to persist the data, you are on the right track.