Is it possible to write to the same Parquet folder from different processes in Python?
I use fastparquet.
It seems to work, but I'm wondering how the _metadata file can avoid conflicts if two processes write to it at the same time.
Also, to make it work I had to use ignore_divisions=True, which is not ideal for getting fast performance later when you read the Parquet file, right?
Dask consolidates the metadata from the separate processes, so that it only writes the _metadata file once the rest is complete, and this happens in a single thread.
If you were writing separate parquet files to a single folder using your own multiprocessing setup, each process would typically write its single data file and no _metadata at all. You could either gather the pieces the way Dask does, or consolidate the metadata from the data files after they were ready.
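If you go the consolidation route, a minimal sketch with fastparquet might look like this (the folder and file names, the worker function, and the availability of fastparquet.writer.merge in your installed version are assumptions on my part, not a prescription):

import glob
import os
from multiprocessing import Pool

import pandas as pd
import fastparquet


def write_part(args):
    # Hypothetical worker: each process writes its own data file into the
    # shared folder and never touches _metadata.
    i, df = args
    fastparquet.write(f"out_dir/part.{i}.parquet", df)


if __name__ == "__main__":
    os.makedirs("out_dir", exist_ok=True)
    frames = [pd.DataFrame({"x": range(i * 10, (i + 1) * 10)}) for i in range(4)]

    with Pool(4) as pool:
        pool.map(write_part, list(enumerate(frames)))

    # Single consolidation step once every data file exists: merge() reads
    # the footers of the individual files and writes a combined _metadata
    # (and _common_metadata) into the folder.
    fastparquet.writer.merge(sorted(glob.glob("out_dir/part.*.parquet")))

After the merge, the folder can be read as one dataset, so the per-process writers never need to touch the shared _metadata file at all.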
I have a table that contains more than a million records. I need to write this data to an Excel file.
The problem is that the process takes too much time and never completes. Maybe the process is using too much memory, or the Excel sheet limit has been reached.
The process works fine for smaller amounts of data (e.g. 10,000 records). I am using the WriteXLSX gem for writing the data.
Is there a way to write large volumes of records to an Excel file?
If you are running this Excel write in the same process your server is running in, then I strongly advise you to do it in a background job (Sidekiq or Delayed Job; there are a lot of background-job engines out there).
If it is a separate process, then check your background engine settings and bump the memory limit or the time-out limit; it all depends on what error you get.
I had to do something similar, but in reverse. I needed to create 600,000 objects based on a CSV file that I uploaded to S3.
The best way to do this is to send it to a background job using Sidekiq to avoid timeouts.
Is it possible to move a file in GCS after the Dataflow pipeline has finished running? If so, how? Should it be the last .apply? I can't imagine that being the case.
The case here is that we are importing a lot of CSVs from a client. We need to keep those CSVs indefinitely, so we either need to "mark the CSV as already handled", or alternatively move them out of the initial folder that TextIO uses to find them. The only thing I can currently think of is storing the file name (I'm not sure how I'd even get this, I'm a DF newbie) in BigQuery perhaps, and then excluding files that have already been stored from the execution pipeline somehow? But there has to be a better approach.
Is this possible? What should I check out?
Thanks for any help!
You can try using BlockingDataflowPipelineRunner and run arbitrary logic in your main program after p.run() (it will wait for the pipeline to finish).
See Specifying Execution Parameters, specifically the section "Blocking execution".
However, in general, it seems that you really want a continuously running pipeline that watches the directory with CSV files and imports new files as they appear, never importing the same file twice. This would be a great case for a streaming pipeline: you could write a custom UnboundedSource (see also Custom Sources and Sinks) that would watch a directory and return filenames in it (i.e. the T would probably be String or GcsPath):
p.apply(Read.from(new DirectoryWatcherSource(directory)))
.apply(ParDo.of(new ReadCSVFileByName()))
.apply(the rest of your pipeline)
where DirectoryWatcherSource is your UnboundedSource, and ReadCSVFileByName is also a transform you'll need to write that takes a file path and reads it as a CSV file, returning the records in it (unfortunately right now you cannot use transforms like TextIO.Read in the middle of a pipeline, only at the beginning - we're working on fixing this).
It may be somewhat tricky, and as I said we have some features in the works to make it a lot simpler and we're considering creating a built-in source like that, but it's possible that for now this would still be easier than "pinball jobs". Please give it a try and let us know at dataflow-feedback@google.com if anything is unclear!
Meanwhile, you can also store information about which files you have or haven't processed in Cloud Bigtable - it'd be a better fit for that than BigQuery, because it's more suited for random writes and lookups, while BigQuery is more suited for large bulk writes and queries over the full dataset.
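To illustrate the bookkeeping idea, here is a rough sketch of the lookup-then-mark pattern using the Python Bigtable client (the project, instance, table, and column-family names are made up, and this only shows the pattern; your actual pipeline stays in Java):

from google.cloud import bigtable

# Assumed identifiers; the "processed" column family must already exist.
client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("processed-files")


def already_processed(gcs_path):
    # Row key = file path; a single random read answers "did we import this?"
    return table.read_row(gcs_path.encode("utf-8")) is not None


def mark_processed(gcs_path):
    # Write one cell to record that the file has been handled.
    row = table.direct_row(gcs_path.encode("utf-8"))
    row.set_cell("processed", b"done", b"1")
    row.commit()


path = "gs://my-bucket/incoming/clients.csv"
if not already_processed(path):
    # ... trigger the import of this file ...
    mark_processed(path)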
I am planning an HDFS system that will host image files (a few MB to 200 MB) for a digital repository (Fedora Commons). I found from another Stack Overflow post that CombineFileInputFormat can be used to create input splits consisting of multiple input files. Can this approach be used for images or PDFs? Inside the map task, I want to process individual files in their entirety, i.e. process each image in the input split separately.
I'm aware of the small files problem, and it will not be an issue for my case.
I want to use CombineFileInputFormat for its benefits: avoiding mapper task setup/cleanup overhead and preserving data locality.
If you want to process images in Hadoop, I can only recommend using HIPI, which should allow you to do what you need.
Otherwise, when you say you want to process individual files in their entirety, I don't think you can do this with conventional input formats, because even with CombineFileInputFormat, you would have no guarantee that what's in your split is exactly 1 image.
An approach you could also consider is to have in input a file containing URLs/locations of your images (for example you could put them in Amazon S3), and make sure you have as many mappers as images, and then each map task would be able to process an individual image. I've done something similar not so long ago and it worked ok.
If I have a large number of files (n x 100K individual files), what would be the most efficient way to store them in the iOS file system (from the point of view of speed of access to a file by its path)? Should I dump them all into a single folder, or break them into a multilevel folder hierarchy?
Basically this breaks down into three questions:
does file access time depend on the number of "sibling" files? (I think the answer is yes; if I am correct, file names are organized in a B-tree, so it should be O(log n))
how expensive is traversing from one folder to another along the path? (is it something like m * O(log n_m), where m is the number of components in the path and n_m is the number of "siblings" at each path component?)
what gets cached at the file system level that could make the above assumptions incorrect?
It would be great if someone with direct experience of this kind of problem could share some real-life results.
Your comments will be highly appreciated.
This seems like it might provide relevant, hard data:
File System vs Core Data: the image cache test
http://biasedbit.com/blog/filesystem-vs-coredata-image-cache
Conclusion:
File system cache is, as expected, faster. Core Data falls shortly behind when storing (marginally slower) but load times are way higher when performing single random accesses.
For such a simple case Core Data functionality really doesn't pay up, so stick to the file system version.
I think you should store everything in one folder and create a hash table with key (file name) and value (source path) pairs. With a hash table, lookups are constant time, O(1), which will speed up your process as well.
The file system is not an optimal database. With that many thousands of files, you should consider using Core Data or another database instead to store the name and contents of each file.
My use case is that there are updates coming from multiple sources and I have to store the sum of all updates. One way is to create a separate RRD file for each source and run a cron job that stores the sum in an aggregate RRD file.
I was wondering if there is a way (using rrdcached perhaps?) for all sources to update this single RRD file, with all updates inside the same step getting summed together and stored in the RRD.
Please let me know if this is possible.
Thanks.
Well, you could log to a single RRD file using the ABSOLUTE data source type ...
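As a small illustration of that suggestion, here is a sketch using the Python rrdtool bindings (the step length, heartbeat, and DS/RRA definitions are arbitrary choices of mine). With ABSOLUTE, each update is interpreted as an amount accumulated since the previous update, so updates from different sources landing in the same step all feed into that step's consolidated rate.

import time

import rrdtool

# One shared RRD: 60-second step, a single ABSOLUTE data source,
# and an AVERAGE archive keeping one day of per-step samples.
rrdtool.create(
    "aggregate.rrd",
    "--step", "60",
    "DS:updates:ABSOLUTE:120:0:U",
    "RRA:AVERAGE:0.5:1:1440",
)

# Each source submits its own amount; values arriving within the same
# 60-second step are combined into that step's stored rate.
now = int(time.time())
rrdtool.update("aggregate.rrd", f"{now}:5")
rrdtool.update("aggregate.rrd", f"{now + 1}:3")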