How can I efficiently add a new Avro record to an existing Avro file? The file will keep growing in size, and I don't want to load it into memory. Could you please tell me how this can be achieved efficiently?
You can use DataFileWriter.appendTo. This won't load the existing file's contents in memory. (Under the hood, it will read the beginning of the file to find the schema and other metadata, then append to the end without loading what's in between.)
If you'd like to do this on HDFS, this gist might be a good place to start as well.
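For a plain local file, a minimal sketch of the appendTo pattern looks like this (the file name and the "id" field are examples; any record you append must use the schema already stored in the file):
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        File avroFile = new File("events.avro");   // existing Avro container file

        // Grab the schema already stored in the file so new records match it.
        Schema schema;
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
            schema = reader.getSchema();
        }

        // appendTo reads only the header, then seeks to the end of the file.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>())) {
            writer.appendTo(avroFile);
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", 42L);                 // hypothetical field
            writer.append(record);
        }
    }
}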
Your question and the related technology are ambiguous, but I will attempt to answer as best I understand it.
I will assume that you are doing this in HDFS.
Data in directories vs. files:
In HDFS, you can think in terms of directories rather than files.
The tools in the Hadoop ecosystem, e.g. Hive or Spark, let you read "data" from a directory without regard to the number of files stored in it.
This way, you add files to the directory and your "queries" progressively display or fetch an increasing amount of data.
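For example, with Spark's Java API (the HDFS path below is hypothetical, and the spark-avro module has to be on the classpath), you read the directory and automatically pick up however many files are currently in it:
SparkSession spark = SparkSession.builder().appName("read-avro-dir").getOrCreate(); // org.apache.spark.sql.SparkSession
Dataset<Row> events = spark.read().format("avro").load("hdfs:///data/events/");     // reads every Avro file in the directory
System.out.println(events.count());                                                 // the count grows as new files are added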
Avro, record-based:
I'd think of an Avro file as an Avro record. Let's say you have an Avro schema, you generate an object in your program, and you convert it to Avro format. That object translates to one Avro record. If you write that data to a file, the file holds one Avro record. Over the course of 10 days, if you write 10 files to the same directory, you will have 10 records when you read the "directory".
Immutability:
Generally, I'd think of HDFS data as immutable: any file written is meant to be read rather than modified. The same applies to an Avro record, which is nothing but a file with a schema and data; i.e. you will generally never read the same file and modify it. I am assuming you will be adding new data rather than modifying existing data, so you will just be creating new records.
Serializing multiple objects to one file:
Now let's consider that you truly want to write multiple objects to one file.
I will assume that you actually have these multiple objects in hand in your code at a given point in time and want to persist them to a single file.
If you use jackson-dataformat-avro, it provides a SequenceWriter to do just that (MyObject below stands for whatever POJO your schema maps to):
AvroMapper mapper = new AvroMapper();                   // from jackson-dataformat-avro
AvroSchema schema = mapper.schemaFor(MyObject.class);   // or load an existing schema
SequenceWriter w = mapper.writer(schema).writeValues(mySingleAvroFile);
w.write(firstObject);
w.write(secondObject);
...
w.close();
I want to create a 3D dask array from data that I have that is already chunked. My data consists of 216 blocks of 1024x1024x1024 uint8 voxels, each stored as a compressed HDF5 file with one key called data. Compressed, my data is only a few megabytes per block, but decompressed, it takes 1 GB per block. Furthermore, my data is currently stored in Google Cloud Storage (GCS), although I could potentially mirror it locally inside a container.
I thought the easiest way would be to use zarr, following these instructions (https://pangeo.io/data.html). Would xarray have to decompress my data before saving to zarr format? Would it have to shuffle data and try to communicate across blocks? Is there a lower level way of assembling a zarr from hdf5 blocks?
There are a few questions there, so I will try to be brief and hope that some edits can flesh out details I may have omitted.
You do not need to do anything in order to view your data as a single dask array, since you can reference the individual chunks as arrays (see here) and then use the stack/concatenate functions to build up into a single array. That does mean opening every file in the client in order to read the metadata, though.
Similarly, xarray has some functions for reading sets of files, where you should be able to assume consistency of dtype and dimensionality - please see their docs.
As far as zarr is concerned, you could use dask to create the set of files for you on GCS or not, and choose to use the same chunking scheme as the input - then there will be no shuffling. Since zarr is very simple to set up and understand, you could even create the zarr dataset yourself and write the chunks one-by-one without having to create the dask array up front from the zarr files. That would normally be via the zarr API, and writing a chunk of data does not require any change to the metadata file, so can be done in parallel. In theory, you could simply copy a block in, if you understood the low-level data representation (e.g., int64 in C-array layout); however, I don't know how likely it is that the exact same compression mechanism will be available in both the original hdf and zarr (see here).
I'm reaching out hoping to find answers about a Pentaho Data Integration limitation.
I'm currently working on a 1-to-1 data source integration and would like to make it n-to-1-n. This requires dynamic job creation, and I would like to know if any of you have come across such an issue. My 1-to-1 integration works perfectly; it integrates different data source types (CSV, databases such as MySQL, Oracle, ...) into the same destination, and I need to make it n-to-1-n.
There is a Metadata Injection Step just for that.
A use case similar to yours is described by Diethard here.
Because it seems that you have a lot of different source formats, it may be a good investment to read the use case of Jens, the author of the step, here, which (apart from the automation) is precisely your case.
AFAIK, in Pentaho DI it is not possible to create dynamic transformations for arbitrary data sources. PDI expects the input columns to be available in the input stream before it loads the data to the target database. For example, if you are using one data source (in MySQL) and loading it to a CSV output, the CSV output step expects the input columns to be present in the data source step (Table input). If you are trying to load n arbitrary data sources, you need to define the input columns/fields for each of them individually.
Alternatively, there are a few things you can explore:
1. Fast Dump in Text File Output step:
There is a fast data dump option in the Text file output step. With it you don't need to define any output columns; the input fields are dumped as-is, without formatting. You can use this to map all of the input sources to a CSV format and then load them to their targets.
2. Extending Java and Kettle together to build a solution:
PDI allows you to write custom Java code on top of Kettle; you can check this blog for more. You can use this idea to write custom code that passes each of the n data sources' fields to Kettle as parameters and executes the transformations (note: I haven't tried this myself, just thinking out loud). A rough sketch follows below.
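For reference, here is a rough sketch of driving a transformation from the Kettle Java API (the .ktr path and the variable names are hypothetical; check the class names against your PDI version):
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        // A generic transformation that is parameterised per data source.
        TransMeta transMeta = new TransMeta("/etl/generic_load.ktr");
        Trans trans = new Trans(transMeta);
        trans.setVariable("SOURCE_CONNECTION", "mysql_src");   // injected per source
        trans.setVariable("TARGET_TABLE", "stg_orders");
        trans.execute(null);
        trans.waitUntilFinished();
        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation failed with " + trans.getErrors() + " error(s)");
        }
    }
}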
Hope this helps :)
I have a BigQuery table where each row represents a text file (gs://...) and a line number.
file, line, meta
file1.txt, 10, meta1
file2.txt, 12, meta2
file1.txt, 198, meta3
Each file is about 1.5 GB and there are about 1k files in my bucket. My goal is to extract the lines specified in the BQ table.
I decided to implement the following plan:
Map table => KV<file,line>
Reduce KV<file,line> => KV<file, [lines]>
Map KV<file, [lines]> => [KV<file, rowData>]
where rowData means the actual data from the file at one of the lines in lines.
If I read the docs and SO carefully, TextIO.Read isn't meant to be used in such conditions. As a workaround I can use GcsIoChannelFactory to read files from GCS. Is that correct? Is it the preferable approach for the described task?
Yes, your approach is correct. There is currently no better approach to reading lines with line numbers from text files, except for doing it yourself using GcsIoChannelFactory (or writing a custom FileBasedSource, but this is more complex, and wouldn't work in your case because the filenames are not known in advance).
This and other similar scenarios will get much better with Splittable DoFn - work on that is in progress, but it is a large amount of work, so no timeline yet.
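For reference, a rough sketch of the "do it yourself" step (Dataflow 1.x-era APIs; the IOChannelUtils/IOChannelFactory wiring here is an assumption, so check which utilities your SDK version exposes):
import java.io.BufferedReader;
import java.nio.channels.Channels;
import java.util.HashSet;
import java.util.Set;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.util.IOChannelUtils;
import com.google.cloud.dataflow.sdk.values.KV;

class ReadSelectedLines extends DoFn<KV<String, Iterable<Long>>, KV<String, String>> {
    @Override
    public void processElement(ProcessContext c) throws Exception {
        String file = c.element().getKey();                 // e.g. gs://bucket/file1.txt
        Set<Long> wanted = new HashSet<>();
        for (Long wantedLine : c.element().getValue()) {
            wanted.add(wantedLine);
        }
        // Stream the file and keep only the requested line numbers.
        try (BufferedReader reader = new BufferedReader(Channels.newReader(
                IOChannelUtils.getFactory(file).open(file), "UTF-8"))) {
            String line;
            long lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (wanted.contains(lineNumber)) {
                    c.output(KV.of(file, line));
                }
            }
        }
    }
}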
In our app we have a table called support_files which stores documents that have been uploaded, mostly PDFs.
I'd like to get a unique list of these files, often the same file is uploaded more than once. I thought that a way to do this would be to add a column to the database called "checksum", and then, for each file, calculate the checksum somehow and store it in the column. (This is obviously the slow part).
Once this is done then I can easily filter out duplicates from my table by examining the checksum column.
Can anyone recommend a method to generate this checksum/hash/whatever? Ideally I'd like to generate a hash/checksum that's large enough to guarantee uniqueness, but small enough to fit into a string field in my database.
My server's running on Ubuntu server, and the total number of files I need to checksum is currently around 12,000. For the sake of argument assume it won't grow over 100,000.
A bit of Googling reveals sha1sum, but this may be more suited to telling if a file has been accidentally changed rather than if two files are different?
Take a look at Digest::SHA256; it can interface directly with files and works great.
From the referenced documentation:
p Digest::SHA256.file("X11R6.8.2-src.tar.bz2").hexdigest
# => "f02e3c85572dc9ad7cb77c2a638e3be24cc1b5bea9fdbb0b0299c9668475c534"
I'm trying to do the following in Hadoop MapReduce (written in Java, on Linux).
Text files 'rules-1' and 'rules-2' (3 GB total) contain rules, one rule per line, so the files can be read using the readLine() function.
The files 'rules-1' and 'rules-2' need to be imported as a whole from HDFS in every map function in my cluster, i.e. these files are not splittable across different map functions.
The input to the mapper's map function is a text file called 'record' (each line terminated by a newline character), from which we get the (key, value) pairs. This file is splittable and can be given as input to the different map functions used in the whole MapReduce job.
What needs to be done is to compare each value (i.e. each line from the record file) with the rules inside 'rules-1' and 'rules-2'.
The problem is, if I pull each line of the rules-1 and rules-2 files into a static ArrayList only once, so that each mapper can share the same ArrayList and compare its elements against each input value from the record file, I get an out-of-memory error, since 3 GB cannot be held in the ArrayList at once.
Alternatively, if I import only a few lines from the rules-1 and rules-2 files at a time and compare them to each value, the MapReduce job takes a very long time to finish.
Could you provide any alternative ideas for how this can be done without the out-of-memory error? Would it help if I put rules-1 and rules-2 inside an HDFS-backed database or something? I'm running out of ideas and would really appreciate your suggestions.
If your input files are small, you can load them into static variables and use the rules as input.
If that is not the case, I can suggest the following approaches:
a) Give rules-1 and rules-2 a high replication factor, close to the number of nodes you have. Then you can read rules-1 and rules-2 from HDFS for each record in the input relatively efficiently, because it will be a sequential read from the local datanode.
b) If you can devise a hash function which, when applied to a rule and to an input string, predicts without false negatives that they can match, then you can emit this hash for both the rules and the input records and resolve all possible matches in the reducer. This is very similar to how a join is done in MapReduce (a rough sketch follows this list).
c) I would also consider other optimization techniques, like building search trees or sorting, since otherwise the problem looks computationally expensive and will take forever.
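A rough sketch of option (b) as a reduce-side join (the candidateKey() function and the final contains() check are placeholders for whatever rule comparison you actually need, and the key must guarantee no false negatives):
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

class CandidateKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private boolean ruleFile;

    @Override
    protected void setup(Context context) {
        // Tag each line by whether it came from a rules file or the record file.
        String name = ((FileSplit) context.getInputSplit()).getPath().getName();
        ruleFile = name.startsWith("rules-");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(candidateKey(line.toString())),
                      new Text((ruleFile ? "R|" : "D|") + line.toString()));
    }

    private String candidateKey(String s) {
        s = s.trim().toLowerCase();
        return s.substring(0, Math.min(4, s.length()));   // placeholder "hash"
    }
}

class MatchReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> rules = new ArrayList<>();
        List<String> records = new ArrayList<>();
        for (Text value : values) {
            String s = value.toString();
            if (s.startsWith("R|")) rules.add(s.substring(2));
            else records.add(s.substring(2));
        }
        // Only lines sharing a candidate key meet here, so the cross-compare stays small.
        for (String record : records) {
            for (String rule : rules) {
                if (record.contains(rule)) {               // replace with your real rule check
                    context.write(new Text(record), new Text(rule));
                }
            }
        }
    }
}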
On this page, see Real-World Cluster Configurations; it covers file size configuration.
You could use the param "mapred.child.java.opts" in conf/mapred-site.xml to increase the memory for your mappers. You might not be able to run as many map slots per server but with more servers in your cluster you could still parallelize your job.
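For example, the same property can also be set per job from the driver (the -Xmx value below is only an example; newer Hadoop versions use mapreduce.map.java.opts and mapreduce.reduce.java.opts instead):
Configuration conf = new Configuration();           // org.apache.hadoop.conf.Configuration
conf.set("mapred.child.java.opts", "-Xmx2048m");    // more heap for each map/reduce child JVM
Job job = Job.getInstance(conf, "rule-matching");   // org.apache.hadoop.mapreduce.Job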
Read the record text file as the MapReduce input, and read the rules/keyword text file from within the mapper using the HDFS API. Split the input with StringTokenizer on value.toString(). In your mapper, write the HDFS code to read the rules file; it reads line by line, so use two nested while loops to do the comparison, and whenever you find a match, send it to the reducer.
Alternatively, split the 3 GB text file into several smaller text files and run your previous MapReduce program over all of them as usual.
For splitting the text file I wrote a Java program; you decide how many lines to write into each output file.