How can I generate a single .avro file for a large flat file with 30MB+ data - avro

Currently two .avro files are generated for a 10 KB file; if I follow the same approach with my actual file (30MB+), I will get n number of files.
So I need a solution that generates only one or two .avro files even if the source file is large.
Also, is there any way to avoid the manual declaration of column names?
Current approach:
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Manual schema declaration of the 'ind' and 'co' column names and types
val customSchema = StructType(Array(
StructField("ind", StringType, true),
StructField("co", StringType, true)))
val df = sqlContext.read.format("com.databricks.spark.csv").option("comment", "\"").option("quote", "|").schema(customSchema).load("/tmp/file.txt")
df.write.format("com.databricks.spark.avro").save("/tmp/avroout")
// Note: /tmp/file.txt is input file/dir, and /tmp/avroout is the output dir

Try specifying the number of partitions of your DataFrame while writing the data as Avro (or any other format). To do that, use the repartition or coalesce DataFrame functions.
df.coalesce(1).write.format("com.databricks.spark.avro").save("/tmp/avroout")
That way it writes only one file in "/tmp/avroout".
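As for avoiding the manual column declaration: if the source file has a header row, spark-csv can take the column names from it and optionally infer the types as well (at the cost of an extra pass over the data). A minimal sketch under that assumption, so the hand-written customSchema becomes optional:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/tmp/file.txt")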
Hope this helps!

Related

pyarrow - identify the fragments written or filters used when writing a parquet dataset?

My use case is that I want to pass the file paths or filters to a task in Airflow as an xcom so that my next task can read the data which was just processed.
Task A writes a table to a partitioned dataset and a number of Parquet file fragments are generated --> Task B reads those fragments later as a dataset. I need to only read relevant data though, not the entire dataset which could have many millions of rows.
I have tested two approaches:
List modified files right after I finish writing to the dataset. This will provide me with a list of paths which I can call ds.dataset(paths) on during my next task. I can use partitioning.parse() on these paths or check the fragments to get a list of filters used (frag.partition_expression).
A flaw with this is that I can have files being written in parallel to the same dataset.
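For reference, a minimal sketch of how such a list of paths could be read back in the next task (the paths themselves are illustrative):
import pyarrow.dataset as ds

# Illustrative fragment paths passed along from the writing task (e.g. via an xcom)
paths = ["/data/mydataset/part=0/part-0.parquet",
         "/data/mydataset/part=1/part-0.parquet"]

dataset = ds.dataset(paths, format="parquet")
table = dataset.to_table()  # reads only the listed fragments, not the whole dataset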
I can generate the filters used when writing the dataset by turning the table into a pandas dataframe, doing a groupby, and then constructing filters. I am not sure if there is a simpler approach to this. I can then use pq._filters_to_expression() on the results to create a usable filter.
This is not ideal since I need to fix certain data types which do not get saved properly as an Airflow xcom (no pickling so everything has to be in json format). Also, if I want to partition on a dictionary column, I might need to tweak this function.
def create_filter_list(df, partition_columns):
    """Creates a list of pyarrow filters to be sent through an xcom and evaluated as an expression. Xcom disables pickling, so we need to save timestamp and date values as strings and convert downstream"""
    filter_list = []
    value_list = []
    partition_keys = [df[col] for col in partition_columns]
    for keys, _ in df[partition_columns].groupby(partition_keys):
        if len(partition_columns) == 1:
            if is_jsonable(keys):
                value_list.append(keys)
            elif keys is not None:
                value_list.append(str(keys))
        else:
            if not isinstance(keys, tuple):
                keys = (keys,)
            read_filter = []
            for name, val in zip(partition_columns, keys):
                if type(val) == np.int_:
                    read_filter.append((name, "==", int(val)))
                elif val is not None:
                    read_filter.append((name, "==", str(val)))
            filter_list.append(read_filter)
    if len(partition_columns) == 1:
        if len(value_list) > 0:
            filter_list = [(name, "in", value_list) for name in partition_columns]
    return filter_list
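For context, a rough sketch of how the resulting list might be used downstream (the dataset path and the "part" column name are illustrative, and pq._filters_to_expression is the private helper mentioned above):
import pyarrow.dataset as ds
import pyarrow.parquet as pq

filters = create_filter_list(df, ["part"])   # e.g. [("part", "in", ["0", "1"])]
expr = pq._filters_to_expression(filters)    # back from the json-friendly list to an expression
table = ds.dataset("/data/mydataset", partitioning="hive").to_table(filter=expr)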
Any suggestions on which approach I should take, or if there is a better way to achieve my goal?
You can watch this issue (https://issues.apache.org/jira/browse/ARROW-10440), which I believe does what you want. In the meantime, you could use basename_template as a workaround.
import glob
import os
import pyarrow as pa
import pyarrow.dataset as pads
class TrackingWriter:
    def __init__(self):
        self.counter = 0
        part_schema = pa.schema({'part': pa.int64()})
        self.partitioning = pads.HivePartitioning(part_schema)

    def next_counter(self):
        result = self.counter
        self.counter += 1
        return result

    def write_dataset(self, table, base_dir):
        counter = self.next_counter()
        pads.write_dataset(table, base_dir, format='parquet', partitioning=self.partitioning,
                           basename_template=f'batch-{counter}-part-{{i}}')
        files_written = glob.glob(os.path.join(base_dir, '**', f'batch-{counter}-*'))
        return files_written
table_one = pa.table({'part': [0, 0, 1, 1], 'val': [1, 2, 3, 4]})
table_two = pa.table({'part': [0, 0, 1, 1], 'val': [5, 6, 7, 8]})
writer = TrackingWriter()
print(writer.write_dataset(table_one, '/tmp/mydataset'))
print(writer.write_dataset(table_two, '/tmp/mydataset'))
This is just a rough sketch. You'd probably also want code to run at startup to see what the next free value of counter is. Or you could use a uuid instead of a counter.
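For example, a variation of the sketch above that uses a uuid, so nothing has to be tracked across restarts (the function name and arguments are illustrative):
import glob
import os
import uuid
import pyarrow.dataset as pads

def write_dataset_tracked(table, base_dir, partitioning):
    # A random run id makes the basename unique without a persistent counter
    run_id = uuid.uuid4().hex
    pads.write_dataset(table, base_dir, format='parquet', partitioning=partitioning,
                       basename_template=f'batch-{run_id}-part-{{i}}')
    # Collect only the files written by this call
    return glob.glob(os.path.join(base_dir, '**', f'batch-{run_id}-*'))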
A suggestion (not sure if this is optimal for your use case or not):
The key problem is the need to correctly select a subset of the data, and this can be 'fixed' upstream. The function/script that updates the big dataframe can contain a condition to save a temporary copy of the data that was modified (and satisfies some requirements) to a separate, temporary path. That file would then be passed to the downstream tasks, which can delete it once it has been processed.
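A rough sketch of that idea, with all names and the staging path purely illustrative:
import os
import uuid
import pyarrow.dataset as pads

def write_with_staging_copy(table, base_dir, staging_root="/tmp/staging"):
    # Write the batch into the main partitioned dataset as usual
    pads.write_dataset(table, base_dir, format="parquet")
    # Also save a temporary copy holding only this batch
    staging_path = os.path.join(staging_root, uuid.uuid4().hex)
    pads.write_dataset(table, staging_path, format="parquet")
    return staging_path  # pass this downstream (e.g. via an xcom); delete it once processed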

Modify values programmatically SPSS

I have a file with more than 250 variables and more than 100 cases. Some of these variables have an error in the decimal point (20445.12 should be 2.044512).
I want to modify these data programmatically. I found a possible way in the Visual Basic editor provided by SPSS (see the screenshot below), but I have an absolute lack of knowledge of it.
How can I select a range of cells in this language?
How can I store the cell's value once its data has been modified?
--- EDITED NEW DATA ----
Thank you for your fast reply.
The problem now is the number of digits each number has. For example, erroneous data could have the following formats:
Case A) 43998 (five digits) ---> 4.3998 as the correct value.
Case B) 4399 (four digits) ---> 4.3990 as the correct value, but parsed as 0.4399 because the trailing 0 was removed when the file was created.
Is there any way, like:
IF (NUM < 10000) THEN NUM = NUM / 1000 ELSE NUM = NUM / 10000
Or something like IF (Number_of_digits(NUM)) THEN ...
Thank you.
There's no need for a VB script; go this way:
Open a syntax window and paste the following code:
do repeat vr=var1 var2 var3 var4.
compute vr=vr/10000.
end repeat.
save outfile="filepath\My corrected data.sav".
exe.
Replace var1 var2 var3 var4 with the names of the actual variables you need to change. For variables that are contiguous in the file you may use var1 to var4.
Replace vr=vr/10000 with whatever mathematical calculation you would like to use to correct the data.
Replace "filepath\My corrected data.sav" with your path and file name.
WARNING: this syntax will change the data in your file. You should make sure to create a backup of your original in addition to saving the corrected data to a new file.

how to form fields section in xfd file

I'm having a problem forming the field section's structure in an .xfd file after analysing my file with the command "vutil32.exe -i -kx pogl.dad". I hope somebody could help me work out how to form the field structure highlighted below. I've uploaded a sample of my file, "pglc.dad"; I hope someone can guide me on how to form the .xfd file. Thanks.
Result from vutil32.exe
file size: 250880
record size (min/max): 121/1024 compressed(80%)
# of keys: 4
key size: 16:02 31:03 56:03 15
key offset: 0 0 0 1
duplicates okay: N N N N
block size: 512
blocks per granule: 1
tree height: 4/2/2.7
# of nodes: 200
# of deleted nodes: 1
total node space: 101800
node space used: 67463 (66%)
user count: 0
Key Dups Seg-1 Seg-2 Seg-3 Seg-4 Seg-5 Seg-6
(sz/of) (sz/of) (sz/of) (sz/of) (sz/of) (sz/of)
0 N 1/0 15/1
1 N 1/0 15/66 15/1
2 N 1/0 40/81 15/1
3 N 15/1
Here is my construction of the .xfd file so far.
XFD,02,PGLC,PGLC
00300,00041,004
1,0,013,00000
01
PGSTAT
3,0,004,00004,020,00021,004,00000
3
PGSTAT
PGDESC
PGLINE
3,0,004,00004,008,00013,004,00000
03
PGSTAT
PGDESC
PGLINE
1,0,012,00021
01
PGSTAT
000
0150,00150,00003 =================>> How can I form this field section?
00000,00013,16,00016,+00,000,000,PGSTAT
00000,00001,16,00001,+00,000,000,PGDESC
00001,00015,16,00015,+00,000,000,PGLINE
here is the link for my pglc.dad : http://files.engineering.com/getfile.aspx?folder=080fdad6-b1d5-4a37-8dd0-b89f9a985c69&file=PGLC.DAD
Thanks in advance to anyone who can help.
I know the XFD format intimately as I have written a couple of parsers of this file format in both Perl and Cobol.
Having said that, I would strongly recommend that you do not try to hand craft an XFD file from scratch.
If you have an AcuCobol (MicroFocus) compiler, and the source of the file's SELECT and FD definitions, then you can create a very small Cobol program that has just the SELECT and FD definitions and then compile the program using:
ccbl32.exe -Fx <program>
That will create an XFD file for the indexed file definition. Note, you can specify a directory for the created XFD file using the -Fo <directory> option.
If you don't have the source of the file definitions, then you are just going to be guessing what and where the fields are. The indexed file by itself will not tell you that information. I can see from extracting the data in your file (using the vutil -e option) that the file contains binary data as well as text, so without knowing exactly what PICture those fields are (COMP-?) you will be struggling to figure out the structure of those fields.

Reading and parsing a large .dat file

I am trying to parse a huge .dat file (4 GB). I have tried with R but it just takes too long. Is there a way to parse a .dat file by segments, for example every 30000 lines? Any other solutions would also be welcome.
This is what it looks like:
These are the first two lines with header:
ST|ZIPCODE|GEO_ID|GEO_TTL|FOOTID_GEO|NAICS2012|NAICS2012_TTL|FOOTID_NAICS|YEAR|EMPSZES|EMPSZES_TTL|ESTAB|ESTAB_F <br/>
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|001|All establishments|167| <br/>
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|212|Establishments with 1 to 4 employees|91|
One option for reading the data faster in R is the fread function from the data.table package.
EDIT
I removed all <br/> new-line tags. This is the edited dataset
ST|ZIPCODE|GEO_ID|GEO_TTL|FOOTID_GEO|NAICS2012|NAICS2012_TTL|FOOTID_NAICS|YEAR|EMPSZES|EMPSZES_TTL|ESTAB|ESTAB_F
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|001|All establishments|167|
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|212|Establishments with 1 to 4 employees|91|
Then I matched the variables with their classes (you should use nrows of around 100 so the classes are detected reliably):
colclasses = sapply(read.table(edited_data, nrows=1, sep="|", header=T),class)
Then I read the edited data.
your_data <- fread(edited_data, sep="|", sep2=NULL, nrows=-1L, header=T, na.strings="NA",
                   stringsAsFactors=FALSE, verbose=FALSE, autostart=30L, skip=-1L, select=NULL,
                   colClasses=colclasses)
Everything worked like a charm. In case you have problems removing the tags, use this simple Python script (it will take some time for sure):
original_file = file_path_to_original_file  # e.g. "/Users/User/file.dat"
edited_file = file_path_to_new_file  # e.g. "/Users/User/file_edited.dat"
with open(original_file) as inp:
    with open(edited_file, "w") as op:
        for line in inp:
            op.write(line.replace("<br/>", ""))
P.S.
You can use read.table with similar optimizations, but it won't give you nearly as much speed.
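If you specifically want to process the file in segments of, say, 30000 lines, a minimal sketch in Python using pandas (the file path and the per-chunk processing are placeholders):
import pandas as pd

# Stream the pipe-delimited file in chunks of 30,000 rows instead of loading 4 GB at once
for chunk in pd.read_csv("/Users/User/file_edited.dat", sep="|", dtype=str, chunksize=30000):
    process(chunk)  # replace with whatever you need to do with each segment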

Can h5py load a file from a byte array in memory?

My python code is receiving a byte array which represents the bytes of the hdf5 file.
I'd like to read this byte array into an in-memory h5py file object without first writing it to disk. This page says that I can open a memory-mapped file, but it would be a new, empty file. I want to go from byte array to in-memory hdf5 file, use it, discard it, and never write to disk at any point.
Is it possible to do this with h5py? (or with hdf5 using C if that is the only way)
You could try to use Binary I/O to create a file object and read it via h5py:
import io
import h5py

f = io.BytesIO(YOUR_H5PY_STREAM)
h = h5py.File(f, 'r')
You can use io.BytesIO or tempfile to create h5 objects, as shown in the official docs: http://docs.h5py.org/en/stable/high/file.html#python-file-like-objects.
The first argument to File may be a Python file-like object, such as an io.BytesIO or tempfile.TemporaryFile instance. This is a convenient way to create temporary HDF5 files, e.g. for testing or to send over the network.
tempfile.TemporaryFile
>>> tf = tempfile.TemporaryFile()
>>> f = h5py.File(tf, 'w')
or io.BytesIO
"""Create an HDF5 file in memory and retrieve the raw bytes
This could be used, for instance, in a server producing small HDF5
files on demand.
"""
import io
import h5py
bio = io.BytesIO()
with h5py.File(bio, 'w') as f:
    f['dataset'] = range(10)

data = bio.getvalue()  # data is a regular Python bytes object.
print("Total size:", len(data))
print("First bytes:", data[:10])
The following example uses PyTables (tables), which can also read and manipulate the HDF5 format, in lieu of h5py.
import urllib.request
import tables
url = 'https://s3.amazonaws.com/<your bucket>/data.hdf5'
response = urllib.request.urlopen(url)
h5file = tables.open_file("data-sample.h5", driver="H5FD_CORE",
                          driver_core_image=response.read(),
                          driver_core_backing_store=0)
