pyspark error while using loadLabeledPoints RDD - parsing

I am using pyspark.
I read a libsvm file, transpose it, and then save it again.
I save every data row as a LabeledPoint object with sparse features.
I tried saving with MLUtils.saveAsLibSVMFile and then reading the files back with MLUtils.loadLibSVMFile, and I get the following error:
ValueError: could not convert string to float: [
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1055)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
I read on the MLUtils page that if you want to use loadLabeledPoints, you need to save the data using RDD.saveAsTextFile, but when I do this, I get:
17/08/10 16:55:51 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 3, 192.168.1.205, executor 0): org.apache.spark.SparkException: Cannot parse a double from: [
at org.apache.spark.mllib.util.NumericParser$.parseDouble(NumericParser.scala:120)
at org.apache.spark.mllib.util.NumericParser$.parseArray(NumericParser.scala:70)
at org.apache.spark.mllib.util.NumericParser$.parseTuple(NumericParser.scala:91)
at org.apache.spark.mllib.util.NumericParser$.parse(NumericParser.scala:41)
at org.apache.spark.mllib.regression.LabeledPoint$.parse(LabeledPoint.scala:62)
at org.apache.spark.mllib.util.MLUtils$$anonfun$loadLabeledPoints$1.apply(MLUtils.scala:195)
at org.apache.spark.mllib.util.MLUtils$$anonfun$loadLabeledPoints$1.apply(MLUtils.scala:195)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:121)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NumberFormatException: For input string: "["
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at org.apache.spark.mllib.util.NumericParser$.parseDouble(NumericParser.scala:117)
... 30 more
How can I save an RDD of labeled points in libsvm format and then load it back from disk using pyspark?
Thanks

The issue was that writing the LabeledPoints to a file did not use the libsvm format, which made them hard to re-read.
I solved it by creating the LabeledPoint in memory and, before writing it to a file, converting it to a libsvm-format string. I then wrote that string as text, and afterwards I was able to read it back in libsvm format:
def pointToLibsvmRow(point):
    # point.features is assumed to hold index/value pairs flattened into a
    # single array ([i1, i2, ..., v1, v2, ...]); reshape into (index, value) pairs.
    s = point.features.reshape(2, -1, order="C").transpose().astype("str")
    # libsvm line format: "<label> <index>:<value> <index>:<value> ..."
    pairs = [str(int(float(point.label)))] + ["%s:%s" % (str(int(float(a))), b) for a, b in s.tolist()]
    return " ".join(pairs)
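For the write/read round trip, something like the following should work (a minimal sketch: the RDD name points and the output path are placeholders, not part of the original code):
from pyspark.mllib.util import MLUtils
# 'points' is an RDD of LabeledPoint objects built as described above;
# 'sc' is the active SparkContext and the path is a placeholder.
points.map(pointToLibsvmRow).saveAsTextFile("/tmp/transposed_libsvm")
# Reading the text files back as libsvm returns an RDD of LabeledPoint.
reloaded = MLUtils.loadLibSVMFile(sc, "/tmp/transposed_libsvm")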

Related

Error when opening a .nc file with raster package

I'm new to the raster package in R. I was trying to open a .nc file with the raster package and an error popped up. In case you want to try it, I was using this dataset from Copernicus of monthly sea salinity for the years 2018 and 2019 (the grid covers the Quebec St. Lawrence estuary and the Gaspesie coast).
I have opened similar data files before but never got this error, and a search online did not clarify much.
Here is my script
library(raster)
library(ncdf4)
#Load the .nc files describing SSS.
SSS = stack('SSS.nc')
and the output error
Warning message:
In .rasterObjectFromCDF(x, type = objecttype, band = band, ...) :
"level" set to 1 (there are 17 levels)
I expected to create a RasterStack object to work with.
Thanks
There is no error, there is a warning. If you want another level you can select it.
Either way, the "raster" package is obsolete, and you should try this with terra.
You can probably do
library(terra)
x <- rast('SSS.nc')
"Probably" because you do not provide a file. You should at least include a hyperlink to the file you are using. It is hard to help you without your example being reproducible.

Is there any way to transform "blkparse output file" to "blktrace raw binary file"?

I'm trying to replay a block trace record file through fio (using the --read_iolog option).
However, I only have the output file of blkparse (not the output of blktrace or a binary dump), which is not accepted by fio.
blkparse output file (which I have) example:
8,0 1 1 0.000000000 30628 A WS 67045376 + 2047 <- (8,2) 65994752
Is there any way to transform "blkparse output file" to "raw blktrace (merged) binary file"?
In other words, in the situation below I only have file B without file A; how can I get file C?
blkparse --dump-binary=C --output=B --input=A
Thanks

How can I generate a single .avro file for a large flat file with 30MB+ of data

Currently two Avro files are generated for a 10 KB file. If I follow the same approach with my actual file (30MB+), I will get n files.
So I need a solution that generates only one or two .avro files even if the source file is large.
Also, is there any way to avoid the manual declaration of column names?
Current approach:
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Manual schema declaration of the 'co' and 'id' column names and types
val customSchema = StructType(Array(
  StructField("ind", StringType, true),
  StructField("co", StringType, true)))
val df = sqlContext.read.format("com.databricks.spark.csv").option("comment", "\"").option("quote", "|").schema(customSchema).load("/tmp/file.txt")
df.write.format("com.databricks.spark.avro").save("/tmp/avroout")
// Note: /tmp/file.txt is input file/dir, and /tmp/avroout is the output dir
Try specifying the number of partitions of your DataFrame while writing the data as Avro (or any other format). To fix this, use the repartition or coalesce DataFrame functions:
df.coalesce(1).write.format("com.databricks.spark.avro").save("/tmp/avroout")
so that it writes only one file to "/tmp/avroout".
Hope this helps!

High Aerospike latency

In the Aerospike set we have four bins (userId, adId, timestamp, eventType), and the primary key is userId:timestamp. A secondary index is created on userId to get all the records for a particular user, and the resulting records are passed to a stream UDF. On our client side, up to 500 QPS the Aerospike query latency is reasonable and the mean latency is in microseconds, but as soon as we increase the QPS above 500 the query latency shoots up (to around 10 ms).
The message that we see on the client side is attached below:
Name: Aerospike-13780
State: WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#41aa29a4
Total blocked: 0 Total waited: 554,450
Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
com.aerospike.client.lua.LuaInputStream.read(LuaInputStream.java:38)
com.aerospike.client.lua.LuaStreamLib$read.call(LuaStreamLib.java:60)
org.luaj.vm2.LuaClosure.execute(Unknown Source)
org.luaj.vm2.LuaClosure.onInvoke(Unknown Source)
org.luaj.vm2.LuaClosure.invoke(Unknown Source)
org.luaj.vm2.LuaClosure.execute(Unknown Source)
org.luaj.vm2.LuaClosure.onInvoke(Unknown Source)
org.luaj.vm2.LuaClosure.invoke(Unknown Source)
org.luaj.vm2.LuaValue.invoke(Unknown Source)
com.aerospike.client.lua.LuaInstance.call(LuaInstance.java:128)
com.aerospike.client.query.QueryAggregateExecutor.runThreads(QueryAggregateExecutor.java:104)
com.aerospike.client.query.QueryAggregateExecutor.run(QueryAggregateExecutor.java:77)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Below is the Lua file:
function ad_count(stream)
    local function map_function(record)
        local result = map()
        result["adId"] = record["adId"]
        result["timestamp"] = record["timestamp"]
        return result
    end

    local function add_fn(aggregate, record)
        local ad_id = record["adId"]
        local map_result = aggregate[ad_id]
        local l = list()
        if map_result == null then
            map_result = l
        end
        list.append(map_result, record["timestamp"])
        aggregate[ad_id] = map_result
        return aggregate
    end

    local m = map()
    return stream:map(map_function):aggregate(m, add_fn)
end
There are 2 nodes, and the servers are hosted on AWS with the t2.large instance type.
transaction-queues=8;transaction-threads-per-queue=8;transaction-duplicate-threads=0;transaction-pending-limit=20;migrate-threads=1;migrate-xmit-priority=40;migrate-xmit-sleep=500;migrate-read-priority=10;migrate-read-sleep=500;migrate-xmit-hwm=10;migrate-xmit-lwm=5;migrate-max-num-incoming=256;migrate-rx-lifetime-ms=60000;proto-fd-max=15000;proto-fd-idle-ms=60000;proto-slow-netio-sleep-ms=1;transaction-retry-ms=1000;transaction-max-ms=1000;transaction-repeatable-read=false;dump-message-above-size=134217728;ticker-interval=10;microbenchmarks=false;storage-benchmarks=false;ldt-benchmarks=false;scan-max-active=100;scan-max-done=100;scan-max-udf-transactions=32;scan-threads=4;batch-index-threads=4;batch-threads=4;batch-max-requests=5000;batch-max-buffers-per-queue=255;batch-max-unused-buffers=256;batch-priority=200;nsup-delete-sleep=100;nsup-period=120;nsup-startup-evict=true;paxos-retransmit-period=5;paxos-single-replica-limit=1;paxos-max-cluster-size=32;paxos-protocol=v3;paxos-recovery-policy=manual;write-duplicate-resolution-disable=false;respond-client-on-master-completion=false;replication-fire-and-forget=false;info-threads=16;allow-inline-transactions=true;use-queue-per-device=false;snub-nodes=false;fb-health-msg-per-burst=0;fb-health-msg-timeout=200;fb-health-good-pct=50;fb-health-bad-pct=0;auto-dun=false;auto-undun=false;prole-extra-ttl=0;max-msgs-per-type=-1;service-threads=40;fabric-workers=16;pidfile=/var/run/aerospike/asd.pid;memory-accounting=false;udf-runtime-gmax-memory=18446744073709551615;udf-runtime-max-memory=18446744073709551615;sindex-builder-threads=4;sindex-data-max-memory=18446744073709551615;query-threads=6;query-worker-threads=15;query-priority=10;query-in-transaction-thread=0;query-req-in-query-thread=0;query-req-max-inflight=100;query-bufpool-size=256;query-batch-size=100;query-priority-sleep-us=1;query-short-q-max-size=500;query-long-q-max-size=500;query-rec-count-bound=18446744073709551615;query-threshold=10;query-untracked-time-ms=1000;pre-reserve-qnodes=false;service-address=0.0.0.0;service-port=3000;mesh-address=10.0.1.80;mesh-port=3002;reuse-address=true;fabric-port=3001;fabric-keepalive-enabled=true;fabric-keepalive-time=1;fabric-keepalive-intvl=1;fabric-keepalive-probes=10;network-info-port=3003;enable-fastpath=true;heartbeat-mode=mesh;heartbeat-protocol=v2;heartbeat-address=10.0.1.80;heartbeat-port=3002;heartbeat-interval=150;heartbeat-timeout=10;enable-security=false;privilege-refresh-period=300;report-authentication-sinks=0;report-data-op-sinks=0;report-sys-admin-sinks=0;report-user-admin-sinks=0;report-violation-sinks=0;syslog-local=-1;enable-xdr=false;xdr-namedpipe-path=NULL;forward-xdr-writes=false;xdr-delete-shipping-enabled=true;xdr-nsup-deletes-enabled=false;stop-writes-noxdr=false;reads-hist-track-back=1800;reads-hist-track-slice=10;reads-hist-track-thresholds=1,8,64;writes_master-hist-track-back=1800;writes_master-hist-track-slice=10;writes_master-hist-track-thresholds=1,8,64;proxy-hist-track-back=1800;proxy-hist-track-slice=10;proxy-hist-track-thresholds=1,8,64;udf-hist-track-back=1800;udf-hist-track-slice=10;udf-hist-track-thresholds=1,8,64;query-hist-track-back=1800;query-hist-track-slice=10;query-hist-track-thresholds=1,8,64;query_rec_count-hist-track-back=1800;query_rec_count-hist-track-slice=10;query_rec_count-hist-track-thresholds=1,8,64
We even changed the following config parameters but it further increased the latency:
query-batch-size=1000
query-short-q-max-size=100000
query-long-q-max-size=100000
query-threads=28
query-worker-threads=400
query-req-max-inflight=1000
Linking back to the same question on the Aerospike community forum: https://discuss.aerospike.com/t/high-query-latency/2769/3
You're mainly bumping against the limits of that specific instance. We do not recommend using the t2 family. Please review the recommendations section of Aerospike's Amazon deployment guide. You should consider instances in the m3, c3, c4, r3, or i2 instance families.
A quick note on your configuration:
transaction-queues should be tuned to the number of cores. You have it set to 8, and a t2.large has 2 cores.
transaction-threads-per-queue is set too high for this instance type.
query-threads and query-worker-threads are both set too high for this instance.
Further, your queries involve a stream UDF, so they will be slower than a regular query. UDFs are useful, but you need to consider the context in which they're a good fit.
My suggestion is that you can probably model this differently, and skip the secondary index and UDF.
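For example, one possible remodel (a minimal sketch with the Aerospike Python client; the namespace, set, and bin names are made up, and a production design would merge new events with map operations rather than overwrite the bin with put): key each record by userId alone and keep that user's events in a single map bin of timestamp -> adId, so fetching a user's ads becomes a plain key-value get with no secondary index and no UDF.
import aerospike

# Hypothetical data model: one record per user, keyed by userId, with a map
# bin of timestamp -> adId. Namespace, set, and bin names are illustrative.
client = aerospike.client({'hosts': [('10.0.1.80', 3000)]}).connect()

key = ('test', 'user_events', 'user-123')
client.put(key, {'events': {1502380800: 'ad-42', 1502384400: 'ad-7'}})

# A single key-value read replaces the secondary-index query plus stream UDF.
_, _, bins = client.get(key)
print(bins['events'])

client.close()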

Can h5py load a file from a byte array in memory?

My python code is receiving a byte array which represents the bytes of the hdf5 file.
I'd like to read this byte array to an in-memory h5py file object without first writing the byte array to disk. This page says that I can open a memory mapped file, but it would be a new, empty file. I want to go from byte array to in-memory hdf5 file, use it, discard it and not to write to disk at any point.
Is it possible to do this with h5py? (or with hdf5 using C if that is the only way)
You could try to use binary I/O (io.BytesIO) to create a file-like object and read it via h5py:
f = io.BytesIO(YOUR_H5PY_STREAM)
h = h5py.File(f,'r')
You can use io.BytesIO or tempfile to create h5 objects, as shown in the official docs: http://docs.h5py.org/en/stable/high/file.html#python-file-like-objects
The first argument to File may be a Python file-like object, such as an io.BytesIO or tempfile.TemporaryFile instance. This is a convenient way to create temporary HDF5 files, e.g. for testing or to send over the network.
tempfile.TemporaryFile
>>> tf = tempfile.TemporaryFile()
>>> f = h5py.File(tf)
or io.BytesIO
"""Create an HDF5 file in memory and retrieve the raw bytes
This could be used, for instance, in a server producing small HDF5
files on demand.
"""
import io
import h5py
bio = io.BytesIO()
with h5py.File(bio) as f:
    f['dataset'] = range(10)
data = bio.getvalue() # data is a regular Python bytes object.
print("Total size:", len(data))
print("First bytes:", data[:10])
The following example uses PyTables (tables), which can also read and manipulate the HDF5 format, in lieu of h5py.
import urllib.request
import tables
url = 'https://s3.amazonaws.com/<your bucket>/data.hdf5'
response = urllib.request.urlopen(url)
h5file = tables.open_file("data-sample.h5", driver="H5FD_CORE",
                          driver_core_image=response.read(),
                          driver_core_backing_store=0)
