infer byte size of JSON file stored in cassandra column for each row - stored-procedures

I'm querying a vendor's Cassandra database to fetch data from a table. The data returned is a JSON document stored as text. I want to determine the average size of the JSON document in the Cassandra table, along with other stats like the max and min size for each partition.
Can we achieve this using SELECT queries with aggregate functions?
Please suggest how to get the desired output.

nodetool tablestats will tell you the min/max/avg partition sizes for a given table. nodetool tablehistograms will give you even finer-grained information.
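For example, a quick sketch from the command line (substitute your actual keyspace and table names):
# compacted partition minimum/maximum/mean bytes for one table
nodetool tablestats my_keyspace.my_table
# percentile histograms of partition size and cell count
nodetool tablehistograms my_keyspace my_table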

Related

Read parquet in chunks according to ordered column index with pyarrow

I have a dataset composed of multiple parquet files clip1.parquet, clip2.parquet, .... Each row corresponds to a point in some frame, and there's an ordered column frame specifying the corresponding frame: 1,1,...,1,2,2,2,...,2,3,3,...,3,.... There are several thousand rows for each frame, but the exact number is not necessarily the same. Frame numbers do not reset in each clip.
What is the fastest way to iteratively read all rows belonging to one frame?
Loading the whole dataset into memory is not possible. I assume a standard row filter will check against all rows, which is not optimal (I know they are ordered by frame). I was thinking it could be possible to match a row group to each frame, but I wasn't sure whether that's good practice or even possible with differently sized groups.
Thanks!
It is reasonable in your case to consider the frame column as your index, and you can specify this when loading. If you scan the metadata of all the files (this is fast for local data, but not on by default), then dask will know the min and max frame values for each file. Therefore, selecting on the index will only read the files which have at least some corresponding values.
df = dd.read_parquet("clip*.parquet", index="frame", calculate_divisions=True)
df[df.index == 1]  # do something with this
Alternatively, you can specify filters in read_parquet if you want even more control, and you would make a new dataframe object for each iteration.
Note, however, that a groupby might do what you want, without having to iterate over the frame numbers. Dask is pretty smart about loading only part of the data at a time and aggregating partial results from each partition. How well this works depends on how complicated an algorithm you want to do to each row set.
I should mention that both parquet backends support all these options; you don't specifically need pyarrow.
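A minimal sketch of those two alternatives, assuming the clips live in the working directory and frame 7 is just an illustrative value:
import dask.dataframe as dd

# filters= is pushed down to the parquet reader, so only row groups whose
# statistics can contain frame == 7 are read from disk
one_frame = dd.read_parquet("clip*.parquet", filters=[("frame", "==", 7)]).compute()

# or let Dask iterate for you: apply an aggregation per frame via groupby
ddf = dd.read_parquet("clip*.parquet")
rows_per_frame = ddf.groupby("frame").size().compute()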

influxdb query group by value

I am new to InfluxDB and the TICK environment, so maybe it is a basic question, but I have not found how to do this. I have installed InfluxDB 1.7.2, with Telegraf listening to an MQTT server that receives JSON data generated by different devices. I have Chronograf to visualize the data that is being received.
The JSON data is a very simple message indicating the generating device as a string and some numeric values detected. I have created some graphs indicating the number of messages received in 5-minute intervals by one of the devices.
SELECT count("devid") AS "Device" FROM "telegraf"."autogen"."mqtt_consumer" WHERE time > :dashboardTime: AND "devid"='D9BB' GROUP BY time(5m) FILL(null)
As you can see, in this query I am setting the device id by hand. I can set this query alone in a graph or combine multiple similar queries for different devices, but I am limited to previously identifying the devices to be controlled.
Is it possible to obtain the results grouped by the values contained in devid? In SQL this would mean including something like GROUP BY "devid", but I have not been able to make it work.
Any ideas?
You can use "GROUP BY devid" if devid is a tag in measurement scheme. In case of devid being the only tag the number of unique values of devid tag is the number of time series in "telegraf"."autogen"."mqtt_consumer" measurement. Typically it is not necessary to use some value both as tag and field. You can think of a set of tags in a measurement as a compound unique index(key) in conventional SQL database.

SNMP operations on MIB

Hello, I am creating a MIB and I have a table with attributes of files. I have name, file type, etc., and a DateAndTime object to represent the time at which the file was created.
In order to delete elements of said table, one column has to be of the RowStatus type.
Now my question is: if I wanted to get all files that were created in the last 12 hours, what command sequence would the SNMP agent use to select that?
To my knowledge it is not possible to select data within a timeframe attribute inside a table.
I found there is no way to select data by timestamp in SNMP as you would do in an SQL query.
In a table you have to read all the data and, if needed, select just the rows whose timestamp falls within the timeframe you are looking for.
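A rough sketch of that client-side filtering with pysnmp, where the host, community string and column OID are placeholders for your own MIB's creation-time column:
from datetime import datetime, timedelta
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, nextCmd)

# placeholder OID of the table column holding the DateAndTime value
CREATION_TIME_COLUMN = '1.3.6.1.4.1.99999.1.1.1.5'

def parse_date_and_time(value):
    # DateAndTime (SNMPv2-TC): 2-byte year, then month, day, hour, minute, second, ...
    b = value.asOctets()
    return datetime((b[0] << 8) | b[1], b[2], b[3], b[4], b[5], b[6])

cutoff = datetime.now() - timedelta(hours=12)

# walk just this column and keep the rows created in the last 12 hours
for err_ind, err_stat, err_idx, var_binds in nextCmd(
        SnmpEngine(), CommunityData('public'),
        UdpTransportTarget(('192.0.2.1', 161)), ContextData(),
        ObjectType(ObjectIdentity(CREATION_TIME_COLUMN)),
        lexicographicMode=False):
    if err_ind or err_stat:
        break
    for oid, value in var_binds:
        if parse_date_and_time(value) >= cutoff:
            print(oid, value.prettyPrint())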

How do you INSERT into influxDB using the SQL-like interface?

Is it possible to INSERT data into series / measurements using the SQL-like interface on InfluxDB?
Yes, you can simply INSERT a Line Protocol string.
An example from Getting Started:
INSERT cpu,host=serverA,region=us_west value=0.64
A point with the measurement name of cpu and tags host and region has now been written to the database, with the measured value of 0.64.
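To check that the point was written, you can query it back with something like:
SELECT "host", "region", "value" FROM "cpu"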

Neo4j console returning only 1000 rows

I would like to compare MySQL and Neo4j, and for that I have dumped a lot of data into Neo4j as well as MySQL.
Now the problem is that in Neo4j I cannot execute a query which returns more than 1000 rows, therefore I cannot see the execution time of that query.
In MySQL I can easily see the execution time in the console.
Also I would like to see a complete graphical view of all my nodes in Neo4j. Is it possible?
The limitation to 1000 results is a safety net within Neo4j Browser. Otherwise very long results might mess up your web browser in terms of memory and/or CPU for rendering.
To get the full plain results for comparison, send your query as a REST request using e.g. cURL. See http://docs.neo4j.org/chunked/stable/rest-api-transactional.html#rest-api-begin-and-commit-a-transaction-in-one-request for an example of the request; make sure the Accept and Content-Type headers are set to application/json. Additionally you might stream the result as documented at http://docs.neo4j.org/chunked/stable/rest-api-streaming.html.
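A minimal sketch of such a request (host, port and the Cypher statement are placeholders; adjust them to your setup):
curl -X POST http://localhost:7474/db/data/transaction/commit \
     -H "Accept: application/json" -H "Content-Type: application/json" \
     -d '{"statements": [{"statement": "MATCH (n) RETURN n LIMIT 10000"}]}'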
There is a setting that limits the number of rows displayed; by default it is 1000. To change this setting, open Neo4j Browser, go to Settings, and in the Graph Visualization section you will find Max Rows. You can change this to fit your needs. In the same section there is another property, Initial Node Display, which limits the number of nodes displayed on the first load of the graph visualization.
