In InfluxDB, how to delete entire measurements using regex?

DELETE FROM <measurement_name> is used in InfluxDB to delete all points from a specific measurement. You can also delete all measurements at once using regex (see "In InfluxDB, how to delete all measurements?").
But, how do I delete all points from all measurements starting with a specific string?

Here's how I managed to solve my specific problem:
DELETE FROM /^<starting_string>/
Of course this generalizes to any applicable regex; just be sure to wrap your expression in /. Note the ^ anchor: without it the pattern also matches measurements that merely contain the string rather than start with it.
Hope that helps someone out there!
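If you script such clean-ups, the statement can be built safely first. This is a sketch using only the Python standard library; `delete_by_prefix_query` is a hypothetical helper, and actually sending the statement would go through whatever client you use (e.g. influxdb-python's `query()`):

```python
import re

def delete_by_prefix_query(prefix: str) -> str:
    """Build an InfluxQL statement deleting all points from every
    measurement whose name starts with `prefix`.  The prefix is escaped
    so regex metacharacters are taken literally, and the pattern is
    anchored so 'cpu' does not also match 'xcpu'."""
    return "DELETE FROM /^{}/".format(re.escape(prefix))

print(delete_by_prefix_query("temp_"))  # DELETE FROM /^temp_/
```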

Related

How can I one-hot encode the data which has multiple same values for different properties?

I have data containing candidates who look for a job. The original data I got was a complete mess but I managed to enhance it. Now, I am facing an issue which I am not able to resolve.
One candidate record looks like
https://i.imgur.com/LAPAIbX.png
Since ML algorithms cannot work with categorical data, I want to encode this. My goal is to have a candidate record looking like this:
https://i.imgur.com/zzsiDzy.png
What I need to change is to add a new column for each possible value that exists in Knowledge1, Knowledge2, Knowledge3, Knowledge4, Tag1, and Tag2 of original data, but without repetition. I managed to encode it to get way more attributes than I need, which results in an inaccurate model. The way I tried gives me newly created attributes Jscript_Knowledge1, Jscript_Knowledge2, Jscript_Knowledge3 and so on, for each possible option.
If the explanation is not clear enough please let me know so that I could explain it further.
Thanks and any help is highly appreciated.
Cheers!
I have some understanding of your problem from your explanation, and I will elaborate on how I would approach it. If that doesn't solve your problem, I may need a fuller explanation. Let's get started.
For all the candidate data that you have, collect a master skill/knowledge list.
This list becomes your columns.
For each candidate, if he has a given skill, that column becomes 1 for his record; otherwise it stays 0.
This is the essence of one-hot encoding; however, since the same skill is scattered across multiple columns, you are struggling to encode it automatically.
An alternative approach could be:
For each candidate, collect all the knowledge skills into a list and assign it to a single knowledge column, and all the tags into another list in a second column, instead of the current 4 (Knowledge) + 2 (Tag) columns.
Sort the knowledge (and tag) list alphabetically within this column.
Automatic one-hot encoding after this may yield fewer columns than before.
Hope this helps!
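If it helps, here is a small pandas sketch of that idea. The data and column names are hypothetical (modelled on the screenshots): all skill columns are melted into one long column, then cross-tabulated so each distinct skill becomes exactly one indicator column, no matter which original column it came from.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has Knowledge1..Knowledge4
# and Tag1..Tag2 columns.
df = pd.DataFrame({
    "Id": [1, 2],
    "Knowledge1": ["Jscript", "Java"],
    "Knowledge2": ["Java", "SQL"],
    "Tag1": ["Jscript", "SQL"],
})

skill_cols = ["Knowledge1", "Knowledge2", "Tag1"]

# Melt the skill columns into one long column, then cross-tabulate:
# one indicator column per distinct skill value.
long = df.melt(id_vars="Id", value_vars=skill_cols, value_name="skill")
dummies = pd.crosstab(long["Id"], long["skill"]).clip(upper=1)

# Re-attach the indicators to the candidate records.
encoded = df[["Id"]].join(dummies, on="Id")
print(encoded)
```

The `.clip(upper=1)` keeps the result a 0/1 indicator even when a candidate lists the same skill in both a Knowledge and a Tag column.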

Cypher query on variable length path with specified end point

My graph model holds information on data lineage and how data moves from one column to another through column mappings in our ETL tool. A basic one hop pattern would look like this...
(source:Column)-[:SOURCE_OF_MAPPING]->(map:ColumnMapping)-[:TARGET_OF_MAPPING]->(target:Column)
so
source might be a column called "STAGING_TABLE_1.FULL_NAME",
target might be a column called "STAGING_TABLE_2.FULL_NAME" and
map would be whatever was specified in the select query within the ETL tool's dataflow. Perhaps something like "UPPER(STAGING_TABLE_1.FULL_NAME || STAGING_TABLE_1.TITLE)"
What I need to be able to do is: if I look at a specific target column, let's say "DATA_MART_FACT_1.FULL_NAME", from which column does this data originate?
The following is the cypher query I am trying to use but this only pulls back a single hop, i.e. the source and column mapping where the target is "DATA_MART_FACT_1.FULL_NAME".
MATCH (source:Column)-[:SOURCE_OF_MAPPING*]->(c:ColumnMapping)-[:TARGET_OF_MAPPING*]->(target:Column)
WHERE target.name = 'DATA_MART_FACT_1.FULL_NAME'
RETURN source, target, c
I have tried removing the relationship names, and just having an asterisk in the square brackets, but this just kills my neo4j installation (Currently sitting at 5GB memory and 50% CPU usage and hanging for around 10 minutes). There are constraints on all of the unique properties.
I know the data contains what I need because in the neo4j browser I can expand the nodes and follow the path through as I would expect to be able. Can anyone provide me with a cypher query that will allow me to do this? Perhaps my graph model needs a slight refactor in terms of relationship names and directions to allow this to work, which I'm perfectly happy to explore.
Here is some cypher to generate a basic example.
CREATE
(_0:`Column` {`name`:"STAGING_TABLE_1.FULL_NAME"}),
(_1:`Column` {`name`:"STAGING_TABLE_2.FULL_NAME"}),
(_2:`Column` {`name`:"DATA_MART_FACT_1.FULL_NAME"}),
(_3:`ColumnMapping` {`mappingText`:"UPPER(STAGING_TABLE_1.FULL_NAME)"}),
(_4:`ColumnMapping` {`mappingText`:"LOWER(STAGING_TABLE_2.FULL_NAME)"}),
(_0)-[:`SOURCE_OF_MAPPING`]->(_3),
(_3)-[:`MAPS_TO`]->(_1),
(_1)-[:`SOURCE_OF_MAPPING`]->(_4),
(_4)-[:`MAPS_TO`]->(_2)
Then the query I used only returns a single hop:
MATCH (source:Column)-[:SOURCE_OF_MAPPING*..10]->(c:ColumnMapping)-[:MAPS_TO*..10]->(target:Column) WHERE target.name = 'DATA_MART_FACT_1.FULL_NAME' RETURN source, target, c
The next query kind of returns what I'm after, but is missing the relationship between the first two nodes.
MATCH (source:Column)-[:SOURCE_OF_MAPPING|MAPS_TO*..10]->(n)-[:MAPS_TO]->(target:Column)
WHERE target.name = 'DATA_MART_FACT_1.FULL_NAME'
AND (n:Column or n:ColumnMapping)
RETURN *;
The end result that I would like from this is as follows (note, aliases are just included here to illustrate the dataflow, and for my requirements the actual results don't need to be aliased)...
(c1:Column)-[:SOURCE_OF_MAPPING]->(cm1:ColumnMapping)-[:MAPS_TO]->(c2:Column)-[:SOURCE_OF_MAPPING]->(cm2:ColumnMapping)-[:MAPS_TO]->(target:Column)
and in tabular format
source | mapping | target
STAGING_TABLE_1.FULL_NAME | UPPER(STAGING_TABLE_1.FULL_NAME) | STAGING_TABLE_2.FULL_NAME
STAGING_TABLE_2.FULL_NAME | LOWER(STAGING_TABLE_2.FULL_NAME) | DATA_MART_FACT_1.FULL_NAME
Oddly, when I created an interactive example (the site can be flaky and can sometimes take a few refreshes before it works), although the table returns one row as per my local installation, the visual graph representation shows all of the expected nodes and relationships.
Any and all advice is appreciated. Thanks in advance.
EDIT: I've refactored the direction of the column-mapping-to-target-column relationship to make the flow more natural, as if the data were flowing from source, to column mapping, to target. There was no change in behaviour though.
You could try breaking your query up using WITH, as in:
MATCH (t:Column {name:'DATA_MART_FACT_1.FULL_NAME'})-[:TARGET_OF_MAPPING*]->(c)
WITH t, c
MATCH (c)-[:SOURCE_OF_MAPPING*]->(s)
RETURN s,t,c
That can cut down on Cartesian-product situations. I haven't tried it in your case, but it's generally a good way to look at queries. Also, trim the fat on criteria: if :TARGET_OF_MAPPING only ever connects to :ColumnMapping, then you may not need to specify and test for that label.
For the example, the following query returns the full path from start to end:
MATCH (n)-[:SOURCE_OF_MAPPING|MAPS_TO*..4]->(target:Column)
WHERE target.name = 'DATA_MART_FACT_1.FULL_NAME'
AND (n:Column or n:ColumnMapping)
RETURN *;
I suspect for a real life example, where there may be many consecutive column mappings the results may have to be aggregated in some way in order to avoid killing performance with open ended variable length paths.
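As a sanity check on what the variable-length query should produce, here is a plain-Python walk (not Neo4j) over the same toy graph as the CREATE statement; it inverts the SOURCE_OF_MAPPING and MAPS_TO relationships, steps backwards from the target, and reproduces the two-row source/mapping/target table from the question.

```python
# Toy lineage graph mirroring the CREATE statement in the question:
# a column feeds a mapping (SOURCE_OF_MAPPING); a mapping feeds a column (MAPS_TO).
source_of = {
    "STAGING_TABLE_1.FULL_NAME": "UPPER(STAGING_TABLE_1.FULL_NAME)",
    "STAGING_TABLE_2.FULL_NAME": "LOWER(STAGING_TABLE_2.FULL_NAME)",
}
maps_to = {
    "UPPER(STAGING_TABLE_1.FULL_NAME)": "STAGING_TABLE_2.FULL_NAME",
    "LOWER(STAGING_TABLE_2.FULL_NAME)": "DATA_MART_FACT_1.FULL_NAME",
}

def lineage(target):
    """Walk backwards from `target`, collecting (source, mapping, target) hops."""
    mapping_for_target = {t: m for m, t in maps_to.items()}
    source_for_mapping = {m: s for s, m in source_of.items()}
    rows = []
    while target in mapping_for_target:
        mapping = mapping_for_target[target]
        source = source_for_mapping[mapping]
        rows.append((source, mapping, target))
        target = source
    return list(reversed(rows))

for source, mapping, target in lineage("DATA_MART_FACT_1.FULL_NAME"):
    print(source, "|", mapping, "|", target)
```

This assumes each column is the target of at most one mapping; with fan-in you would need a breadth-first search instead of a simple loop.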

How to delete field for a given measurement from influxdb?

I created multiple fields to test output in Grafana; however, I now want to delete the unwanted fields from InfluxDB. Is there a way?
Q: I want to delete the unwanted fields from influxdb, is there a way?
A: Short answer: no. Up until the latest release, 1.4.0, there is no straightforward way to do this.
The reason is that InfluxDB is explicitly designed to optimise point creation; functionality on the "UPDATE" and "DELETE" side is compromised as a trade-off.
To drop fields of a given measurement, the easiest way would be to:
1. Retrieve the data out first
2. Modify its content
3. Drop the measurement
4. Re-insert the modified data back
Reference:
https://docs.influxdata.com/influxdb/v1.4/concepts/insights_tradeoffs/
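For step 2 of that workaround, the field-stripping can be done on the point dictionaries after retrieval. A sketch: `strip_fields` is a hypothetical helper, and the points mimic the dicts a client such as influxdb-python's `get_points()` returns.

```python
def strip_fields(points, unwanted):
    """Return copies of the point dicts with the unwanted field keys
    removed, ready to be written back after the measurement is dropped."""
    return [{k: v for k, v in p.items() if k not in unwanted} for p in points]

# Hypothetical points, shaped like the dicts a query client returns.
points = [
    {"time": "2018-01-01T00:00:00Z", "value": 0.5, "debug_field": 1},
    {"time": "2018-01-01T00:01:00Z", "value": 0.7, "debug_field": 2},
]
cleaned = strip_fields(points, {"debug_field"})
```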

Neo4J Batch Inserter is slow with big ids

I'm working on an RDF file importer, but I have a problem: my data files have duplicate nodes. For this reason, I use big ids to insert the nodes using the batch inserter, but the process is slow. I have seen this post where Michael recommends using an index, but the process remains slow.
Another option would be to merge duplicate nodes but I think there is no automatic option to do so in Neo4J. Am I wrong?
Could anyone help me? :)
Thanks!
There is no duplicate handling in the CSV batch importer yet (it's planned for the next version), as it is non-trivial and memory-expensive.
Best to de-duplicate on your side.
Don't use externally supplied ids as node ids; they can get large from the beginning, and that just doesn't work. Use an efficient map (like Trove's) to keep the mapping between your key and the node id.
I usually use a two-pass approach with an array: sort the array so that the array index becomes the node id, then after sorting do another pass that nulls out duplicate entries.
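That two-pass idea can be sketched in a few lines (Python here for brevity; in Java you would use a primitive-collection map such as Trove's):

```python
def build_id_map(keys):
    """Two-pass de-duplication: collect the keys, sort them, drop
    duplicates, and use the position in the deduplicated array as the
    small, dense node id (instead of a huge external id or hashcode)."""
    deduped = sorted(set(keys))                    # pass 1: unique + sorted
    return {k: i for i, k in enumerate(deduped)}  # pass 2: key -> node id

# Node names taken from the RDF sample in this thread; the duplicate
# collapses to a single id, so repeats only need a relationship created.
keys = ["chembl_activity:CHEMBL_ACT_102540",
        "chembl_document:CHEMBL1129248",
        "chembl_activity:CHEMBL_ACT_102540"]
node_id = build_id_map(keys)
```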
Perfect :) The data would have the following structure:
chembl_activity:CHEMBL_ACT_102540 bao:BAO_0000208 bao:BAO_0002146 .
chembl_document:CHEMBL1129248 cco:hasActivity chembl_activity:CHEMBL_ACT_102551 .
chembl_activity:CHEMBL_ACT_102540 cco:hasDocument chembl_document:CHEMBL1129248 .
Each line corresponds to a relationship between two nodes, and we can see that the node chembl_activity:CHEMBL_ACT_102540 is duplicated.
I wanted to use the hashcode of the node name as the id, but that hashcode is a very large number, which slows the process. With a key-to-id map I can check the ids so as to only create the relationship, not duplicate the nodes.
Thanks for all! :)

Does label mechanism provide auto-indexing features when using neo4j-java API?

First of all, I bet there is an answer to this question somewhere in the docs, but since the 'Manual: Labels and Indexes' link here gives me a 404 error, I'm going to ask you anyway.
Is it possible to create an index on some label and specify it as an automatic one (just like legacy indexes I'm currently using, but for labels)?
If someone from the Neo4j team is reading this post, please let me know if I'm looking for the documentation in the right place, 'cause I can't find anything more or less informative on labels and indexes (except a couple of posts on Michael Hunger's blog and maybe some presentations, which is obviously not enough).
This one is more technical: is it possible to find an item in the index by regex? Suppose I have a node with property 'n' -> '/a/b/c', and another node with 'n' -> '/a/*/c'. Can I somehow match them?
I don't work for Neo4j but I'll answer anyway.
All label indexing is automatic. Once you've created the index it maintains itself, possibly with minimal delay.
The manual for the last stable release can always be found here. The chapter on indexing for the embedded Java API is here.
You cannot use regexes with label indices yet. It's said to be on the agenda, along with index support for array lookups, i.e. what in Cypher would be:
MATCH (a:MyLabel) WHERE a.value IN ['val1', 'val2']
