Mapping flat fields to sequential records - mapping

I have a source schema that defines a "ShippingCharge" and a "DiscountAmount". My destination schema is an EDI X12 850 message.
I need to create two "fake" iterations for the SAC loop: for the first iteration, use the ShippingCharge, and for the second, use the DiscountAmount. There are also a few "default values" that I need to set in SAC01, which likewise depend on the iteration (1 or 2).
What functoid should I be using? Any suggestions?

Have you tried the Table Looping functoid? You can use the Table Looping functoid to define multiple rows using input links (ShippingCharge and DiscountAmount) and constants (the SAC01 values). The output would then loop through these rows and create the two SACLoop1 elements.
You will need to use the Table Extractor functoid as well to handle each data value in the table.
Complete instructions on using Table Looping and Table Extractor can be found here: http://msdn.microsoft.com/en-us/library/aa559310%28v=bts.20%29.aspx

Related

Write to BQ one field of the rows of a PColl - Need the entire row for table selection

I have a problem:
My PCollection is made of rows in this format:
{'word': 'string', 'table': 'string'}
I want to write only the words into BigQuery; however, I need the table field to be able to select the right table in BigQuery.
This is how my pipeline looks:
tobq = (input
        | 'write names to BigQuery' >> beam.io.gcp.bigquery.WriteToBigQuery(
            table=compute_table_name,
            schema=compute_schema,
            insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
            create_disposition=beam.io.gcp.bigquery.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.gcp.bigquery.BigQueryDisposition.WRITE_APPEND))
The function compute_table_name accesses an element and returns the table field. Is there a way to write into BQ just the words while still having this table selection mechanism based on rows?
Many thanks!
Normally the best approach with a situation like this in BigQuery is to use the ignoreUnknownValues parameter in ExternalDataConfiguration. Unfortunately Apache Beam doesn't yet support enabling this parameter while writing to BigQuery, so we must find a workaround, as follows:
Pass a Mapping of IDs to Tables as a table_side_input
This solution only works if identical word values are guaranteed to map to the same table each time, or if there is some kind of unique identifier for your elements. This method is a bit involved, but it relies only on the Beam model instead of having to touch the BigQuery API.
The solution makes use of table_side_inputs to dynamically pick which table to place an element in, even if the element is missing the table field. The basic idea is to create a dict of ID:table (where ID is either the unique ID or just the word field). Creating this dict can be done with CombineGlobally, by combining all elements into a single dict.
Meanwhile, you use a transform to drop the table field from your elements before the WriteToBigQuery transform. Then you pass the dict into the table_side_inputs parameter of WriteToBigQuery and write a callable table parameter that looks the element up in the dict to figure out which table to use, instead of reading the table field.
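A minimal sketch of this approach is below, assuming elements of the form {'word': ..., 'table': ...} and that each word always maps to the same table. It uses beam.pvalue.AsDict over deduplicated (word, table) pairs in place of the CombineGlobally step; the transform labels, project/dataset names, and schema string are illustrative, not taken from your pipeline.

import apache_beam as beam
from apache_beam.pvalue import AsDict

with beam.Pipeline() as p:
    rows = p | 'Read' >> beam.Create([
        {'word': 'hello', 'table': 'my-project:my_dataset.table_a'},
        {'word': 'world', 'table': 'my-project:my_dataset.table_b'},
    ])

    # Side input: a single dict mapping each word to its destination table.
    word_to_table = AsDict(
        rows
        | 'ToKV' >> beam.Map(lambda r: (r['word'], r['table']))
        | 'Dedup' >> beam.Distinct())

    # Drop the table field so only the word is written.
    words_only = rows | 'DropTable' >> beam.Map(lambda r: {'word': r['word']})

    _ = words_only | 'WriteToBQ' >> beam.io.gcp.bigquery.WriteToBigQuery(
        # The table callable receives the element plus the side input(s).
        table=lambda row, mapping: mapping[row['word']],
        table_side_inputs=(word_to_table,),
        schema='word:STRING',
        create_disposition=beam.io.gcp.bigquery.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.gcp.bigquery.BigQueryDisposition.WRITE_APPEND)

The rows that reach WriteToBigQuery no longer carry the table field, yet the destination is still resolved per element through the side input.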

How to use kdb+ to track an arbitrary number of IOT scalar streams?

I am trying to use kdb+ to capture and do aggregations on a number of sensory streams collated from IoT sensors.
Each sensor has a unique identifier, a time component (.z.z) and a scalar value:
percepts:([]time:`datetime$(); id:`symbol$(); scalar:`float$())
However because the data is temporal in nature, it would seem logical to maintain separate perceptual/sensory streams in different columns, i.e.:
time id_1 id_2 ...
15   0.15
16        1.5
However, appending to a table only supports row operations in the insert fashion, i.e. percepts insert (.z.z; `id_1; 0.15)
Seeing as I would like to support a large and non-static number of sensors in this setup, it seems like an anti-pattern to append rows of the aforementioned format and then transform the rows into columns based on their id afterwards. Would it be possible/necessary to create a table with a dynamic (growing) number of columns based upon new feature streams?
How would one most effectively implement logic that allows the insertion of columnar time-series data, avoiding the need to transform row-based data?
You can add data to a specific column. To do that, make the following changes:
Make the time column a key, either permanently or during the update operation.
Use upsert to add data, and pass the data in table format.
The upd function below is specific to your example, but you can make it more generic. It takes a sensor name and sensor data as input and performs 3 steps:
It first checks if the table is empty; in that case it sets the table schema to that of the input dataset (which, per your example, should be the time and sensor-name columns) and makes time a primary key.
If the table has data but the column for the new sensor is missing, it first adds a column with null float values and then upserts the data.
If the column is already there, it just upserts the data.
q)t:() / table to store all sensors data
q)upd:{[s;tbl] `t set $[0=count t;`time xkey 0#tbl;not s in cols t;![t;();0b;enlist[s]!enlist count[t]#0Nf];t] upsert tbl}
q)upd[`id1;([]time:1#.z.z;id1:1#14.4)]
q)upd[`id2;([]time:1#.z.z;id2:1#2.3)]
time                    id1  id2
--------------------------------
2019.08.26T13:35:43.203 14.4
2019.08.26T13:35:46.861      2.3
Some points regarding your design:
If all sensors are not sending data for every time entry, then the table will have a lot of null values (similar to a sparse matrix), which is a waste of memory and will have some impact on queries as well.
In that case, you could consider another design depending on your use case. For example, instead of storing each time entry, store data in time buckets. Another option is to group related sensors into separate tables instead of storing them all in one.
Another point to consider is that you will end up with a wide table if you keep adding sensors to it, and that has its own issues. It will also become a single bottleneck, which could be a problem in the future, and scaling it would be hard.
For a small sensor set the current design is good, but if you are planning to add many sensors in the future then look into other design options.

How to define large set of properties of a node without having to type them all?

I have imported a CSV file into neo4j. I have been trying to define a large number of properties (all the columns) for each node. How can I do that without having to type in each name?
I have been trying this:
USING PERIODIC COMMIT
load csv WITH headers from "file:///frozen_catalog.csv" AS line
//Creating nodes for each product id with its properties
CREATE (product:product{id : line.`o_prd`,
Gross_Price_Average: TOINT(line.`Gross_Price_Average`),
O_PRD_SPG: TOINT(line.`O_PRD_SPG`)});
You can add properties from maps. For example:
LOAD CSV WITH HEADERS FROM "http://data.neo4j.com/northwind/products.csv" AS row
MERGE (P:Product {productID: row.productID})
SET P += row
http://neo4j.com/docs/developer-manual/current/cypher/clauses/set/#set-adding-properties-from-maps
The LOAD CSV command cannot perform automatic type conversion to ints on certain fields; that must be done explicitly (though you can avoid having to explicitly mention all the other fields by using the map projection feature to transform your line data before setting it via stdob--'s suggestion).
You may want to take a look at Neo4j's import tool, as this will allow you to specify field type in headers, which should perform type conversion for you.
That said, 77 columns is a lot of data to store on individual nodes. You may want to take another look at your data and figure out whether some of those properties would be better modeled as nodes with their own labels and relationships to your product nodes. You mentioned some of these were categorical properties. Categories are well suited to being modeled as separate nodes instead of properties, and maybe some of your other properties would work better as nodes as well.

bulk insert using OCI

I am currently inserting records one-by-one into a table from C++ code using OCI. The data is in a hashmap of structs; I iterate over the elements of the map, binding the attributes of the struct to the columns of a record in the table, e.g.:
define insert query
use OCIBindByName() for all the columns of the record
iterate over map
    assign bind variables from the attributes of the struct
    OCIStmtExecute
end
This is pretty slow, so I'd like to speed up by doing a bulk insert. What is a good way to do this? Should I use an array of struct to insert all the records in one OCIStmtExecute? Do you have any example code which shows how to do this?
Here is some sample code showing how I implemented this in OCI*ML. In summary, the way to do this is (say for a table with one column of integers):
malloc() a block of memory of sizeof(int) × the number of rows and populate it. This could be an array.
Call OCIBindByPos() with that pointer for *valuep and the size for value_sz.
Call OCIStmtExecute() with iters set to the number of rows from step 1
In my experience, speedups of 100× are certainly possible.
What you probably want to do is a 'bulk insert'. Bulk inserts from an array are done using array binds, where you bind the data of the first row to the first structure of the array and set the skips, which are generally the size of the structure. After this you can execute the statement once with the number of array elements as the iteration count. Binding and executing row by row creates overhead, hence bulk inserts are used.
bulk insert example.txt
by
{
    delimiter=','   // or any delimiter specified in your text files
    size=200kb      // or the size of your text file
}
Use DPL (Direct Path Loading).
Refer to docs.oracle.com for more info.

How to do lookups (or joins) with map-reduce?

How can I take the input set
{worker-id:1 name:john supervisor-id:3}
{worker-id:2 name:jane supervisor-id:3}
{worker-id:3 name:bob}
and produce the output set
{worker-id:1 name:john supervisor-name:bob}
{worker-id:2 name:jane supervisor-name:bob}
using a "pure" map-reduce framework, i.e. one with only a map phase and a reduce phase but without any extra feature such as CouchDB's lookup?
Exact details will depend on your map-reduce framework. But the idea is this. In your map phase, you emit two types of key/value pairs for each input record: the record keyed by its own worker-id as a potential boss, e.g. (1, {name:john type:boss}), and the record keyed by its supervisor-id as a worker, e.g. (3, {worker-id:1 name:john type:worker}). In your reduce phase you get all of the values for a key grouped together. If there is a record of type boss in there, then you remove that record and populate the supervisor-name of the other records. If there isn't, then you drop those records on the floor.
Basically you use the fact that data gets grouped by key then processed together in the reduce to do the join.
(In some map-reduce implementations you incrementally get key/value pairs put together in the reduce. In those implementations you can't throw away records that don't have a boss already, so you wind up needing to map-reduce-reduce for that final filtering step.)
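Here is a minimal Python sketch of that reduce-side join, with a plain dict standing in for the framework's shuffle/group-by-key step; the map_phase and reduce_phase names are illustrative and not tied to any particular framework.

from collections import defaultdict

def map_phase(record):
    # Emit the record under its own worker-id as a potential boss, and
    # (if it has one) under its supervisor-id as a worker that still
    # needs a supervisor name filled in.
    pairs = [(record['worker-id'], {'name': record['name'], 'type': 'boss'})]
    if 'supervisor-id' in record:
        pairs.append((record['supervisor-id'],
                      {'worker-id': record['worker-id'],
                       'name': record['name'],
                       'type': 'worker'}))
    return pairs

def reduce_phase(values):
    # If the group contains a boss record, attach its name to every worker
    # record in the group; otherwise drop the group.
    boss = next((v for v in values if v['type'] == 'boss'), None)
    if boss is None:
        return []
    return [{'worker-id': v['worker-id'], 'name': v['name'],
             'supervisor-name': boss['name']}
            for v in values if v['type'] == 'worker']

# Driver standing in for the shuffle: group mapped pairs by key, then reduce.
records = [{'worker-id': 1, 'name': 'john', 'supervisor-id': 3},
           {'worker-id': 2, 'name': 'jane', 'supervisor-id': 3},
           {'worker-id': 3, 'name': 'bob'}]
groups = defaultdict(list)
for rec in records:
    for key, value in map_phase(rec):
        groups[key].append(value)
output = [row for values in groups.values() for row in reduce_phase(values)]
print(output)
# [{'worker-id': 1, 'name': 'john', 'supervisor-name': 'bob'},
#  {'worker-id': 2, 'name': 'jane', 'supervisor-name': 'bob'}]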
Is there only one input file, or can there be more?
I mean, is it possible to have a case where a worker-id in one file has a supervisor-id whose description (the name of that supervisor) is in another file?
