I have data like this: schema1, a1, a2, ..., an, schema2, b1, b2, ..., bm. I know exactly how many data items I have for each schema. Can I write both schemas and their data into one Avro file, instead of two?
The DataFileWriter API only gives me create(). There is no append() that lets me write the second schema after the last data item of the first schema.
You should create a new union schema using Schema.createUnion(schema1, schema2) and use that as the writer schema for your file. When reading the data, you either use the union schema again (if both types are in your file) or just the schema that you know is present.
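For example, a minimal sketch with Avro's generic Java API could look like the following (the class, method and variable names are placeholders, not something from the original question):

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class UnionFileWriter {
    // Writes all records of schema1 followed by all records of schema2 into a
    // single Avro file whose writer schema is the union of the two schemas.
    static void writeBoth(Schema schema1, Schema schema2,
                          List<GenericRecord> aRecords, List<GenericRecord> bRecords,
                          File out) throws IOException {
        Schema union = Schema.createUnion(Arrays.asList(schema1, schema2));
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(union))) {
            writer.create(union, out);
            for (GenericRecord a : aRecords) writer.append(a);   // a1 ... an
            for (GenericRecord b : bRecords) writer.append(b);   // b1 ... bm
        }
    }
}

On the read side, opening the file with a GenericDatumReader and the same union schema hands the records of both types back in the order they were written.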
So, I have multiple tables in multiple databases.
The user sends me database details, table names, columns to be selected and which column to use as condition for join. The structure is similar to:
[{ database, table_name, join_column_name }]
My current implementation is like this:
1. Connect to the first DB
2. Fetch all data from the table
3. Store the data in a result variable (PG::Result instance)
4. Extract all unique values of the current join_column_name from result into a filter_values array
5. Connect to the next database
6. Fetch all data where the value of the current join_column_name is in filter_values
7. Store the data in a local_result variable (PG::Result instance)
8. Simulate an inner join on the result and local_result objects, and store the outcome back in result
9. Repeat from step 4
Also, the join_column_name may or may not be an indexed column.
In step 8, I have to create a new PG::Result object and store the mapped data from the result and local_result objects. How do I achieve this?
Is there a better way to do this, without all this custom logic?
I've looked into disable_joins: true, but I don't know whether it is applicable in this case.
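To make step 8 concrete, here is a rough sketch of the hash join I'm simulating (shown in Java purely for illustration; each row is modeled as a plain map and every name is hypothetical):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashJoinSketch {
    // Inner-joins two row sets on joinColumn: index the left rows by their join
    // value, then merge every matching right row into a combined row.
    static List<Map<String, Object>> innerJoin(List<Map<String, Object>> left,
                                               List<Map<String, Object>> right,
                                               String joinColumn) {
        Map<Object, List<Map<String, Object>>> index = new HashMap<>();
        for (Map<String, Object> row : left) {
            index.computeIfAbsent(row.get(joinColumn), k -> new ArrayList<>()).add(row);
        }
        List<Map<String, Object>> joined = new ArrayList<>();
        for (Map<String, Object> rightRow : right) {
            for (Map<String, Object> leftRow :
                     index.getOrDefault(rightRow.get(joinColumn), Collections.emptyList())) {
                Map<String, Object> merged = new HashMap<>(leftRow);
                merged.putAll(rightRow);   // right-hand columns win on name clashes
                joined.add(merged);
            }
        }
        return joined;
    }
}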
I have imported a CSV file into Neo4j. I have been trying to define a large number of properties (all the columns) for each node. How can I do that without having to type in each name?
I have been trying this:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///frozen_catalog.csv" AS line
// Creating nodes for each product id with its properties
CREATE (product:product {id: line.`o_prd`,
  Gross_Price_Average: TOINT(line.`Gross_Price_Average`),
  O_PRD_SPG: TOINT(line.`O_PRD_SPG`)});
You can add properties from maps. For example:
LOAD CSV WITH HEADERS FROM "http://data.neo4j.com/northwind/products.csv" AS row
MERGE (P:Product {productID: row.productID})
SET P += row
http://neo4j.com/docs/developer-manual/current/cypher/clauses/set/#set-adding-properties-from-maps
The LOAD CSV command cannot perform automatic type conversion to ints on certain fields; that must be done explicitly (though you can avoid having to explicitly mention all the other fields by using the map projection feature to transform your line data before setting it via stdob--'s suggestion).
You may want to take a look at Neo4j's import tool, as it allows you to specify field types in the headers, which should perform the type conversion for you.
That said, 77 columns is a lot of data to all store on individual nodes. You may want to take another look at your data and figure out if some of those properties would be better modeled as nodes with their own label with relationships to your product nodes. You mentioned some of these were categorical properties. Categories are well suited to be modeled separately as nodes instead of as properties, and maybe some of your other properties would work better as nodes as well.
Could you please advise how to build an "if statement" in SPSS Modeler when we have two data sources?
One data source (1) is a table (an output node generated by SPSS Modeler) listing all the IDs we need to work with further.
The other data source (2) is an Excel file listing IDs; this list includes some IDs from (1) but also some additional ones. Each of these IDs has a value assigned to it, and these values need to be added to data source (1), not necessarily to the table.
So if an ID from (1) is also in (2), we would like to assign the value from (2) to that ID in (1) and have the result stored in some table, or even better in a file.
Thank you very much for your help / advice.
Patricia
Based on your problem, it sounds like you want to merge these datasets. This can be done easily in Modeler via the Merge Node; just make sure the variables have the same name, or Modeler won't recognize them as a key. You can see an example here.
You can also create a flag variable using the Derive node; see the example here.
You will have to use the Merge Node to combine the two datasets, but you don't have to give the key IDs the same name. You can use the condition option in the Merge Node, which doesn't require the keys to have the same name or even the same variable type.
Syntax example for the Merge Node's condition option: 'ID' = 'id'
I have a few TB of log data in JSON format, and I want to convert it into Parquet format to gain better performance in the analytics stage.
I've managed to do this by writing a MapReduce Java job which uses parquet-mr and parquet-avro.
The only thing I'm not satisfied with is that my JSON logs don't have a fixed schema: I don't know all the fields' names and types. Besides, even if I knew all the fields' names and types, my schema evolves as time goes on; for example, there will be new fields added in the future.
For now I have to provide an Avro schema for AvroWriteSupport, and Avro only allows a fixed number of fields.
Is there a better way to store arbitrary fields in Parquet, just like JSON?
One thing that's for sure is that Parquet needs an Avro schema in advance, so we'll focus on how to get the schema.
Use SparkSQL to convert JSON files to Parquet files.
SparkSQL can infer a schema automatically from the data, so we don't need to provide a schema ourselves. Every time the data changes, SparkSQL will infer a different schema.
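As a rough sketch (this assumes the Spark 2.x Java API; the app name and paths below are placeholders), the whole conversion is essentially a read followed by a write:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JsonToParquetWithSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("json-to-parquet")          // placeholder app name
                .getOrCreate();

        // The schema is inferred by scanning the JSON input.
        Dataset<Row> logs = spark.read().json("hdfs:///logs/json/");     // placeholder input path
        logs.printSchema();                                              // inspect the inferred schema

        logs.write().mode(SaveMode.Overwrite).parquet("hdfs:///logs/parquet/");  // placeholder output path
        spark.stop();
    }
}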
Maintain an Avro schema manually.
If you don't use Spark but only Hadoop, you need to infer the schema manually. First write a MapReduce job to scan all the JSON files and collect all the fields; once you know all the fields, you can write an Avro schema. Use this schema to convert the JSON files to Parquet files.
There will be new, unknown fields in the future; every time new fields appear, add them to the Avro schema. So basically we're doing SparkSQL's job manually.
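For the manual route, a rough sketch using parquet-avro's AvroParquetWriter together with Avro's JSON decoder might look like the following (the class, method names, paths and schema file are placeholders; also note that Avro's JSON decoder expects Avro's own JSON encoding, so in practice the MapReduce mapper often builds each GenericRecord field by field instead):

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class JsonToParquetManual {
    static void convert(List<String> jsonLines, File schemaFile, String outPath) throws IOException {
        // The hand-maintained Avro schema; extend it whenever new fields appear.
        Schema schema = new Schema.Parser().parse(schemaFile);
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);

        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path(outPath))
                         .withSchema(schema)
                         .build()) {
            for (String line : jsonLines) {
                // Decode one JSON log line into a GenericRecord using the schema.
                GenericRecord record = reader.read(null,
                        DecoderFactory.get().jsonDecoder(schema, line));
                writer.write(record);
            }
        }
    }
}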
Use Apache Drill!
From https://drill.apache.org/docs/parquet-format/, it can be done in one line of SQL.
After setting up Apache Drill (with or without HDFS), execute sqlline to run SQL queries:
-- Set the default storage format
ALTER SESSION SET `store.format` = 'parquet';
ALTER SYSTEM SET `store.format` = 'parquet';

-- Migrate the data
CREATE TABLE dfs.tmp.sampleparquet AS
  (SELECT trans_id,
          cast(`date` AS date) transdate,
          cast(`time` AS time) transtime,
          cast(amount AS double) amountm,
          user_info, marketing_info, trans_info
   FROM dfs.`/Users/drilluser/sample.json`);
It should take some time, maybe hours, but at the end you have light and cool Parquet files ;-)
In my test, querying a Parquet file is about 4x faster than querying the JSON and asks for fewer resources.
I have a table in a Rails application with a column of type String. This column, which we shall call mycolumn, is formatted such that each entry fits the format NN:NN, where N is some digit from 0-9.
Now, where I'm having trouble is that I need to find all the records such that mycolumn[0..1] is within a certain range (let's just say 35).
I'm thinking the statement would look something like
Mytable.find(:all, :conditions => ['? <= 35', :mycolumn[0..1].to_i])
Would this work? Is there another way to do this?
No, that won't work that way; you are going to have to either:
1. Format a SQL statement that works with the database you've chosen
2. Use a datatype in your table that you can custom query on (for example, store the two NN parts in separate columns)
3. Retrieve all models and do the select conditions in Ruby
I recommend #2: store the core data, and then add a method to format it as NN:NN.