How does Pig handle unstructured data while Hive can't? - comparison

According to this and other references, Pig is better than Hive at processing unstructured data. So the data is first cleansed with Pig and then processed with Hive.
But, in the data factory, data may not be in a nice, standardized state yet. This makes Pig a good fit for this use case as well, since it supports data with partial or unknown schemas, and semi-structured or unstructured data.
I would like to know more about how Pig can handle unstructured data while Hive can't.

Pig is built to process schema-less data sets, whereas in Hive we enforce a schema, which is stored in the metastore (Derby by default, or it can be configured to use MySQL). Beyond that, it is not clear what you are looking for!

The key difference between Pig and Hive is that Pig is a dataflow language while Hive is a declarative language. That said, Pig can handle unstructured data with no schema defined, whereas Hive requires a schema. In some cases Pig can also work with data that does have a schema, giving it an upper hand over Hive. Hive, in contrast, turns Hadoop into a data warehouse and acts like a SQL dialect. Lastly, you might want to look at Jaql, which is another dataflow language. Unlike Pig, its native data structure format is JSON. Like Pig, Jaql does not require a schema. Hope this helps.

Related

GraphQL join on a common column

How can we do a data join on a common column in GraphQL?
For example, in SQL: SELECT t.name, z.address FROM t, z WHERE t.id = z.id;
How is this managed by a GraphQL query?
The short answer: It's not (managed). You have to write it in your resolvers.
Your GraphQL schema is basically just a declaration of types that represent nodes, and fields that represent edges in a graph. GraphQL is the language to query this graph. At each edge in your schema, you write a function for how to get the data when the query traverses that edge. This function is called a "resolver", and you can put pretty much any code in your resolvers as long as it returns valid data. This means GraphQL is completely, absolutely database-independent. Your resolvers are responsible for talking to your database(s).
So when it comes to SQL, and joins, the concern of making the queries is entirely the responsibility of the API developer. GraphQL knows nothing about SQL or joins, so this is essentially custom logic for your application which you must write. It turns out that building SQL queries to resolve data for GraphQL queries is very difficult without running into problems like eagerly fetching too much data or the "N+1" problem.
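To make that concrete, here is a minimal, hypothetical sketch (Python with an in-memory SQLite database; the users and addresses tables and column names are invented for illustration) of what the body of a resolver might do. The join is ordinary SQL that you write yourself; GraphQL merely calls the function when the field is requested.
import sqlite3

# Hypothetical tables: users(id, name) and addresses(id, address) stand in
# for the t and z tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE addresses (id INTEGER PRIMARY KEY, address TEXT);
    INSERT INTO users VALUES (1, 'Ada');
    INSERT INTO addresses VALUES (1, '12 Graph St');
""")

def resolve_user_with_address(obj, args):
    # The "join" lives here, in resolver code, not in GraphQL itself.
    row = conn.execute(
        "SELECT u.name, a.address FROM users u "
        "JOIN addresses a ON u.id = a.id WHERE u.id = ?",
        (args["id"],),
    ).fetchone()
    return {"name": row[0], "address": row[1]} if row else None

print(resolve_user_with_address(None, {"id": 1}))
Whatever GraphQL server library you use, the pattern is the same: the resolver is an ordinary function, and any joining, batching, or caching is your code's responsibility.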
Some tools in the open-source community have emerged to help solve this problem. Here are a couple I would recommend in the Node.js space:
DataLoader - A database-agnostic tool that batches queries and caches individual records.
Join Monster - A SQL-tailored tool for efficient data-retrieval and query batching. It examines each query and your schema and generates SQL queries dynamically.
Disclaimer: I'm a co-creator of Join Monster.

Convert JSON to Parquet

I have a few TB of log data in JSON format, and I want to convert it into Parquet format to get better performance in the analytics stage.
I've managed to do this by writing a MapReduce Java job which uses parquet-mr and parquet-avro.
The only thing I'm not satisfied with is that my JSON logs don't have a fixed schema: I don't know all the fields' names and types. Besides, even if I knew all the fields' names and types, my schema evolves as time goes on; for example, new fields will be added in the future.
For now I have to provide an Avro schema for AvroWriteSupport, and Avro only allows a fixed set of fields.
Is there a better way to store arbitrary fields in Parquet, just like JSON?
One thing is for sure: Parquet needs an Avro schema in advance. So we'll focus on how to get the schema.
Use SparkSQL to convert JSON files to Parquet files.
SparkSQL can infer a schema automatically from the data, so we don't need to provide a schema ourselves. Every time the data changes, SparkSQL will infer a different schema.
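As a rough sketch of this route (PySpark, with made-up HDFS paths), the whole conversion is only a few lines; Spark infers the schema from whatever fields are present in the JSON at the time you run it:
from pyspark.sql import SparkSession

# Paths are placeholders; Spark infers the schema from the JSON data itself.
spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

df = spark.read.json("hdfs:///logs/*.json")    # schema inferred here
df.printSchema()                               # inspect what was inferred
df.write.mode("overwrite").parquet("hdfs:///logs-parquet/")
Each run may infer a different schema as the data evolves, which is exactly the behaviour described above.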
Maintain an Avro schema manually.
If you don't use Spark but only Hadoop, you need to infer the schema yourself. First write a MapReduce job to scan all the JSON files and collect every field; once you know all the fields you can write an Avro schema. Then use this schema to convert the JSON files to Parquet files.
There will be new, unknown fields in the future; every time new fields appear, add them to the Avro schema. So basically we're doing SparkSQL's job manually.
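A rough sketch of that first scanning pass, assuming newline-delimited JSON logs under a made-up logs/ directory; it only collects top-level field names, so choosing the Avro types is still a manual decision:
import glob
import json

# Collect the union of top-level field names across all JSON log records.
fields = set()
for path in glob.glob("logs/*.json"):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                fields.update(json.loads(line).keys())

print(sorted(fields))  # use this list to write (or extend) the Avro schema by hand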
Use Apache Drill!
Following https://drill.apache.org/docs/parquet-format/, it takes one line of SQL.
After setting up Apache Drill (with or without HDFS), run the sqlline script to execute SQL queries:
-- Set the default storage format
ALTER SESSION SET `store.format` = 'parquet';
ALTER SYSTEM SET `store.format` = 'parquet';
-- Migrate the data
CREATE TABLE dfs.tmp.sampleparquet AS (SELECT trans_id, cast(`date` AS date) transdate, cast(`time` AS time) transtime, cast(amount AS double) amountm, user_info, marketing_info, trans_info FROM dfs.`/Users/drilluser/sample.json`);
It can take a while, maybe hours, but at the end you have nice, compact Parquet files ;-)
In my tests, querying a Parquet file was about 4x faster than JSON and required fewer resources.

How to execute SQL statements on a dataset which didn't come from a database?

Suppose I have an application which fetches a custom XML packet from the server which represents a dataset. Then, suppose I wish to execute a SQL statement on that data via a dataset. What can I use to do this? I don't need to know the code necessarily, but just what to use to make this possible and a general explanation of how.
For example, I may fetch a list of customers in XML format from the server. Then, I can use any third-party parser to dump that XML data into some client dataset. Then, execute a query on that dataset, for example select * from customers where ZipCode = '12345' without fetching this data from the server again.
XML is not the only case; that's just an example. I might want to do the same with some application settings loaded from an INI file. Either way, the concept is that the original source of the data is unknown.
Whether the dataset stores its temporary data in memory or on disk doesn't matter, but it would be excellent if it could keep it on disk.
TXQuery (http://code.google.com/p/txquery/) is a component that provides a local SQL engine for executing SQL queries against one or more TDataSets. The only issue I have had with it is updating data via a TDBGrid for a query joining multiple tables (TDataSets) - specifically, determining which table is being updated.
AnyDAC v6 (now FireDAC) also has a local SQL engine: http://www.da-soft.com/anydac/docu/frames.html?frmname=topic&frmfile=Local_SQL.html
Edit: For the example SQL in your question, because it only involves a single table, you can do this with just a Filter on the dataset. For example:
ADataSet.Filtered := False;
ADataSet.Filter := 'ZipCode=' + QuotedStr('12345');
ADataSet.Filtered := True;
Such a feature can be done using a local database. You just insert the TDataSet result into a local in-memory (or file-based) stand-alone database, then you can use regular SQL queries on it, including JOIN.
You can for instance use SQLite3, or the free edition of NexusDB.
NexusDB embedded has the benefit of being a native Delphi database, so it sticks to the DB.pas TDataSet paradigm.
Another option is to use the so-called Virtual Table mechanism of SQLite3, which allows you to expose any data (even from a TDataSet, XML, JSON or in-memory objects) to the SQLite3 engine, just as if it were a regular table. Then you can run SQL statements on those "virtual" tables, including JOINs. With this approach, you do not need to INSERT the data into regular tables; the data remains in its original form. Of course, you will miss some performance features like indexes, which have to be handled on the virtual table provider side. We use this feature as the database core of our mORMot ORM/SOA framework, and it is pretty powerful.
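The components above are Delphi-specific, but the "copy into a local SQL engine" idea itself is language-neutral. Here is a minimal sketch using Python's built-in sqlite3 module (the customers table and its rows are invented to mirror the example in the question):
import sqlite3

# ":memory:" keeps the data in RAM; pass a file path to keep it on disk instead.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (Name TEXT, ZipCode TEXT)")

# These rows would really come from your parsed XML, INI file, or other source.
rows = [("Alice", "12345"), ("Bob", "67890")]
db.executemany("INSERT INTO customers VALUES (?, ?)", rows)

for name, zipcode in db.execute(
        "SELECT * FROM customers WHERE ZipCode = ?", ("12345",)):
    print(name, zipcode)
Replacing ":memory:" with a file path gives you the on-disk behaviour the question asks about, without refetching anything from the server.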
The general process that you want to perform is complicated by the difference in data representation. SQL data is stored in tables made up of distinguishable records. XML is a structured representation of data, but in tree form rather than table/row form.
Each of these data forms may be qualified by a schema that provides a context for the data.
You have two general paths that you can follow:
Take the XML and, based on the schema, insert it into a set of interlinked tables, then perform the SQL query. If you have the schema, you can use code generators to make a parser, and then, based on the parse tree, insert into a local database with tables constructed on the fly. You can set up MySQL fairly easily from https://dev.mysql.com/doc/refman/5.7/en/installing.html and then, in your version of Delphi, make a connection to the database, fill it in first, then query. This would satisfy your desire to have the data stored on disk; unless you purge the tables when done, the data remains available in the local machine's database.
This seems like more work than:
Use XPath or XQuery and work directly on the XML. For this, a package like Saxon in your favorite environment, or expat in Python, would work nicely (see the sketch after this answer).
Let me know if either of these paths seems as if it may be fruitful.
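As for the XPath sketch mentioned in the second path: here is a small illustration using Python's xml.etree.ElementTree and its limited XPath support (the XML layout is invented to match the customers example above):
import xml.etree.ElementTree as ET

# Invented XML layout, matching the customers example above.
xml_doc = """
<customers>
  <customer name="Alice" zip="12345"/>
  <customer name="Bob"   zip="67890"/>
</customers>
"""

root = ET.fromstring(xml_doc)
# Rough equivalent of: SELECT * FROM customers WHERE ZipCode = '12345'
for c in root.findall(".//customer[@zip='12345']"):
    print(c.get("name"), c.get("zip"))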

Use CSV to populate Neo4j

I am very new to Neo4j and still learning this graph database. I need to load a CSV file into a Neo4j database. I have been trying for 2 days and couldn't find good information on reading a CSV file into Neo4j. Please suggest sample code or blog posts about reading a CSV file into Neo4j.
Example:
Suppose I have a CSV file like this; how can we read it into Neo4j?
id name language
1 Victor Richards West Frisian
2 Virginia Shaw Korean
3 Lois Simpson Belarusian
4 Randy Bishop Hiri Motu
5 Lori Mendoza Tok Pisin
You may want to try https://github.com/sroycode/neo4j-import
This populates data directly from a pair of CSV files (entries must be comma-separated).
To build (you need Maven):
sh build.sh
The nodes file has a mandatory field id, plus any other fields you like.
NODES.txt
id,name,language
1,Victor Richards,West Frisian
2,Virginia Shaw,Korean
3,Lois Simpson,Belarusian
The relationships file has 3 mandatory fields: from, to, type. Assuming you also have a field age (long integer) and a field info, the relationships file will look like:
RELNS.txt
from,to,type,age#long,info
1,2,KNOWS,10,known each other from school
1,3,CLUBMATES,5,member of country club
Running:
sh run.sh graph.db NODES.txt RELNS.txt
will create graph.db in the current folder which you can copy to the neo4j data folder.
Note:
If you are using a Neo4j version later than 1.6.*, please add this line to conf/neo4j.properties:
allow_store_upgrade = true
Have fun.
Please take a look at https://github.com/jexp/batch-import
It can be used as a starting point.
There is nothing available to generically load CSV data into Neo4j because the source and destination data structures are different: CSV data is tabular whereas Neo4j holds graph data.
In order to achieve such an import, you will need to add a separate step to translate your tabular data into some form of graph (e.g. a tree) before it can be loaded into Neo4j. Taking the tree structure further as an example, the page below shows how XML data can be converted into Cypher which may then be directly executed against a Neo4j instance.
http://geoff.nigelsmall.net/xml2graph/
Please feel free to use this tool if it helps (bear in mind it can only deal with small files) but this will of course require you to convert your CSV to XML first.
Cheers
Nigel
There is probably no ready-made CSV importer for Neo4j, so you must import the data yourself.
I usually do it via Gremlin's g.loadGraphML() function.
http://docs.neo4j.org/chunked/snapshot/gremlin-plugin.html#rest-api-load-a-sample-graph
I parse my data with an external script into the GraphML syntax and load the resulting XML file. You can view the syntax here:
https://raw.github.com/tinkerpop/gremlin/master/data/graph-example-1.xml
Parsing a 100 MB file takes a few minutes.
In your case, what you need is a simple bipartite graph with vertices consisting of users and languages, and "speaks" edges. If you know some programming, create user nodes with parameters id and name, unique language nodes with a name parameter, and relationships connecting each user to the appropriate language. Note that user names can be duplicated whereas languages can't.
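Purely to illustrate that bipartite approach, here is a small Python sketch that reads the CSV from the question (assuming it is comma-separated with an id,name,language header) and prints Cypher MERGE statements; it presumes a Neo4j version new enough to support labels and MERGE in Cypher, so treat it as a sketch rather than a drop-in importer:
import csv

# Assumes users.csv with a header line: id,name,language
with open("users.csv", newline="") as f:
    for row in csv.DictReader(f):
        name = row["name"].replace("'", "\\'")
        lang = row["language"].replace("'", "\\'")
        # One statement per row: merge the user, merge the language, link them.
        print("MERGE (u:User {id: %s, name: '%s'}) "
              "MERGE (l:Language {name: '%s'}) "
              "MERGE (u)-[:SPEAKS]->(l);" % (row["id"], name, lang))
MERGE keeps the language nodes unique while still creating one User node per id, matching the "users can repeat, languages can't" point above.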
I believe your question is too generic. What does your CSV file contain? The logical meaning of the contents of a CSV file can vary a great deal. One example is two columns of IDs representing entities connected to each other:
3921 584
831 9891
3841 92
...
In this case you could either write a BatchInserter code snippet which would import it faster, see http://docs.neo4j.org/chunked/milestone/batchinsert.html.
Or you could import using the regular GraphDatabaseService with transactions of a couple of thousand inserts each for performance. See how to set up and use the embedded graph database at http://docs.neo4j.org/chunked/milestone/tutorials-java-embedded.html.

Convert XML into a Dataset

I'm trying to convert an XML document into a dataset that I can import into a database (like SQLite or MySQL) that I can query from.
It's an XML file that holds most of the stuff in attributes. This is part of a Rails project so I'm very inclined to use Ruby (and that's the language I'm most comfortable with at the moment).
I'm not sure how to go about doing that and I'd welcome both high-level and low-level contributions.
xmlsimple can convert your XML into a Ruby object (or nested objects) which you can then walk over and do whatever you like with. It makes working with XML in Ruby really easy. As Jim says, though, it depends on your XML's complexity and your needs.
There are three basic approaches:
Use Ruby's XML stream-parsing facilities to process the data with Ruby code and write the appropriate rows to the database (a sketch of this approach follows this answer).
Transform the XML using XSLT into a non-XML stream format and feed that into a Ruby program that updates the database.
Transform the XML with XSLT into a format acceptable to the bulk-loading tool for whatever database you are using.
Only you can determine the best approach depending on the XML schema complexity and the type of mapping you have to perform to get it into relational format.
It might help if you could post a sample of the XML and the DB schema you have to populate.
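To make the first approach from the list above concrete, here is a short sketch. It is in Python rather than Ruby purely for illustration (the element and attribute names are invented, with most data held in attributes as in the question); the same shape works with a Ruby stream parser such as Nokogiri's SAX API:
import sqlite3
import xml.etree.ElementTree as ET

# Invented layout: <records><record id="1" name="Foo"/></records>, i.e. most
# of the data held in attributes, as described in the question.
db = sqlite3.connect("records.db")
db.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, name TEXT)")

for event, elem in ET.iterparse("data.xml"):
    if elem.tag == "record":
        db.execute("INSERT INTO records VALUES (?, ?)",
                   (elem.get("id"), elem.get("name")))
        elem.clear()  # free memory as we stream through a large file

db.commit()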
Will it load model data? If you're on *nix take a look at libxml-ruby. =)
With it you can load the XML, and by iterating through the nodes you can create your AR objects.
You can have a look at the XMLMapping gem. It lets you define different classes depending upon the structure of your XML. Now you can create objects from those classes.
Now you will have to write some module which actually converts these XMLMapping objects into ActiveRecord objects. Once those are converted to AR objects you can simply call save to save those objects into the corresponding tables.
It is a long solution but it will let you create objects out of your XML without iterating over it. XMLMapping will do it for you.
Have you considered loading the data into an XML database?
Without knowing what the structure of the data is, I have no idea what the benefits of an RDBMS over an XML DB are.
