Hive IllegalStateException: Ambiguous input path

I'm running a query in Hive on a partitioned table:
select count(*) from activity where datestamp=2016-08-16
However, the query throws the following exception:
java.lang.IllegalStateException: Ambiguous input path hdfs://ip-172-29-1-53.us-west-2.compute.internal:8020/hive/dcm/activity/datestamp=2016-10-01/part-r-00000-41b9fc2f-101c-423a-901e-0f617c8fbd62.gz.parquet
at org.apache.hadoop.hive.ql.exec.MapOperator.getNominalPath(MapOperator.java:454)
at org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:501)
at org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1072)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:545)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:83)
Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.lang.IllegalStateException: Ambiguous input path hdfs://ip-172-29-1-53.us-west-2.compute.internal:8020/hive/dcm/activity/datestamp=2016-08-16/part-r-00000-1fd9aa5b-6e66-4bf9-b015-a940cbd6cc5a.gz.parquet
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I have checked that the path actually has partitions. I also used the parquet-tools jar to open the file, and it does look like the file has data in the right format. Any leads on what is ambiguous about the path?

We encountered the same problem as yours when an insert statement with dynamic partitioning had been executed earlier, possibly inserting into existing partitions.
To restore service and prevent the more severe problems that the possibly corrupted metadata (partition info) could lead to, we applied a quick fix:
We manually cleansed the partition metadata.
That is, we executed an alter table xxxx drop partition (tag >= 'yyyyyyyy'); DDL statement to drop all the partitions. (For an external table this does not invoke any HDFS operation; the data stays intact.)
And then:
Executed a msck repair table command.
After this fix, queries to that table became normal again.
So my guess is that the partition metadata suggested that more than one partition pointed to the same path (hence the complaint that the path is ambiguous).
To execute a Hive query, the execution engine first fetches metadata before diving into the underlying file system.
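A minimal sketch of that two-step fix, using the table and partition column from the question (the cutoff value is illustrative, and this assumes activity is an external table so the files on HDFS stay untouched):
-- drop all (possibly inconsistent) partition entries from the metastore
alter table activity drop if exists partition (datestamp >= '1970-01-01');
-- rebuild the partition metadata from the directories that actually exist on HDFS
msck repair table activity;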

Related

Referencing an ABS location for a stored procedure using Azure Data Factory

My ADF pipeline processes many input files and lands them in an ABS container.
A 3rd-party stored proc has two params, filename and datasourcelocation, and does a bulk insert into an Azure SQL DB.
For the Data Source Location I pass: landing/Vendor
For the Data File /Company_05_17_22_05_54.csv
The full ABS location for a single file is: https://...use2dev01.blob.core.windows.net/landing/Vendor/Company_05_17_22_05_54.csv
The error message says
Execution fail against sql server. Sql error number: 12703. Error Message: Referenced external data source "landing/vendor" not found.
How should I be passing the ABS location to the proc?
The error message says
Execution fail against sql server. Sql error number: 12703. Error Message: Referenced external data source "landing/vendor" not found.
For this error, you can create a new external data source (e.g. myazureblobstorage1) and provide that name in the BULK INSERT command (see the sketch after the list below).
Check the TLS version of the storage account; TLS settings can sometimes cause this error.
Create the external data source, then wait a few minutes before running the BULK INSERT.
Check that the CSV file is formatted properly; a malformed file can cause other issues.
Take note of the additional parameters in the BULK INSERT command.
Make sure the table column names match those of the CSV file.
If you want to bulk insert from Azure Blob Storage, please refer to the script in the SO thread linked below.
Please check whether the data source exists; you can verify its name by querying sys.external_data_sources and referring to the MS docs.
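As a sketch of that approach (the credential, target table, and storage-account names below are placeholders; only the container and path layout follow the question): the external data source points at the container, and the path inside the container goes into the BULK INSERT file name.
CREATE DATABASE SCOPED CREDENTIAL MyBlobCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<SAS token without the leading ?>';

CREATE EXTERNAL DATA SOURCE myazureblobstorage1
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://<storage-account>.blob.core.windows.net/landing',
    CREDENTIAL = MyBlobCredential
);

BULK INSERT dbo.Company                       -- target table is a placeholder
FROM 'Vendor/Company_05_17_22_05_54.csv'      -- path relative to the container in LOCATION
WITH (DATA_SOURCE = 'myazureblobstorage1', FORMAT = 'CSV', FIRSTROW = 2);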
For more detail, please refer to the links below:
Bulk insert from Azure blob storage to Azure SQL database
SQL Server BULK INSERT does not work with Azure Blob Storage emulator
https://learn.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-portal

Is there a way to configure the filename for a Neo4j Desktop database dump file to exclude the timestamp?

I'm a first-time user of Neo4j, following a training course to install it and learn the basics.
I've installed Neo4j Desktop on a Windows machine and can see that it comes with a demo DB called "Movie DBMS". I'm trying to follow the steps to dump the database: stopping the database, clicking on "..." and then "Dump".
The dump fails with the following error in the log file:
[2022-01-31 12:54:36.022] [error] Selecting JVM - Version:11.0.8+10-LTS, Name:OpenJDK 64-Bit Server VM, Vendor:Azul Systems, Inc.
java.nio.file.InvalidPathException: Illegal char <:> at index 128: C:\Users\<me>\.Neo4jDesktop\relate-data\projects\<my project name>\movie-dbms-neo4j-31-Jan-2022-12:54:31.dump
It would appear that the automatic configuration for the dump file is adding a timestamp which includes colons (hh:mm:ss). How can I configure the file name to either exclude the timestamp or avoid using ":"?
Thanks.
I had no responses. But I've figured it out myself.
The answer was to use the command line to dump the database manually. At that point I can specify my own "--to=" filename which doesn't include a ":".
Details in this section of the manual: https://neo4j.com/docs/operations-manual/current/backup-restore/offline-backup/#offline-backup
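A sketch of such a command (the database name and output path are placeholders; run it from the DBMS installation directory, using the Neo4j 4.x neo4j-admin syntax):
bin\neo4j-admin dump --database=neo4j --to=C:\backups\movie-dbms-2022-01-31.dump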

Neo4j APOC import error

I have a data model that starts with a single record. This record has a custom "recordId" that's a UUID; it relates out to other nodes, and those in turn relate to each other. That starting node defines the data that "belongs" together, as if we had separate databases inside Neo4j. I need to export this data into a backup data set that can be re-imported into either the same or a new database with ease.
After some help, I'm using APOC to do the export:
call apoc.export.cypher.query("MATCH (start:installations)
WHERE start.recordId = \"XXXXXXXX-XXX-XXX-XXXX-XXXXXXXXXXXXX\"
CALL apoc.path.subgraphAll(start, {}) YIELD nodes, relationships
RETURN nodes, relationships", "/var/lib/neo4j/data/test_export.cypher", {})
There are then 2 problems I'm having:
Problem 1 is that the exported data uses internal Neo4j identifiers to generate the relationships. This is bad if we need to import into a new database where the UNIQUE IMPORT ID values already exist. I need this data generated with my own custom recordIds as the point of reference.
Problem 2 is that the import doesn't even work.
call apoc.cypher.runFile("/var/lib/neo4j/data/test_export.cypher") yield row, result
returns:
Failed to invoke procedure apoc.cypher.runFile: Caused by: java.lang.RuntimeException: Error accessing file /var/lib/neo4j/data/test_export.cypher
I'm hoping someone can help me figure out what may be going on, but I'm not sure what additional info is helpful. No one in the Neo4j slack channel has been able to help find a solution.
Thanks.
Problem 1:
The exported file does not contain any internal Neo4j ids. It is not safe to use Neo4j ids outside the database, since they are not globally unique, so you should not use them to transfer data from one database to another.
If you want globally unique ids, you can use an external plugin like the GraphAware UUID plugin. (Disclaimer: I work for GraphAware.)
Problem 2:
If you cannot access the file, possible reasons are:
apoc.import.file.enabled=true is not set in neo4j.conf
OS-level permission is not set
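For the first reason, a minimal sketch of the relevant neo4j.conf settings (a restart is required; in newer APOC releases these settings may live in apoc.conf instead):
apoc.import.file.enabled=true
apoc.export.file.enabled=true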

Failure on CSV import into Neo4j 2.2.0-RC01

I'm having some weird issues when using the batch load into Neo4j 2.2.0-RC1. I am trying to import 10 different node sets (for different labels) along with 12 relationship files. The data sets vary in size - some node types have ~200-300k records, some are small (50-100 records). For most node types I have a separate file with a header and separate file with data for each of the sets (the data is generated from the DB and I want to be able to regenerate the dump files without worrying about preparing the :ID columns, describing data types etc.)
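The invocation looks roughly like the sketch below (label and file names are illustrative; this is the Neo4j 2.2 neo4j-import syntax with separate header and data files per set):
bin/neo4j-import --into data/graph.db \
  --nodes:Person persons-header.csv,persons.csv \
  --nodes:Company companies-header.csv,companies.csv \
  --relationships:WORKS_AT works-header.csv,works.csv \
  --processors 1 --stacktrace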
I am re-running the import task a number of times (with options --processors 1 --stacktrace) and I keep getting different errors (without a single change in the actual dataset), which makes me think it might be something concurrency-related. Sometimes the import simply hangs with a message like this:
Nodes
[>:36.75 MB/s------------------------|*PROPERTIES-----------------------------------------|NOD|] 0
In most cases, it crashes with an error like the one below, except that the number of nodes it manages to import before failing differs from run to run.
[>:27.23 MB/s-------------|*PROPERTIES--------------------------|NO|v:19.62 MB/s---------------]100kImport error: Panic called, so exiting
java.lang.RuntimeException: Panic called, so exiting
at org.neo4j.unsafe.impl.batchimport.staging.StageExecution.stillExecuting(StageExecution.java:63)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.anyStillExecuting(ExecutionSupervisor.java:79)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.finishAwareSleep(ExecutionSupervisor.java:102)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisor.supervise(ExecutionSupervisor.java:64)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutionSupervisors.superviseDynamicExecution(ExecutionSupervisors.java:65)
at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.executeStages(ParallelBatchImporter.java:226)
at org.neo4j.unsafe.impl.batchimport.ParallelBatchImporter.doImport(ParallelBatchImporter.java:151)
at org.neo4j.tooling.ImportTool.main(ImportTool.java:263)
Caused by: java.lang.RuntimeException: Panic called, so exiting
at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.assertHealthy(AbstractStep.java:189)
at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep.process(ProducerStep.java:77)
at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep$1.run(ProducerStep.java:54)
Caused by: java.lang.IllegalStateException: Nodes for any specific group must be added in sequence before adding nodes for any other group
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.put(EncodingIdMapper.java:137)
at org.neo4j.unsafe.impl.batchimport.NodeEncoderStep.process(NodeEncoderStep.java:76)
at org.neo4j.unsafe.impl.batchimport.NodeEncoderStep.process(NodeEncoderStep.java:41)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.call(ExecutorServiceStep.java:96)
at org.neo4j.unsafe.impl.batchimport.staging.ExecutorServiceStep$2.call(ExecutorServiceStep.java:87)
at org.neo4j.unsafe.impl.batchimport.executor.DynamicTaskExecutor$Processor.run(DynamicTaskExecutor.java:217)
I managed to run it successfully once, which, again, seems to imply that some sort of timing issue is at play.
Unfortunately I cannot provide the datasets as they contain confidential data.
The weirdest thing of all is that if I split the load into 2 different sets (the datasets are almost separate subgraphs, they have only 2 relationships in common) then all works fine (so not likely to be data related), but even loading just nodes doesn't work if I put them all into a single command. And because it's not possible to force a load into an existing database, loading it in 2 steps is sadly not an option.
1) Is that a known issue and if so, any ETA on a fix / issue that I could follow?
2) If not, is there any troubleshooting I can do to get to the bottom of it? The messages.log file in the target DB directory contains VERY little output; it would be nice if I could get some more details on what's going wrong.
I've spotted the problem. Thanks for reporting/asking. The next release will include the fix, along with an additional set of integration tests for the import tool. I'll provide a link to the commit once it's in.

Error while trying to run an ad hoc insert query using voltQueueSQLExperimental

I am getting an error while trying to execute a dynamic insert query in VoltDB using the voltQueueSQLExperimental() function. The SQL is fine, as I ran it separately in the VoltDB web studio. The error is as follows:
Error: VOLTDB ERROR: USER ABORT Attempted to queue DML adhoc sql
'insert into volt_temp_constraints
(asset_id,config_id,session_id,sam_id) values (12,13,'abc',12)' from
read only procedure at
procedures.testPrcUpdateConstraint.run(testPrcUpdateConstraint.java:155)
Please note that the SQL is generated dynamically (ad hoc) and cannot be prepared statically beforehand.
Documentation is not their strength... ;), but I could reproduce your bug.
As I see it, VoltDB marks compiled procedures as read-write or read-only, as one can infer from here. Unfortunately, there currently does not seem to be any way around it other than declaring an INSERT/UPDATE/UPSERT SQLStmt as an object property and simply not using it (see the sketch below).
Maybe you can contact one of the developers to add some way of configuring this.
By the way, the exception is thrown here: https://github.com/VoltDB/voltdb/blob/master/src/frontend/org/voltdb/ProcedureRunner.java at line 620.
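A minimal sketch of that workaround (the class, column, and parameter names are illustrative and taken loosely from the stack trace; voltQueueSQLExperimental is the experimental API the question already uses):
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

public class TestPrcUpdateConstraint extends VoltProcedure {

    // Never queued; declaring a DML statement as a field is what makes the compiler
    // classify this procedure as read-write instead of read-only.
    public final SQLStmt markReadWrite = new SQLStmt(
            "UPDATE volt_temp_constraints SET sam_id = sam_id WHERE asset_id = -1;");

    public VoltTable[] run(long assetId, long configId, String sessionId, long samId) {
        // Build the ad hoc DML at runtime and queue it with the experimental API.
        String sql = "INSERT INTO volt_temp_constraints (asset_id, config_id, session_id, sam_id) "
                + "VALUES (" + assetId + ", " + configId + ", '" + sessionId + "', " + samId + ");";
        voltQueueSQLExperimental(sql);
        return voltExecuteSQL(true);
    }
}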
