Find the file containing an error record while processing many files in the same bucket with the Apache Beam Java SDK - google-cloud-dataflow

I have 20 CSV files in the same bucket. I am able to read all the files in one go and load them into BigQuery. But when there is a data type mismatch, I am able to route that row into invalidDataTag, whereas I am unable to find the name of the file that has the error record.
inputFilePattern is gs://bucket-name/*, which picks up all the files present under the bucket. I am reading the files as below:
PCollection<String> sourceData = pipeline.apply(Constants.READ_CSV_STAGE_NAME, TextIO.read().from(options.getInputFilePattern()));
Is there a way I can find the name of the file that the error row came from?

My suggestion would be to add a column to the BigQuery table that indicates which file the record came from.
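For illustration, here is a rough sketch of that approach, reading with FileIO instead of TextIO so that every line stays paired with the name of the file it came from. Only pipeline, options and getInputFilePattern() come from the question; the stage names and the rest are assumptions, not the asker's actual code.

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Each element is (file name, CSV line) rather than just the line.
PCollection<KV<String, String>> sourceData =
    pipeline
        .apply("MatchFiles", FileIO.match().filepattern(options.getInputFilePattern()))
        .apply("ReadMatches", FileIO.readMatches())
        .apply("ReadLinesWithFileName",
            ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
              @ProcessElement
              public void processElement(ProcessContext c) throws java.io.IOException {
                FileIO.ReadableFile file = c.element();
                String fileName = file.getMetadata().resourceId().toString();
                // Fine for small CSVs: readFullyAsUTF8String() loads the whole file into memory.
                for (String line : file.readFullyAsUTF8String().split("\r?\n")) {
                  c.output(KV.of(fileName, line));
                }
              }
            }));

The parsing step can then carry the key along with each parsed row, so the rows that end up in invalidDataTag (and the extra column in the BigQuery table) keep the name of the file they came from.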

Related

How to output results to only one static Excel file in Mapforce?

I am trying to check log files for errors and write the name of each file into an Excel register file.
Basically, if I have 60 log files and 30 of them have errors, I want 30 rows added to one Excel file, and each time I run this mapping new rows should be added to that same Excel file.
How can I do that? So far I have tried a couple of things, but none have actually done anything for me:
I have tried specifying the Excel file in the Outputfilename box, and I added Sheet1 and the workbook name, but nothing happened.

Informatica Cloud (IICS) filename using wildcard

In the Data Integration module, I have created a mapping to load data from a CSV file to an Oracle table. I want to give a file pattern, as the file name will have a date in it. When I try to provide a file pattern in the Source object, it throws the below error.
If someone can assist in letting me know how to load a file with a file pattern, it would be very helpful.
Please let me know if you need any further details.
Try using a File Listener as the source for this mapping. In the File Listener settings you can provide the pattern, and in turn the File Listener will trigger the mapping with the file it finds.

NEO4J: Couldn't load the external resource at: file:/var/lib/neo4j/import/

I am running Neo4J on Docker within Vagrant.
I am attempting to LOAD CSV WITH HEADERS from a file within the /import/ directory (I had to move my file there) via a cURL request. My request looks something like this:
"LOAD CSV WITH HEADERS FROM \"file:///insert-neo4j.csv\" AS row ...
This provides me with the following error:
{"results":[],"errors [{"code":"Neo.ClientError.Statement.ExternalResourceFailed","message":"Couldn't load the external resource at: file:/var/lib/neo4j/import/insert-neo4j.csv"}]}
It is often suggested to me that I append the following to my '/conf/neo4j.conf' file, however this file DOES NOT EXIST, and creating it manually does not seem to work...
dbms.directories.import=import
dbms.security.allow_csv_import_from_file_urls=true
So I created the file /conf/neo4j.conf with the above variables, and I also tried adding these as environment variables to my docker-compose file. I continue to have no luck loading the CSV this way.
My questions are:
Is there anything blatantly wrong with this implementation?
Why does my /conf/neo4j.conf file NOT exist and how can I get it created?
Thank you
(P.S. my insert-neo4j.csv has permissions -rwxr-xr-x)
The error message indicates that it found the file but that there is an error in the CSV itself ... most likely the formatting. Check this, and if you can't see it, please post a few rows of it, including the header, so we might help.
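On the separate point of the missing config file: the official Neo4j Docker image is normally configured through environment variables rather than a hand-edited /conf/neo4j.conf, and a common pitfall is the variable naming (prefix NEO4J_, dots become underscores, existing underscores are doubled). A hedged sketch of the docker-compose environment entries for the two settings quoted in the question, assuming those are the ones you need:

environment:
  - NEO4J_dbms_directories_import=import
  - NEO4J_dbms_security_allow__csv__import__from__file__urls=true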

How to use Flume for uploading zip files to an HDFS sink

I am new to Flume. My Flume agent has an HTTP server as its source, from which it gets zip files (compressed XML files) at regular intervals. These zip files are very small (less than 10 MB), and I want to put the extracted contents of the zip files into the HDFS sink. Please share some ideas on how to do this. Do I have to go for a custom interceptor?
Flume will try to read your files line by line, unless you configure a specific deserializer. A deserializer lets you control how the file is parsed and split into events. You could of course follow the example of the blob deserializer, which is designed for PDFs and such, but I understand that you actually want to unpack the files and then read them line by line. In that case you would need to write a custom deserializer which reads the zip and emits line-by-line events (a rough sketch follows the documentation link below).
Here's the reference in the documentation:
https://flume.apache.org/FlumeUserGuide.html#event-deserializers
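To make that concrete, here is a rough, untested sketch of such a custom deserializer. The class and package names are mine, nothing comes from the question, and mark/reset handling is simplified, so treat it as a starting point rather than a drop-in implementation. It adapts Flume's ResettableInputStream to a plain InputStream, decompresses it with ZipInputStream, and emits one event per line.

package com.example.flume;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipInputStream;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.serialization.EventDeserializer;
import org.apache.flume.serialization.ResettableInputStream;

public class ZipLineDeserializer implements EventDeserializer {

  private final ResettableInputStream in;
  private final BufferedReader reader;

  ZipLineDeserializer(Context context, ResettableInputStream in) throws IOException {
    this.in = in;
    // Adapt Flume's ResettableInputStream to a plain InputStream so that
    // ZipInputStream can decompress it.
    InputStream adapter = new InputStream() {
      @Override public int read() throws IOException { return in.read(); }
      @Override public int read(byte[] b, int off, int len) throws IOException {
        return in.read(b, off, len);
      }
    };
    ZipInputStream zip = new ZipInputStream(adapter);
    zip.getNextEntry(); // position on the first entry (the XML file) inside the zip
    this.reader = new BufferedReader(new InputStreamReader(zip, StandardCharsets.UTF_8));
  }

  @Override
  public Event readEvent() throws IOException {
    // One Flume event per decompressed line; null signals end of input.
    String line = reader.readLine();
    return line == null ? null : EventBuilder.withBody(line, StandardCharsets.UTF_8);
  }

  @Override
  public List<Event> readEvents(int numEvents) throws IOException {
    List<Event> events = new ArrayList<>(numEvents);
    for (int i = 0; i < numEvents; i++) {
      Event event = readEvent();
      if (event == null) {
        break;
      }
      events.add(event);
    }
    return events;
  }

  // Caveat: a compressed stream cannot really be re-positioned mid-entry, so this
  // sketch simply delegates mark/reset to the wrapped stream.
  @Override public void mark() throws IOException { in.mark(); }
  @Override public void reset() throws IOException { in.reset(); }

  @Override
  public void close() throws IOException {
    reader.close();
    in.close();
  }

  /** The builder that a source's deserializer property would point at. */
  public static class Builder implements EventDeserializer.Builder {
    @Override
    public EventDeserializer build(Context context, ResettableInputStream in) {
      try {
        return new ZipLineDeserializer(context, in);
      } catch (IOException e) {
        throw new RuntimeException("Unable to open zip stream", e);
      }
    }
  }
}

A deserializer like this is wired in through the deserializer property of a source that supports it (for example a spooling directory source), pointing at the builder, e.g. deserializer = com.example.flume.ZipLineDeserializer$Builder; whether your particular source supports custom deserializers should be checked against the user guide linked above.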

Read data from External data sheet - eggPlant

Is there a way to read data from an external data sheet, such as an Excel or text file, in eggPlant?
When running the same script for various sets of input parameters, this would be useful for me instead of hardcoding the values.
-Siva
Since this is the most viewed Eggplant question, I'll give it a more robust answer.
Yes! Using data from a data file is a fantastic way to parameterize your test without hardcoding!
Saving Data
To do so, you have to save your data in .csv or .txt format, within the Suite's Resources directory. This allows you to open and interact with it from within Eggplant Functional.
Importing Data
In your script, you can reference these data files with just their filename. For example,
put ResourcePath("myData.txt") into FilePath
will put the full path of myData.txt in the Resources directory into the variable FilePath.
Accessing Data
You can then access each line of that file as you would any other file.
put line 1 of file FilePath into Name
put line 2 of file FilePath into DOB
If you save your data as a .csv, you can specify a row and column of a specific piece of data.
put item 2 in line 1 of file FilePath into Last_Name
Read more about reading files in the Eggplant Documentation!
For more complicated resource files, read this page in the Eggplant Documentation!
1. Enter the data in the Excel sheet and save it as a CSV file.
2. Piece of code:
repeat with theData = each line of file "D:\TestData.csv"
  log item 1 of theData
end repeat
