I am new to Azure Data Factory.
We have a .CSV file, and we need to use OData and load it into a Snowflake table.
I created 2 datasets, 2 linked services, and 1 pipeline.
I loaded the data into the Snowflake table (screenshot attached).
The issue I am facing now is that there is one comment column with the DATE format "DD/MM/YYYY".
The date column is coming through as NULL in the output.
When I view the data in the source, I can see the date value, but in the Derived expression builder the date value is NULL.
The flow is:
OData -> Blob Storage (JSON)
JSON -> Derived -> Target
Any help would be appreciated.
Thanks,
When I pull data using OData in Excel, the time I get is 3 hours behind the one in the instance.
Is there a way I can get the exact time shown in the instance?
Comparison of instance time and Excel time (picture)
I tried to check the server time to make sure it's the same as the one on the instance. My SQL server is in UTC. Could this be the reason why? It seems OData might be fetching data directly from the database through the generic inquiry.
The problem is that the records were saved in the wrong timezone (server or SQL).
In the GI, take the column that is wrong and add this:
=DateAdd(today(), 'h', 3)
Replace 'today()' with the field you need; the data will then display as you want.
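For anyone who would rather apply the same shift after the data has been pulled, here is a minimal Python sketch of the idea. It assumes the stored values are UTC and the instance displays them at a fixed +3 hour offset; the offset, the function name and the sample timestamp are all illustrative and not taken from the question.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the instance shows times three hours ahead of UTC.
INSTANCE_OFFSET = timezone(timedelta(hours=3))


def to_instance_time(utc_value: datetime) -> datetime:
    """Convert a UTC timestamp (naive or aware) to the assumed instance offset."""
    if utc_value.tzinfo is None:
        # Treat naive values as UTC, since the SQL server is said to be in UTC.
        utc_value = utc_value.replace(tzinfo=timezone.utc)
    return utc_value.astimezone(INSTANCE_OFFSET)


if __name__ == "__main__":
    stored = datetime(2021, 5, 10, 9, 30)  # hypothetical value as stored in UTC
    print(to_instance_time(stored).strftime("%Y-%m-%d %H:%M"))
```

A fixed offset is the simplest case. If the instance's timezone observes daylight saving time, a named zone (for example via `zoneinfo`) would be a safer choice than a constant number of hours.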
I have a ~1 GB text file with 153 separate fields. I uploaded the file to GCS and then created a new table in BQ with file format as "CSV". For table type, I selected "native table". For schema, I elected to auto-detect. For the field delimiter, I selected "tab". Upon running the job, I received the following error:
Could not parse '15229-1910' as INT64 for field int64_field_19 (position 19) starting at location 318092352 with message 'Unable to parse'
The error originates from a "zip code plus 4" field. My question is whether there is a way to prevent BigQuery from parsing this value as an integer, or a way to skip these parse errors altogether so that the job can complete. GCP's documentation advises: "If BigQuery doesn't recognize the format, it loads the column as a string data type. In that case, you might need to preprocess the source data before loading it". The "zip code plus four" field in my file is already assigned a string field type, so I'm not quite sure where to go from here. Given that I selected "tab" as the delimiter, does that indicate that the "zip code plus four" value contains a tab character?
BigQuery uses schema auto-detection to infer the schema of a table while loading data. Based on the sample data you provided, BigQuery will treat the pincode as a string value because of the dash "-" between the integer values. If you want to provide the schema yourself, you can turn off auto-detect and specify the schema manually.
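If you go the manual schema route, a small pre-scan of the file can tell you which columns are safe to declare as integers. The sketch below is only an illustration using the Python standard library: the column names and sample rows are made up, and it only distinguishes INTEGER from STRING. It emits the name/type/mode list that a JSON schema file uses.

```python
import csv
import io
import json
import re

INT_PATTERN = re.compile(r"[+-]?[0-9]+")


def infer_schema(tsv_text: str) -> list:
    """Mark a column INTEGER only if every non-empty value is a plain integer."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    is_int = [True] * len(header)
    for row in reader:
        # Ignore any extra trailing cells beyond the header width.
        for i, value in enumerate(row[: len(header)]):
            if is_int[i] and value and not INT_PATTERN.fullmatch(value):
                is_int[i] = False
    return [
        {"name": name, "type": "INTEGER" if flag else "STRING", "mode": "NULLABLE"}
        for name, flag in zip(header, is_int)
    ]


if __name__ == "__main__":
    # Hypothetical sample: the second row is what breaks an INT64 guess.
    sample = "id\tzip_plus4\n1\t90210\n2\t15229-1910\n"
    print(json.dumps(infer_schema(sample), indent=2))
```

Note that a zip code column should usually be declared as STRING even when every value happens to be numeric, because leading zeros would otherwise be lost.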
As stated in the comment, you can try the following steps to upload your 1 GB text file into BigQuery:
Assuming your data is in CSV format, as mentioned in the question, I mocked up the given sample data in an Excel sheet.
Excel Sheet
Save the file in .tsv format.
You can then upload the file into BigQuery using schema auto-detection with tab set as the delimiter. It detects all the field types without any error, as can be seen in the BigQuery table in the screenshot.
BigQuery Table
I have a requirement for the final delimited text output to contain only dates. However, the JSON that I am reading as the source in my ADF Copy Activity has the datetime fields in the form "hireDate": "1988-03-28T00:00:00+00:00", for example; I need these to be "1988-03-28". Any suggestions on doing this in the ADF mapping? (We don't have Data Flow Mapping in the government Azure Cloud.)
Thank you
Mike Kiser
There is no other choice; we need an intermediate process. For example, we can copy the JSON file into an Azure SQL table, then copy from Azure SQL into a text file.
Create a table and set the column type to date.
create table dbo.data1(
hireDate date
)
Copy the data into the table. The column type automatically casts the value from datetime to date.
The debug result is as follows:
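Separately from the ADF copy itself, the effect of the cast can be illustrated in a few lines of Python. This is only a sketch of the string transformation, not something ADF runs; the function name is made up, and the sample value is the one from the question.

```python
from datetime import datetime


def to_date_only(value: str) -> str:
    """Parse an ISO 8601 datetime string and return just its date part."""
    # The date is taken as written in the source offset; no timezone conversion.
    return datetime.fromisoformat(value).date().isoformat()


if __name__ == "__main__":
    print(to_date_only("1988-03-28T00:00:00+00:00"))
```

Because the date is read in the offset given in the string, a value such as 23:30 at +00:00 stays on the same calendar day rather than being shifted into local time.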
I'm trying to set up a Dataflow SQL job to run a query in BigQuery and publish the results to a Pub/Sub topic. I'm not using a Dataflow template; I'm using GCP's Dataflow SQL UI to write a query and configure the output, i.e. a Pub/Sub topic.
The table I'm querying contains String, Date, Timestamp, and Numeric types.
Even if I don't select the column with 'Numeric' data type, I still get a validation error in the editor - unsupported column type NUMERIC.
Is there a way to get around this in Dataflow SQL? Or is it simply that the source table can't have columns of the NUMERIC type?
Numeric types in Dataflow SQL are INT64 and FLOAT64 (8 bytes) but not NUMERIC (16 bytes).
I reproduced your issue on my end and it certainly looks like the table cannot be loaded in the first place, even if you are not selecting the NUMERIC column.
Hi, after performing a group by key on a KV PCollection, I need to:
1) Make every element in that PCollection a separate individual PCollection.
2) Insert the records in those individual PCollections into a BigQuery Table.
Basically my intention is to create a dynamic date partition in the BigQuery table.
How can I do this?
An example would really help.
For Google Dataflow to perform the massive parallelisation that makes it one of a kind (as a service on the public cloud), the job flow needs to be defined before it is submitted to the Google Cloud console. Every time you execute the jar file that contains your pipeline code (which includes the pipeline options and the transforms), a JSON file describing the job is created and submitted to Google Cloud Platform. The managed service then uses this file to execute your job.
The use case in the question requires the input PCollection to be split into as many PCollections as there are unique dates. For that split, the tuple tags would have to be created dynamically, which is not possible at this time. Creating tuple tags dynamically is not allowed because it doesn't help in creating the job description JSON file and defeats the whole design and purpose with which Dataflow was built.
I can think of a couple of solutions to this problem (each with its own pros and cons):
Solution 1 (a workaround for the exact use case in the question):
Write a dataflow transform that takes the input PCollection and for each element in the input -
1. Checks the date of the element.
2. Appends the date to a pre-defined BigQuery table name as a decorator (in the format yyyyMMdd).
3. Makes an HTTP request to the BQ API to insert the row into the table, using the table name with the decorator appended.
You will have to consider the cost of this approach, because there is a single HTTP request for every element rather than the BQ load job that would have been used with the BigQueryIO Dataflow SDK module.
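Steps 1 and 2 of this workaround can be sketched in plain Python as follows. The table name, the `event_date` field and the sample rows are hypothetical, and the sketch deliberately stops before the HTTP call in step 3, so it only shows how the decorated table name is built.

```python
from collections import defaultdict
from datetime import date

BASE_TABLE = "my_dataset.events"  # hypothetical pre-defined table name


def decorated_table(base: str, day: date) -> str:
    """Append the partition decorator, e.g. my_dataset.events$20170305."""
    return f"{base}${day.strftime('%Y%m%d')}"


def group_rows_by_partition(rows: list) -> dict:
    """Bucket rows under the decorated table name their date maps to."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[decorated_table(BASE_TABLE, row["event_date"])].append(row)
    return dict(grouped)


if __name__ == "__main__":
    rows = [
        {"id": 1, "event_date": date(2017, 3, 5)},
        {"id": 2, "event_date": date(2017, 3, 6)},
        {"id": 3, "event_date": date(2017, 3, 5)},
    ]
    for table, batch in group_rows_by_partition(rows).items():
        print(table, [r["id"] for r in batch])
```

Bucketing rows by decorated table name before sending them is one possible way to reduce the number of requests, since each insert request could then carry several rows for the same partition. That is a variation on the per-element approach described above, not part of it.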
Solution 2 (best practice that should be followed in this type of use case):
1. Run the Dataflow pipeline in streaming mode instead of batch mode.
2. Define a time window of whatever size suits the scenario in which it is being used.
3. For the `PCollection` in each window, write it to a BQ table with the decorator being the date of the time window itself.
You will have to consider rearchitecting your data source to send data to Dataflow in real time, but you will have a dynamically date-partitioned BigQuery table, with the results of your data processing being near real time.
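To make the windowing idea concrete, here is a small standard-library sketch of how an element's timestamp maps to a fixed one-day window, and how that window maps to a decorator. It does not use the Dataflow SDK; the window size, the table name and the helper names are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
WINDOW_SIZE = timedelta(days=1)  # assumed window size
BASE_TABLE = "my_dataset.events"  # hypothetical table name


def window_start(ts: datetime, size: timedelta = WINDOW_SIZE) -> datetime:
    """Return the start of the fixed window containing a timezone-aware timestamp."""
    whole_windows = (ts - EPOCH) // size
    return EPOCH + whole_windows * size


def table_for(ts: datetime) -> str:
    """Name the target table after the date of the element's window."""
    return f"{BASE_TABLE}${window_start(ts).strftime('%Y%m%d')}"


if __name__ == "__main__":
    late = datetime(2017, 3, 5, 23, 59, tzinfo=timezone.utc)
    early = datetime(2017, 3, 6, 0, 0, 1, tzinfo=timezone.utc)
    print(table_for(late))
    print(table_for(early))
```

In a real streaming pipeline the SDK's own windowing transform would do the bucketing; the sketch only shows why every element in a given window ends up with the same decorator.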
References -
Google Big Query Table Decorators
Google Big Query Table insert using HTTP POST request
How job description files work
Note: Please mention in the comments and I will elaborate the answer with code snippets if needed.