Upload data in CKAN automatically using a job scheduler

I have a CKAN website. I upload data manually to the DataStore and it works perfectly. However, my actual requirement is to automate the process. I want a job scheduler that automatically uploads data such as GeoJSON, Excel, CSV, PDF, etc. into the CKAN application.
Please provide inputs
Thanks

You could write a bash (or Python) script that calls the CKAN API using the ckanapi program. Use the action function package_create, or probably more likely resource_create. This example, including uploading the file, is in the ckanapi README:
$ ckanapi action resource_create package_id=my-dataset-with-files \
      upload=@/path/to/file/to/upload.csv \
      url=dummy-value  # ignored but required by CKAN<2.6
If this is a regular, automatable thing then you probably don't want to add a new CKAN dataset each time, because that implies writing the same dataset metadata each time, which doesn't sound helpful for the user - you probably want a new resource each time instead. If the only thing that changes between data files is the date, with everything else the same (the purpose, data structure, method of collection, people involved), then it makes more sense to create a single dataset and add each update as a new resource in it.
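If you'd rather not shell out to the CLI, the same call can be made over plain HTTP from a small script run by cron or any other job scheduler. A rough Ruby sketch (the CKAN URL, the CKAN_API_KEY environment variable and the file path are placeholders):
require 'net/http'
require 'uri'

uri = URI('https://ckan.example.org/api/3/action/resource_create')

request = Net::HTTP::Post.new(uri)
request['Authorization'] = ENV['CKAN_API_KEY']           # your CKAN user's API key
request.set_form(
  [
    ['package_id', 'my-dataset-with-files'],
    ['url', 'dummy-value'],                               # ignored but required by CKAN<2.6
    ['upload', File.open('/path/to/file/to/upload.csv')]
  ],
  'multipart/form-data'
)

response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.request(request)
end
puts response.body   # JSON document with "success": true/false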

Related

Rails save log data to database

Is it possible to access the information being saved to a Rails log file without reading the log file? To be clear, I do not want to ship the log file as a batch process; rather, I want every event that is written to the log file to also be sent, as a background job, to a separate database.
I have multiple apps running in Docker containers and wish to save the log entries of each into a shared telemetry database running on the server. Currently the logs are formatted with Lograge, but I have not figured out how to access this information directly and send it to a background job to be processed (as stated before, I would like direct access to the data being written to the log and to send that via a background job).
I am aware of the command Rails.logger.instance_variable_get(:@logger), however what I am looking for is the actual data being saved to the logs, so I can ship it to a database.
The reasoning behind this is that there are multiple Rails APIs running in Docker containers. I have an after_action set up to run a background job that I hoped would send just the individual log entry, but this is where I am stuck. Sizing isn't an issue, as the data stored in this database is to be purged every 2 weeks. This is more so a tool for the in-house devs to track telemetry through a dashboard. I appreciate you taking the time to respond.
You would probably have to go through your app code and manually save the output from the logger into a table/field in your database inline. Theoretically, any data that ends up in your log should be accessible from within your app.
Depending on how much data you're planning on saving, this may not be the best idea, as it has the potential to grow your database extremely quickly (it's not uncommon for apps to create GBs worth of logs in a single day).
You could write a background job that opens the log files, searches for data, and saves it to your database, but the configuration required for that will depend largely on your hosting setup.
So I got a solution working, and in fairness it wasn't as difficult as I had thought. As I was using the lograge gem for formatting the logs, I created a custom formatter through the guide in this Link.
As I wanted the JSON format, I just copied that formatter, but was able to put in the call to a background job at this point and also cleanse some data I did not want.
module Lograge
  module Formatters
    class SomeService < Lograge::Formatters::Json
      def call(data)
        # Drop the keys we do not want to ship to the telemetry database
        data = data.delete_if do |k|
          [:format, :view, :db].include? k
        end
        # Enqueue the Faktory job that ships the cleaned-up entry
        LogSenderJob.perform_async(data)
        # Lograge::Formatters::Json#call renders the remaining data as JSON for the log line
        super
      end
    end
  end
end
This was just one solution to the problem, made easier because I could get the data formatted via Lograge. Another solution would be to create a custom logger and have it write to a database directly if necessary.
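The LogSenderJob used in the formatter above isn't shown in the original; a minimal sketch of what it might look like, assuming the faktory_worker_ruby gem and a hypothetical TelemetryEntry ActiveRecord model connected to the separate telemetry database:
# app/jobs/log_sender_job.rb
class LogSenderJob
  include Faktory::Job   # assumes faktory_worker_ruby; a Sidekiq worker would look much the same

  def perform(data)
    # data is the filtered hash built in the formatter (serialised to JSON on the queue)
    TelemetryEntry.create!(payload: data)
  end
end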

Upload and process a spreadsheet without storing it permanently

I want to provide an option to import data via a spreadsheet. I don't want to store it permanently (hence there's no need for storage services like S3). What would be the most efficient and scalable way of doing this? Where can I temporarily store this file while it is being processed? Here's what should happen:
User uploads spreadsheet
My backend processes it and updates the DB
Discard the spreadsheet
My 2 requirements are efficiency and scalability.
If I were you, I would look for a way to parse the XLS/CSV on the front end and send JSON to your backend. This way you pass the slow/intensive work to the client (scalability) and process only JSON on the server.
You can start here:
https://stackoverflow.com/a/37083658/1540290
I'm assuming you have a form with a file input to pick the xls file you want to process like this:
<input id="my_model_source" type="file" name="my_model[source]">
To process the xls you could use the roo gem.
Option 1:
In some controller (where you are processing the file) you can receive the file like this: params[:my_model][:source]. This file will be an ActionDispatch::Http::UploadedFile instance. This class has the instance method path that will give you a temp file to work with.
So, with the roo gem, you can read it like this:
xls = Roo::Spreadsheet.open(params[:my_model][:source].path, extension: :xlsx)
Option 2:
Option 1 will work if your importing process is not too heavy.
If it is indeed too heavy, you can use Active Job to handle the processing in the background (see the sketch after this answer).
If you choose Active Job, you:
will lose the opportunity to use ActionDispatch::Http::UploadedFile's path method. You will need to keep your own temporary copy of the file. To achieve this you could use the cp command (or FileUtils.cp) to copy the file at the uploaded file's path wherever you want; after using it, you can delete it with the rm command (or FileUtils.rm).
will lose a real-time response. To handle this you could use the Job Notifier gem.
I have tried to show roughly what paths you can take.
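A rough sketch of Option 2, assuming Rails with Active Job and the roo gem; the controller, job and model names (MyModelsController, ImportSpreadsheetJob, MyModel) and the column mapping are placeholders:
# app/controllers/my_models_controller.rb
class MyModelsController < ApplicationController
  def import
    uploaded = params[:my_model][:source]                 # ActionDispatch::Http::UploadedFile
    tmp_path = Rails.root.join('tmp', "import-#{SecureRandom.uuid}.xlsx").to_s
    FileUtils.cp(uploaded.path, tmp_path)                 # keep our own copy; the request tempfile is short-lived
    ImportSpreadsheetJob.perform_later(tmp_path)
    head :accepted
  end
end

# app/jobs/import_spreadsheet_job.rb
class ImportSpreadsheetJob < ApplicationJob
  queue_as :default

  def perform(path)
    xlsx = Roo::Spreadsheet.open(path, extension: :xlsx)
    xlsx.each_row_streaming(offset: 1) do |row|           # offset: 1 skips the header row
      MyModel.create!(name: row[0].value, amount: row[1].value)
    end
  ensure
    FileUtils.rm_f(path)                                  # discard the spreadsheet once processed
  end
end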

CKAN upload custom format tsv on schedule

I am trying to upload custom-formatted data files from the UK climate site, e.g. this file. There are 5 lines of metadata and 1 header line.
1) Can CKAN preprocess the file according to a format I give it, so that only the data are picked up, possibly saving the metadata in the description?
I would prefer a frontend option because I want users to be able to do this themselves.
2) Is it possible to have a dataset uploaded automatically once the URL is entered? I currently have to go to the Manage -> DataStore page and click on "Upload to DataStore" to have the data populated.
3) Can the dataset be updated at a regular interval?
Thanks
1) Not currently. Doing ETL on incoming data is something that has been discussed a lot recently, so it may happen soon.
2) You shouldn't have to manually trigger a load into the DataStore. Is this when creating a new resource, or when you're editing an existing resource? When editing a resource, I believe the load is only triggered if the URL changes.
3) You can use https://github.com/ckan/ckanext-harvest to have data pulled into CKAN on a regular schedule - there are harvesters for various different stores, so it depends on where it is updated from.

MVC Azure storage, auto-delete storage after a certain time

I'm developing an Azure website where users can upload blobs and metadata. I want the uploaded content to be deleted after some time.
The only way I can think of is going for a cloud app instead of a website, with a worker role that checks, say, every hour whether an uploaded file has expired and, if so, deletes it. However, I'm going for a simple website here without worker roles.
I have a function that checks whether an uploaded item should be deleted, and if the user does something on the page I can easily call this function, BUT if the user isn't doing anything and the time runs out, it won't be deleted because the user never calls the function. The storage will never be deleted. How would you solve this?
Thanks
Too broad to give one right answer, as you can solve this in many ways. But, from an objective perspective, because you're using Web Sites I do suggest you look at WebJobs and see whether this might be the right tool for you (it gives you the ability to run periodic jobs without the bulk of extra VMs in a web/worker configuration). You'll still need a way to manage your metadata to know what to delete.
Regarding other Azure-specific built-in mechanisms, you can also consider queuing delete messages, with an invisibility time equal to the time the content is to be available. After that time expires, the queue message becomes visible, and any queue consumer would then see the message and be able to act on it. This can be your Web Job (which has SDK support for queues) or really any other mechanism you build.
Again, a very broad question with no single right answer, so I'm just pointing out the Azure-specific mechanisms that could help solve this particular problem.
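As a rough illustration of the queue idea, assuming the azure-storage-queue gem (the queue name, account details and the one-hour lifetime are placeholders, and the invisibility window must stay within the message's time-to-live):
require 'azure/storage/queue'

queues = Azure::Storage::Queue::QueueService.create(
  storage_account_name: 'myaccount',
  storage_access_key:   ENV['AZURE_STORAGE_KEY']
)

# At upload time: enqueue a delete request that stays invisible for as long
# as the content should remain available (here one hour).
queues.create_message('delete-requests', 'container-name/blob-name.jpg',
                      visibility_timeout: 3600)

# In the WebJob (or any other consumer): any message that has become visible is overdue.
queues.list_messages('delete-requests', 60).each do |msg|   # 60-second processing lease
  container, blob = msg.message_text.split('/', 2)
  # delete the blob named by the message here, then remove the queue message
  queues.delete_message('delete-requests', msg.id, msg.pop_receipt)
end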
Like David said in his answer, there can be many solutions to your problem. One solution could be to rely on blob storage itself. In this approach you periodically fetch the list of blobs in the blob container and decide whether each blob should be removed or not. The periodic fetching could be done through an Azure WebJob (if the application is deployed as a website) or through an Azure worker role. The worker role approach is independent of how your main application is deployed: it could be deployed as a cloud service or as a website.
With that, there are two possible approaches you can take:
Rely on the blob's Last Modified date: Whenever a blob is updated, its Last Modified property gets updated. You can use that to identify whether the blob should be deleted (a rough sketch of this follows below). This approach works best if the uploaded blob is never modified.
Rely on Blob's custom metadata: Whenever a blob is uploaded, you could set the upload date/time in blob's metadata. When you fetch the list of blobs, you could compare the upload date/time metadata value with the current date/time and decide if the blob should be deleted or not.
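A rough sketch of the first approach, assuming the azure-storage-blob gem; the container name, account details and the two-day retention period are placeholders, and pagination of the blob listing is omitted:
require 'azure/storage/blob'
require 'time'

blobs = Azure::Storage::Blob::BlobService.create(
  storage_account_name: 'myaccount',
  storage_access_key:   ENV['AZURE_STORAGE_KEY']
)

MAX_AGE = 2 * 24 * 60 * 60   # keep uploads for two days

blobs.list_blobs('uploads').each do |blob|
  last_modified = Time.parse(blob.properties[:last_modified])
  blobs.delete_blob('uploads', blob.name) if Time.now - last_modified > MAX_AGE
end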
Another approach might be to use the container name as the "expiry date".
This might make deletion easier, as you could then just remove expired containers.

Attaching/uploading files to not-yet-saved Note - what is best strategy for this?

In my application, I have a textarea input where users can type a note.
When they click Save, there is an AJAX call to Web Api that saves the note to the database.
I would like for users to be able to attach multiple files to this note (Gmail style) before saving the Note. It would be nice if the upload could start as soon as attached, before saving the note.
What is the best strategy for this?
P.S. I can't use jQuery fineuploader plugin or anything like that because I need to give the files unique names on the server before uploading them to Azure.
Is what I'm trying to do possible, or do I have to make the whole 'Note' a normal form post instead of an API call?
Thanks!
This approach is file-based, but you can apply the same logic to Azure Blob Storage containers if you wish.
What I normally do is give the user a unique GUID when they GET the AddNote page. I create a folder called:
C:\TemporaryUploads\UNIQUE-USER-GUID\
Then any files the user uploads at this stage get assigned to this folder:
C:\TemporaryUploads\UNIQUE-USER-GUID\file1.txt
C:\TemporaryUploads\UNIQUE-USER-GUID\file2.txt
C:\TemporaryUploads\UNIQUE-USER-GUID\file3.txt
When the user does a POST and I have confirmed that all validation has passed, I simply copy the files to the completed folder, with the newly generated note ID:
C:\NodeUploads\Note-100001\file1.txt
Then delete the C:\TemporaryUploads\UNIQUE-USER-GUID folder
Cleaning Up
Now, that's all well and good for users who actually go ahead and save a note, but what about the ones who uploaded a file and closed the browser? There are two options at this stage:
Have a background service clean up these files on a scheduled basis - daily, weekly, etc. This would be a job for Azure WebJobs (see the sketch after this list).
Clean up the old files via the web app each time a new note is saved. Not a great approach, as you're doing file I/O when there are potentially no files to delete.
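A minimal sketch of the first option, using the folder layout from the answer above; the one-day retention period is an arbitrary placeholder, and the same logic applies whatever language the scheduled WebJob is written in:
require 'fileutils'

TEMP_ROOT = 'C:/TemporaryUploads'
MAX_AGE   = 24 * 60 * 60   # delete abandoned upload folders after one day

Dir.glob(File.join(TEMP_ROOT, '*')).each do |dir|
  next unless File.directory?(dir)
  FileUtils.rm_rf(dir) if Time.now - File.mtime(dir) > MAX_AGE
end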
Building on RGraham's answer, here's another approach you could take:
Create a blob container for storing note attachments. Let's call it note-attachments.
When the user comes to the screen of creating a note, assign a GUID to the note.
When the user uploads a file, you just prefix the file name with this note id. So if a user uploads a file, say file1.txt, it gets saved into blob storage as note-attachments/{note id}/file1.txt.
Depending on your requirements, once you save the note you may move this blob to another blob container or keep it where it is. Since the blob has the note id in its name, searching for a note's attachments is easy.
For uploading files, I would recommend doing it directly from the browser to blob storage, making use of AJAX, CORS and Shared Access Signatures. This way you avoid the data going through your servers. You may find these blog posts useful:
Revisiting Windows Azure Shared Access Signature
Windows Azure Storage and Cross-Origin Resource Sharing (CORS) – Lets Have Some Fun
