Azure Data Lake Store concurrency - webhdfs

I've been toying with Azure Data Lake Store, and in the documentation Microsoft claims the system is optimized for low-latency small writes to files. To test this, I tried performing a large number of writes to a single file from parallel tasks, but this approach fails in most cases with a Bad Request. This link https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf shows that HDFS isn't designed to handle concurrent appends to a single file, so I tried again using the ConcurrentAppendAsync method found in the API, but although that method doesn't crash, my file is never modified in the store.

What you have found is correct about how parallel writes work. I am assuming you have already read the documentation of ConcurrentAppendAsync.
So, in your case, did you use the same file for the WebHDFS write test and for ConcurrentAppendAsync? If so, ConcurrentAppendAsync will not work, as mentioned in the documentation, but you should have received an error in that case.
In any case, let us know what happened and we can investigate further.
Thanks,
Sachin Sheth
Program Manager - Azure Data Lake

Related

How to handle SAP Kapsel Offline app OData conflicts properly?

I built an app that can store OData offline by using the SAP Kapsel plugins.
It is more or less the same as the apps generated by Web IDE, or similar to the apps in this example: https://blogs.sap.com/2017/01/24/getting-started-with-kapsel-part-10-offline-odatasp13/
Now I am at the point of checking the error-resolution potential. I created a sync conflict (changing data on the server after the offline database was stored, then changing something in the app and starting a flush).
As mentioned in the documentation, I can see the error in the ErrorArchive and can also see some details. But what I am missing is the "current" data in the database.
In the error details I can only see the data on the device, not the data changed on the server.
For example:
Device is loading some names into offline store
Device is offline
User A is changing some names
User B is changing one of these names directly online
User A is online again and starts a sync
User A is now informed about the entity that was changed BUT:
not the content user B entered
I just see the "offline" data.
Is there a solution to see the "current" and the "offline" one in a kind of compare view?
Please also note that the server communication is done by the Kapsel plugin and not with normal AJAX calls. Normal AJAX calls could be an alternative, but I am wondering whether there is a smarter way supported by the API.
Meanwhile I have figured out how to load the online data (manually).
This can be done by switching the HTTP handler back to the normal one:
sap.OData.removeHttpClient();
sap.OData.applyHttpClient();
Anyhow, this does not look like a proper solution, and I also have an issue with the conflict log itself: it must be deleted before any refresh can be applied.
I could not find any proper documentation for that. ETag handling is also hardly described in the SAPUI5 and SAP Kapsel documentation.
This question is a really tricky one, due to its implications. I understand that you are simulating a synchronization error due to concurrent modification, and want to know if there is a way for the client to obtain the "current" server state in order to give the user a means to compare the local and server state.
First, let me give you the short answer: No, there is no way for the client to see the current server state "for reference" via the Offline APIs when there are synchronization errors. Doing an online query as outlined above might work, but it certainly is a bad idea.
Now for the longer answer, which explains why this is not necessarily a defect and why I said there are quite some implications to the answer.
Types of Synchronization Errors
We distinguish a number of synchronization errors, and in this context, we are clearly dealing with business-related issues. There are two subtypes here: Those that the user can correct, e.g. validation errors, and those that are issues in the business process itself.
If the user violates the input range, e.g. by putting a negative price for a product, the server would reply with the corresponding message: "-1 is not a valid input value for 'Price'". You, as a developer, can display such messages to the user from the error archive, and the ensuing fix is indeed a very easy one.
Now when we talk about concurrent modification, things get really, really nasty. In fact, I like to say that in this case there is an issue with the business process, because on one hand we allow data to get out of sync, and on the other hand the process allows multiple users to manipulate the same piece of information. How all relevant users should be notified and brought back in sync is no longer just a technical detail, but in fact a new business process. There is simply no way to generically devise how to handle this case. In most cases, it would involve back-office experts who need to decide how the changes should be merged.
A Better Solution
Angstrom pointed out that there is no way to manipulate ETags on the client side, and you should in fact not even think about it. ETags work like version numbers in optimistic locking scenarios, and changing the ETag basically means "Just overwrite what's on the server". This is a no-go in serious scenarios.
An acceptable workaround would be the following:
Make sure the server returns verbose error messages so that the user can see what happened and what caused the conflict.
If that does not help, refresh the data. This will get you an updated ETag, and merge the local changes into the "current" server state, but only locally. "Merging" really means that local changes always overwrite remote changes.
The user now has another opportunity to review the data and can submit it again.
A Good Solution
Better is not necessarily good, so here is what you should really do: never let concurrent modification happen in the first place, because it is really expensive to handle. This implies that it is not the developer who should address this issue; the business needs to change the process.
The right question to ask is, "When you replicate data in a distributed system, why do you allow it to be modified concurrently at all?" Typically stakeholders will not like this kind of question, and the appropriate reaction is to work out a conflict resolution process together with them. Only then will they realize how expensive fixing that kind of desynchronization is, and more often than not they will see that adjusting the process is far cheaper than insisting on yet another back-office process to fix the issues it causes. Even if they insist that there is a need for this concurrent modification, they will now understand that it is not your task to sort this out and that they need to invest in a conflict resolution process.
TL;DR
There is no way to compare the client state to the server state on the client, but you can do a refresh to retain the local changes and get an updated ETag. The real solution, however, is to rework the business process, because this is no longer a purely technical issue.
The default behavior is that SMP or HCPms detects conflicts via ETags. On the client side there is no API to manipulate ETags in case of conflicts. A potential solution to implement a kind of diff view on the device would work like this:
Show the errors
Cache the errors (maybe only in memory?)
Delete the errors
Do a refresh of the database
Build a diff view from the current data and the cached errors
The idea with
sap.OData.removeHttpClient();
sap.OData.applyHttpClient();
could also work, but it may be very tricky and introduce side effects.
Maybe some requests are triggered against the "wrong" backend.

Ignore/skip GCS input files that don't exist

Our requirement is to process the last 24 hours of adserving logs that Google DFP writes directly to our GCS bucket.
We currently achieve this by using a Flatten, and passing in all the file names for the last 24 hours. The file names are in yyyyMMdd_hh format.
But we've identified that sometimes DFP fails to write a file for some of the hours. We've raised that issue with the DFP team.
However, is there a way to configure our Dataflow job to ignore any missing GCS files, and not fail in that case? It currently fails if one or more files don't exist.
Using Dataflow APIs like TextIO.Read or AvroIO.Read to read from a non-existent file will, of course, throw an error and cause the pipeline to fail. This is working as intended and I cannot think of a workaround.
Now, reading from a filepattern like yyyyMMdd_* may solve your problem, at least partially. Dataflow will expand the filepattern into a set of files and process them. As long as at least one file exists that matches the pattern provided, the pipeline should proceed.
The approach of having one source per file is often an anti-pattern: it is less efficient and less elegant, though functionally the same. Nevertheless, it gives you a way to fix this: use the Google Cloud Storage API before constructing your Dataflow pipeline to confirm the presence of each file, and simply skip generating a source for any input file that is not present.
Either way, please keep in mind the eventually consistent behavior of the GCS list API. This means that expanding a file pattern may not immediately include all files that are otherwise already readable. The one-source-per-file anti-pattern may be a good workaround for this case, however.
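To make that existence check concrete, here is a rough sketch in Java; the bucket and object names are placeholders and it assumes the google-cloud-storage client library, so treat it as an illustration rather than a drop-in implementation:

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.util.ArrayList;
import java.util.List;

public class ExistingInputs {
    // Returns only those gs:// paths whose objects are actually present in the bucket.
    public static List<String> keepExisting(String bucket, List<String> objectNames) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        List<String> existing = new ArrayList<>();
        for (String name : objectNames) {
            Blob blob = storage.get(BlobId.of(bucket, name)); // null if the object is missing
            if (blob != null) {
                existing.add("gs://" + bucket + "/" + name);
            }
        }
        return existing;
    }
}

Each surviving path can then become its own TextIO.Read source, and the resulting collections are merged with Flatten, so a missing hour simply contributes nothing instead of failing the pipeline.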
Maybe not the best answer, but you can always use
GcsUtilFactory.create(options).expand(...)
to grab all files which exist. Then you can create Flatten accordingly.
Waiting for more professional answers.

Syncing a local sqlite file to iCloud

I store some data in my iOS app directly in a local .sqlite file.  I chose to do this instead of CoreData because the data will need to be compatible with non-Apple platforms.
Now, I'm trying to come up with the best way to sync this file over iCloud.  I know you can't sync it directly, for many reasons.  I know CoreData is able to sync its DBs, but even ignoring that using CD would essentially lock this file into Apple platforms (I think? I've only looked into CD a bit), I need the iCloud syncing of this file to work across ALL of iCloud's supported platforms - which is supposed to include Windows.  I have to assume that there won't be any compatibility for the CoreData files in the Windows API.  Planning out the best way to accomplish this would be a lot easier if Apple would tell us any more than "There will be a Windows API [eventually?]"
In addition, I'll eventually need to implement at least one more sync service to support platforms that iCloud does not.  It would be helpful, though not required, if the method I use for iCloud can be mostly reused for future services.
For these reasons, I don't think CoreData can help me with this.  Am I correct in thinking this?
Moving on from there, I need to devise an algorithm for this, or find an existing one or an existing 3rd-party solution. I haven't stumbled across anything yet. However, I have been mulling over a couple of possible methods I could implement:
Method 1:
Do something similar to how CoreData syncs sqlite DBs: send "transaction logs" to iCloud instead and build each local sqlite file off of those.
I'm thinking each device would send a (uniquely named) text file listing all the SQL commands that the device executed, with timestamps. The device would store how far along each list of commands it has executed, and continue from that point each time the file is updated. If it received updates to multiple log files at once, it would execute each command in timestamp order.
Things could get 'interesting' efficiency-wise once these files get large, but it seems like a solvable problem.  
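To make the idea concrete, here is a rough sketch of the replay step (Java purely for illustration, since the real client would be iOS code; the tab-separated "timestamp<TAB>sql" log format is just an assumption):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TransactionLogReplayer {
    // How many lines of each device's log have already been executed locally.
    private final Map<Path, Integer> applied = new HashMap<>();

    public void replay(List<Path> deviceLogs, Connection db) throws IOException, SQLException {
        List<Map.Entry<Long, String>> pending = new ArrayList<>();
        for (Path log : deviceLogs) {
            List<String> lines = Files.readAllLines(log);
            int done = applied.getOrDefault(log, 0);
            for (String line : lines.subList(done, lines.size())) {
                String[] parts = line.split("\t", 2);           // "timestamp<TAB>sql"
                pending.add(Map.entry(Long.parseLong(parts[0]), parts[1]));
            }
            applied.put(log, lines.size());
        }
        pending.sort(Map.Entry.comparingByKey());               // replay in timestamp order
        try (Statement stmt = db.createStatement()) {
            for (Map.Entry<Long, String> cmd : pending) {
                stmt.execute(cmd.getValue());
            }
        }
    }
}

The "how far have I applied each log" bookkeeping (here just an in-memory map) is what would let a device pick up where it left off whenever a log file is updated; in practice it would have to be persisted.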
Method 2:
Periodically sync a copy of the working database to iCloud.  Have a modification timestamp field in every record.  When an updated copy of the DB comes through, query all the records with newer timestamps than some reference time and update the record in the local DB from the new data.
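As a rough sketch of that merge step (again Java/JDBC purely for illustration; the items table, its columns, and the modified_at field are hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class TimestampMerge {
    // Copies every record changed after 'since' from the incoming DB copy into the local DB.
    public static void merge(Connection incoming, Connection local, long since) throws SQLException {
        String select = "SELECT id, name, modified_at FROM items WHERE modified_at > ?";
        String upsert = "INSERT INTO items(id, name, modified_at) VALUES(?, ?, ?) "
                + "ON CONFLICT(id) DO UPDATE SET name = excluded.name, modified_at = excluded.modified_at";
        try (PreparedStatement sel = incoming.prepareStatement(select);
             PreparedStatement ups = local.prepareStatement(upsert)) {
            sel.setLong(1, since);
            try (ResultSet rs = sel.executeQuery()) {
                while (rs.next()) {
                    ups.setLong(1, rs.getLong("id"));
                    ups.setString(2, rs.getString("name"));
                    ups.setLong(3, rs.getLong("modified_at"));
                    ups.executeUpdate();                        // ON CONFLICT upserts need SQLite 3.24+
                }
            }
        }
    }
}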
I see many potential problems with this method:
- Have to implement something further to recognize record deletion.
- The DB file could get conflicts. It might be possible to deal with them by handling each conflict version in timestamp order.
- Determining the date to check each update from could be tricky, as it depends on which device the update is coming from.
There are a lot of potential problems with method 2, but method 1 seems doable to me...
Does anyone have any suggestions as to what might be the best course of action? Any better ideas than my "Method 1" (or reasons why it wouldn't work)?
Try those two solutions from Ray Wenderlich:
Exporting/Importing data through mail:
http://www.raywenderlich.com/1980/how-to-import-and-export-app-data-via-email-in-your-ios-app
File Sharing with iTunes:
http://www.raywenderlich.com/1948/how-integrate-itunes-file-sharing-with-your-ios-app
I found it quite complex, but it helped me a lot.
Both method 1 and method 2 seem doable. Perhaps a combination of the two in fact - use iCloud to send a separate database file that is a subset of data - i.e. just changed items. Or maybe another file format instead of sqlite db - XML/JSON/CSV etc.
Another alternative is to do it outside of iCloud - i.e. a simple custom web service for syncing. So each change gets submitted to a central server via JSON/XML over HTTP, and then other devices pull updates from that.
Obviously it depends how much data and how many devices you want to sync across, and whether you have access to an appropriate server and/or budget to cover running such a server. iCloud will do that for "free" but all it really does is transfer files. A custom solution allows you to define your syncing model as you wish, but you have to develop and manage it and pay for it.
I've considered the possibility of transferring a database file through iCloud but I think that I would run into classic problems of timing - slow start for the user - and corrupted databases if the app is run on multiple devices simultaneously. (iPad/iPhone for example).
Sooo. I've had to use the transaction logs method. It really is difficult to implement, but once in place, seems ok.
I am using Apple's SharedCoreData sample as the base for this work. This link requires an Apple Developer Account.
I did find a much, much better solution from Tim Roadley; however, it only works on iOS and I needed both iOS and macOS.
<rant>iCloud development really has to get easier and more stable!</rant>

parse an active log file

Looking for a little help getting started on a little project I've had in the back of my mind for a while.
I have log files varying in size from 50-500 MB, depending on how often they are cleaned. I'd like to write a program that will monitor a log file while it's actively being written to. When in use it changes pretty quickly, easily several hundred lines a second or so. Most if not all of the examples I've seen for reading log/text files simply open the file and read its contents into a variable, which isn't really feasible to do every time the file changes in this situation. I haven't settled on a language to write this in, but it's on a Windows box and I can work in .NET flavors / Java / or PHP (heh, don't think PHP will fly too well for this), and can likely muddle through another language if someone has a suggestion for something well built for handling this.
Essentially, I believe what I'm looking for would be better described as a high-speed way of monitoring a text file for changes and seeing what those changes are. Each line written is relatively small (less than 300 characters), so it's not big data on each line.
EDIT: reworded to hopefully better describe what I'm trying to do, which is to write a program that keeps an eye on a log file for a trigger and then matches a corresponding action to that trigger. So my question here pertains to file handling inside a programming language.
I greatly appreciate any thoughts/comments.
If the file only grows incrementally, you can read the whole file the first time you start analyzing logs and keep the current size as n. The next time you check (maybe a timed action that looks at the last-modified date), just skip the first n bytes, read all the new bytes, and update n.
Otherwise you could use tail -f by capturing its stdout and using it for your purposes.
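A minimal sketch of the incremental approach in Java (the file path and poll interval are placeholders, and the trigger matching is left as a stub):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class LogTailer {
    public static void main(String[] args) throws IOException, InterruptedException {
        String path = "C:\\logs\\app.log";    // hypothetical log file
        long offset = 0;                      // bytes already processed

        while (true) {
            try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                long length = raf.length();
                if (length < offset) {
                    offset = 0;               // file was truncated or rotated, start over
                }
                if (length > offset) {
                    raf.seek(offset);
                    byte[] chunk = new byte[(int) (length - offset)];
                    raf.readFully(chunk);
                    offset = length;
                    for (String line : new String(chunk, StandardCharsets.UTF_8).split("\\r?\\n")) {
                        // match the line against your trigger here
                        // (a real version would also carry over a partial last line)
                    }
                }
            }
            Thread.sleep(500);                // poll interval; tune to taste
        }
    }
}

This rereads only the bytes added since the last poll, so even a 500 MB file costs very little per iteration once the initial pass is done.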
The 'keep an eye on a log file' part of what you are describing is what tail does.
If you plan to implement it in Java, you can check this question: Java IO implementation of unix/linux "tail -f" and add your trigger logic to lines read.
I suggest not reinventing the wheel.
Try using the Elastic stack (elastic.co).
All of these applications are open source and free, and together they are capable of monitoring input and triggering actions based on it.
Filebeat - reads the log file line by line (it supports multiline log messages as well) and ships the lines to Logstash. There are plenty of other shippers you can use as well.
Logstash - takes the log messages, filters them, adds tags, and sends them to Elasticsearch.
Elasticsearch - indexes the log messages and stores them. It is also capable of running actions based on input.
Kibana - a user-friendly web interface for querying and analyzing the data, or for simply putting it up on a dashboard.
Hope this helps.

Techniques for writing critical text data

We collect text/CSV-like data over long periods (~days) from costly experiments, so file corruption is to be avoided at all costs.
Recently, a file was copied via Explorer on XP whilst the experiment was in progress and the data was partially lost, presumably due to a multiple-access conflict.
What are some good techniques to avoid such loss? - We are using Delphi on Windows XP systems.
Some ideas we came up with are listed below - we'd welcome comments as well as your own input.
Use a database as a secondary data storage mechanism and take advantage of the atomic transaction mechanisms
How about splitting the large file into separate files, one for each day?
If these machines are on a network: send a HTTP post with the logging data to a webserver.
(sending UDP packets would be even simpler).
Make sure you only copy old data. If you have a timestamp on the filename with a 1 hour resolution, you can safely copy the data older than 1 hour.
If a write fails, cache the result for a later write - so if the file is opened externally, the data is still held internally, or could even be stored to disk.
I think what you're looking for is the Win32 CreateFile API, with these flags:
FILE_FLAG_WRITE_THROUGH : Write operations will not go through any intermediate cache, they will go directly to disk.
FILE_FLAG_NO_BUFFERING : The file or device is being opened with no system caching for data reads and writes. This flag does not affect hard disk caching or memory mapped files.
There are strict requirements for successfully working with files opened with CreateFile using the FILE_FLAG_NO_BUFFERING flag, for details see File Buffering.
Each experiment must use a 'work' file and a 'done' file. The work file is opened exclusively and the done file is copied to a place on the network. An application on the receiving machine would feed those files into a database. If Explorer tries to move or copy the work file, it will receive an 'Access denied' error.
The 'work' file would become 'done' after a certain period (say, 6/12/24 hours or whatever period suits you). The application then creates another work file (the name must contain the timestamp) and sends the 'done' file over the network (or a human can do that, which is what you are doing now, if I understand your text correctly).
Copying a file while it is in use is asking for corruption.
Write data to a buffer file in an obscure directory and copy the data to the 'public' data file periodically (every 10 points, for instance), thereby reducing writes and also providing a backup.
Write data points discretely, i.e. open and close the file handle for every data point written - this reduces the amount of time the file is held open, provided data points arrive relatively infrequently.

Resources