Read into Dask from Minio raises issue with reading / converting binary string JSON into utf8 - dask

I'm trying to read JSON-LD into Dask from Minio. The pipeline works, but the content arrives from Minio as byte strings.
For example,
with oss.open('gleaner/summoned/repo/file.jsonld', 'rb') as f:
    print(f.read())
results in
b'\n{\n "@context": "http://schema.org/",\n "@type": "Dataset",\n ...
I can simply convert this with
with oss.open('gleaner/summoned/repo/file.jsonld', 'rb') as f:
    print(f.read().decode("utf-8"))
and now everything is as I expect it.
However, I am working with Dask, and when reading into a bag with
dgraphs = db.read_text('s3://bucket/prefa/prefb/*.jsonld',
                       storage_options={
                           "key": key,
                           "secret": secret,
                           "client_kwargs": {"endpoint_url": "https://example.org"}
                       }).map(json.loads)
I cannot get the content coming from Minio to arrive as strings rather than byte strings. I suspect it needs to be converted before it reaches the json.loads map.
I assume I can inject the "decode" in here somehow as well, but I can't work out how.
Thanks

As the name implies, read_text opens the remote file in text mode, equivalent to open(..., 'rt'). The signature of read_text includes the various decoding arguments, such as UTF8 as the default encoding. You should not need to do anything else, but please post a specific error if you are having trouble, ideally with example file contents.
If your data isn't delimited by lines, read_text might not be right for you, and you can do something like
import json
import dask

@dask.delayed
def read_a_file(fn):
    # or preferably open in text mode and json.load from the file
    with oss.open(fn, 'rb') as f:
        return json.loads(f.read().decode("utf-8"))

output = [read_a_file(f) for f in filenames]
and then you can create a bag or dataframe from this, as required.
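To see why line-delimited reading is a problem for pretty-printed JSON-LD, here is a minimal standard-library sketch. The sample payload and the helpers parse_whole and count_line_failures are made up for illustration, and no Dask or Minio is involved. It shows that the whole document parses once decoded, while parsing the same payload one line at a time, which is roughly what mapping json.loads over line-based text amounts to, fails on every line.

```python
import json

# Hypothetical payload, shaped like the bytes shown in the question.
raw = b'\n{\n "@context": "http://schema.org/",\n "@type": "Dataset"\n}\n'


def parse_whole(payload: bytes) -> dict:
    """Parse one complete JSON document from UTF-8 bytes."""
    # The explicit decode mirrors the question; json.loads also accepts
    # UTF-8 bytes directly on Python 3.6 and later.
    return json.loads(payload.decode("utf-8"))


def count_line_failures(payload: bytes) -> int:
    """Count the non-empty lines that are not valid JSON on their own."""
    failures = 0
    for line in payload.decode("utf-8").splitlines():
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            failures += 1
    return failures


doc = parse_whole(raw)
line_failures = count_line_failures(raw)
print(doc["@type"], line_failures)
```

If the files are multi-line documents like this one, reading each file whole (as in the delayed function above) is likely the safer route.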

Related

Ruby: Is there a way to specify your encoding in File.write?

TL;DR
How would I specify the mode of encoding on File.write, or how would one save image binary to a file in a similar fashion?
More Details
I'm trying to download an image from a Trello card and then upload that image to S3 so it has an accessible URL. I have been able to download the image from Trello as binary (I believe it is some form of binary), but I have been having issues saving this as a .jpeg using File.write. Every time I attempt that, it gives me this error in my Rails console:
Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8
from /app/app/services/customer_order_status_notifier/card.rb:181:in `write'
And here is the code that triggers that:
def trello_pics
  @trello_pics ||=
    card.attachments.last(config_pics_number)&.map(&:url).map do |url|
      binary = Faraday.get(url, nil, {'Authorization' => "OAuth oauth_consumer_key=\"#{ENV['TRELLO_PUBLIC_KEY']}\", oauth_token=\"#{ENV['TRELLO_TOKEN']}\""}).body
      File.write(FILE_LOCATION, binary) # doesn't work
      run_me
    end
end
So I figure this must be an issue with the way that File.write converts the input into a file. Is there a way to specify encoding?
AFAIK you can't do it at the time of performing the write, but you can do it at the time of creating the File object; here is an example with UTF-8 encoding:
File.open(FILE_LOCATION, "w:UTF-8") do |f|
  f.write(....)
end
Another possibility would be to use the external_encoding option:
File.open(FILE_LOCATION, "w", external_encoding: Encoding::UTF_8)
Of course, this assumes that the data being written is a String. If you have (packed) binary data, you would open the file with "wb" and use syswrite instead of write to write the data to the file.
UPDATE As engineersmnky points out in a comment, the arguments for the encoding can also be passed as parameter to the write method itself, for instance
IO::write(FILE_LOCATION, data_to_write, external_encoding: Encoding::UTF_8)

Handling Binary (excel) file in Multi-data Post data in Suave.IO

I am trying to build a simple Suave.IO application to centralize the sending of emails. Currently the application has one endpoint that takes subject, body, recipients, attachments, and sender as form data and turns them into an EWS email message from a logging email account.
Everything works as intended in most cases, but when one of the attachments is an Excel file, that file ends up corrupted.
Currently, I am filtering the request.multipartFields down to only the ones that are marked as attachment files, and then doing this:
for (fileField: (string*string)) in fileFields do
    let fname = (fst fileField)
    let fpath = "uploadedFiles\\" + fname
    File.WriteAllBytes(fpath, Encoding.ASCII.GetBytes (snd fileField)) |> ignore
The file path and the attachment names are then fed into the EWS message before sending.
Again, this seems to work with all attachments except binary ones. It seems that Suave.IO automatically exposes all multipartFields as (string*string), which may require special handling when the data is binary.
How should I handle upload of binary files?
Thanks all in advance.
It looks like the issue was one of encoding. I was testing with Python's requests library, and by default the files are encoded as multipart/form-data. By specifying a content type for each file, I was able to help the server identify the incoming data as a file.
instead of
requests.post(url, data=data, files={filename: open(filepath, 'rb')})
I needed to make it
requests.post(url, data=data, files={filename: (filename, open(filepath, 'rb'), mimetypes.guess_type(filepath)[0])})
With the second python script, files do end up in the files section of the request and I was able to save the excel file without corruption.
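A sketch of building the per-file tuple that requests accepts in files=, namely (filename, file object, content type). The helper build_file_part, the file name report.xlsx, and the fallback to application/octet-stream are assumptions for illustration; mimetypes.guess_type is the standard-library call, and it returns a (type, encoding) pair, so only the first element is used. The actual POST is left as a comment so the block runs without a server.

```python
import io
import mimetypes


def build_file_part(filename, fileobj):
    """Return a (filename, fileobj, content_type) tuple for requests' files=."""
    content_type, _encoding = mimetypes.guess_type(filename)
    # Assumed fallback for unknown extensions: generic binary content type.
    return (filename, fileobj, content_type or "application/octet-stream")


# Stand-in for open(filepath, 'rb'); the bytes are not a real workbook.
payload = io.BytesIO(b"PK\x03\x04 placeholder bytes")
part = build_file_part("report.xlsx", payload)
files = {"report.xlsx": part}

# With the requests library installed, the upload would then be:
# requests.post(url, data=data, files=files)
```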

How to correctly handle character encoding when using Postgresql's copy_data function?

In my Rails app, I managed to stream large CSV files directly from Postgres based on solutions mentioned in this SO post. My working code looks somewhat like so:
query = <A Long SQL Query String>
response.headers["Cache-Control"] = "no-cache"
response.headers["Content-Type"] = "text/csv; charset=utf-8"
response.headers["Content-Disposition"] =
%(attachment; filename="#{csv_filename}")
response.headers["Last-Modified"] = Time.now.ctime.to_s
conn = ActiveRecord::Base.connection.raw_connection
conn.copy_data("COPY (#{query}) TO STDOUT WITH (FORMAT CSV, HEADER TRUE, FORCE_QUOTE *, ESCAPE E'\\\\');") do
  while row = conn.get_copy_data
    response.stream.write row
  end
end
response.stream.close
end
Some of the columns (VARCHAR) being queried have values as either English or Chinese strings. The CSV file resulting from the above code doesn’t show the Chinese characters as is. Instead, I get something like this:
大大 文文
Am I supposed to change the way I’m using the copy_data function, or is there something I could do to the CSV file to solve this? I’ve tried saving the file as UTF-8 .txt file, as well as trying the convert_to function mentioned in the copy_data documentation, but to no avail.
This depends on the encoding of the generated CSV file.
On Linux, check it with:
file -i your_file
Are you sure it's not UTF-16 or GB 18030?
Also, which encoding is your database set up with?
Run \l in psql to see this.
So it boiled down to my MS Excel not being able to render the Chinese chars correctly. On MacOS, opening the same .csv file using the Numbers app (or even Atom, for that matter) resolved this issue for me.
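If the file does have to open correctly in Excel, one common workaround is to prefix the UTF-8 output with a byte order mark, which Excel generally uses to detect the encoding. The sketch below uses Python's utf-8-sig codec only to illustrate the byte layout; the sample rows are made up, and applying the same idea to the Rails stream (writing the three BOM bytes before the first row) is an untested assumption.

```python
import codecs
import csv
import io

# Made-up rows containing the Chinese text from the question.
rows = [["name", "label"], ["example", "大大 文文"]]

buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()

plain = text.encode("utf-8")         # no BOM: a reader has to guess the encoding
with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with EF BB BF

# What UTF-8 bytes look like when a reader wrongly assumes Windows-1252.
misread = "大".encode("utf-8").decode("cp1252")

# Reading with utf-8-sig strips the BOM again.
decoded = with_bom.decode("utf-8-sig")
```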

How to read string stored in hdf5 format files by DM

I am scripting with DM and would like to read hdf5 file format.
I borrowed Tore Niermann's gms_HDF5_Plug-In (hdf5_GMS2X_amd64.dll) and his CMD_import_hdf5.s script. It uses h5_read_dataset(filename, datapath) to read an image dataset.
I am trying to figure out how to read string info stored in the same file. I am particularly interested in reading the angle, which is stored as a string (figure: demonstrated string to read). The h5_read_dataset(filename, datapath) function doesn't work for reading strings.
There is a help file (hdf5_plugin.chm) with a list of functions but unfortunately I can't open them to see more info.
(Figure: hdf5_plugin.chm showing the function list.)
I suppose the right function to read strings would be something like h5_read_attr() or h5_info(), but I couldn't test them: DM always says the two functions don't exist.
After reading the angle as a string, I will also need a bit of help converting the string to a double datatype.
Thank you.
Converting String to Number is done with the Val() command.
There is no integer/double/float concept for variables in DM-script; all are just number. (This is different for images, where you can define the numeric type. Also: for file import/export, a type differentiation can be made using the taggroup streaming commands in the other answer.)
Example script:
string numStr = "1.234e-2"
number num = val( numStr )
ClearResults()
Result( "\n As string:" + numStr )
Result( "\n As value:" + num )
Result( "\n As value, formatted:" + Format(num,"%3.2f") )
Potential answer regarding the .chm files: When you download (or email) .chm files in Windows, the OS classifies them as "potentially dangerous" (because they could contain executable HTML code, I think). As a result, these files cannot be shown by default. However, you can right-click these files and "unblock" them in the file properties.
I think this will be most likely a question specific to that plugin and not general DM scripting. So it might be better to contact the plugin-author directly.
The alternative (not good) solution would be to write your own HDF5 file reader, if you know the file format. For this you would need the "Streaming" commands of the DM script language to browse through the (binary?) source file to the appropriate file location. The F1 help documentation on these commands would be the starting point for reading on this.

String Stream in Prolog?

I have to work with some SWI-Prolog code that opens a new stream (which creates a file on the file system) and pours some data in. The generated file is read somewhere else later on in the code.
I would like to replace the file stream with a string stream in Prolog so that no files are created and then read everything that was put in the stream as one big string.
Does SWI-Prolog have string streams? If so, how could I use them to accomplish this task? I would really appreciate it if you could provide a small snippet. Thank you!
SWI-Prolog implements memory files (library(memfile)). Here is a snippet from some old code of mine that does both the write and the read:
%% html2text(+Html, -Text) is det.
%
% convert from html to text
%
html2text(Html, Text) :-
html_clean(Html, HtmlDescription),
new_memory_file(Handle),
open_memory_file(Handle, write, S),
format(S, '<html><head><title>html2text</title></head><body>~s</body></html>', [HtmlDescription]),
close(S),
open_memory_file(Handle, read, R, [free_on_close(true)]),
load_html_file(stream(R), [Xml]),
close(R),
xpath(Xml, body(normalize_space), Text).
Another option is using with_output_to/2 combined with current_output/1:
write_your_output_to_stream(Stream) :-
format(Stream, 'example output\n', []),
format(Stream, 'another line', []).
str_out(Codes) :-
with_output_to(codes(Codes), (
current_output(Stream),
write_your_output_to_stream(Stream)
)).
Usage example:
?- portray_text(true), str_out(C).
C = "example output
another line"
Of course, you can choose between redirecting output to atom, string, list of codes (as per example above), etc., just use the corresponding parameter to with_output_to/2:
with_output_to(atom(Atom), ... )
with_output_to(string(String), ... )
with_output_to(codes(Codes), ... )
with_output_to(chars(Chars), ... )
See with_output_to/2 documentation:
http://www.swi-prolog.org/pldoc/man?predicate=with_output_to/2
Later on, you could use open_string/2, open_codes_stream/2 and similar predicates to open string/list of codes as an input stream to read data.
