Apache Beam / GCP Dataflow encoding issue - character-encoding

i am "playing" with apache beam/dataflow in datalab.
I am trying to read a csv file from gcs.
when i create the pcollection using:
lines = p | 'ReadMyFile' >> beam.io.ReadFromText('gs://' + BUCKET_NAME + '/' + input_file, coder='StrUtf8Coder')
I get the following error:
LookupError: unknown encoding: "THE","NAME","OF","COLUMNS"
It seems the names of the columns are being interpreted as an encoding?
I do not understand what's wrong.
If I do not specify the "coder" I get
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 1045: invalid continuation byte
Outside Apache Beam I am able to handle this error by reading the file from GCS:
blob = storage.Blob(gs_path, bucket)
data = blob.download_as_string()
data.decode('utf-8', 'ignore')
I have read that Apache Beam only supports UTF-8, and the file does not contain only UTF-8.
Should I download the file first and then convert it to a PCollection?
Any suggestion?

A possible hack is to create a class that inherits from the Coder class (apache_beam.coders.coders.Coder):
from apache_beam.coders.coders import Coder

class ISOCoder(Coder):
    """A coder used for reading and writing strings as ISO-8859-1."""

    def encode(self, value):
        return value.encode('iso-8859-1')

    def decode(self, value):
        return value.decode('iso-8859-1')

    def is_deterministic(self):
        return True
and pass an instance of it as the coder argument to the ReadFromText IO transform (apache_beam.io.textio.ReadFromText) provided by Beam, like this:
from apache_beam.io import ReadFromText

with beam.Pipeline(options=pipeline_options) as p:
    new_pcollection = (p | 'Read From GCS' >>
                       beam.io.ReadFromText('input_file', coder=ISOCoder()))
The logic behind this is detailed here:
https://medium.com/@khushboo_16578/cloud-dataflow-and-iso-8859-1-2bb8763cc7c8

I would suggest changing the encoding of the actual file. If you save the file with "Save As" you can select UTF-8 encoding for the format in Excel, for both CSVs and regular .txt files. Once you do that, you need to make sure you add a line of code like:
class DoWork(beam.DoFn):
    def process(self, text):
        text = text.encode('utf-8')
        # Do other stuff
This isn't how I would like to do it because it isn't code-centric, but it has worked for me before. Unfortunately, I don't have a code-centric solution.
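A more code-centric variant would be to combine the custom-coder idea from the answer above with the lenient decode from the question; this is only a sketch, and LenientUtf8Coder is a hypothetical name:

from apache_beam.coders.coders import Coder

class LenientUtf8Coder(Coder):
    """Sketch of a coder that drops undecodable bytes instead of failing."""

    def encode(self, value):
        return value.encode('utf-8')

    def decode(self, value):
        # 'ignore' drops invalid byte sequences, mirroring the
        # data.decode('utf-8', 'ignore') workaround in the question
        return value.decode('utf-8', 'ignore')

    def is_deterministic(self):
        return True

Passed as coder=LenientUtf8Coder() to ReadFromText, this should sidestep the UnicodeDecodeError, at the cost of silently losing the offending bytes.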

Related

Reading text file in Lua using Luacom and ADODB: error

I am constructing a general-purpose function to read a text file, which may be ASCII, UTF-8 or UTF-16. (The encoding is known when the function is invoked.) The file name may contain UTF-8 characters, so the standard Lua io functions are not a solution. I have no control over the Lua implementation (5.3) or the binary modules available in the environment.
My current code is:
require "luacom"
local function readTextFile(sPath, bUnicode, iBits)
local fso = luacom.CreateObject("Scripting.FileSystemObject")
if not fso:FileExists(sPath) then return false, "" end --check the file exists
local so = luacom.CreateObject("ADODB.Stream")
--so.CharSet defaults to Unicode aka utf-16
--so.Type defaults to text
so.Mode = 1 --adModeRead
if not bUnicode then
so.CharSet = "ascii"
elseif iBits == 8 then
so.CharSet = "utf-8"
end
so:Open()
so:LoadFromFile(sPath)
local contents = so:ReadText()
so:Close()
return true, contents
end
--test Unicode(utf-16) files
local file = "D:\\OneDrive\\Desktop\\utf16.txt" --this exists
local booOK, factsetcontents = readTextFile(file, true, 16)
When executed I get the error: COM exception:(d:\my\lua\luacom-master\src\library\tluacom.cpp,382):Operation is not allowed in this context on line 19 [local stream = so:LoadFromFile(sPath)]
I've pored over the ADO documentation and am obviously missing something that is staring me in the face! Is what I'm trying to do impossible?
ETA: If I comment out the line so.Mode = 1, this works. Which is great, but I don't understand why, which means I may end up making the same mistake unwittingly, whatever that mistake is!
I don't know about ADODB Stream.Mode or why the function failed. But I think it's rather tricky to use an ADODB COM object on Windows to read ASCII/UTF-8/Unicode encoded files.
You can instead:
use the standard Lua io.open function in binary mode and decode the bytes manually (a sketch follows below)
use a binary module to do all the work
use a specific Lua implementation for Windows that can read/write those kinds of encoded files natively, like LuaRT
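For the first option, here is a minimal sketch assuming Lua 5.3 (so string.unpack and utf8.char are available); readUtf16LE is a hypothetical helper with only minimal error handling:

local function readUtf16LE(sPath)
    local f = io.open(sPath, "rb")
    if not f then return false, "" end
    local bytes = f:read("a")
    f:close()
    local out, i = {}, 1
    if bytes:sub(1, 2) == "\255\254" then i = 3 end -- skip the BOM (FF FE)
    while i + 1 <= #bytes do
        local unit = string.unpack("<I2", bytes, i) -- one UTF-16LE code unit
        i = i + 2
        if unit >= 0xD800 and unit <= 0xDBFF and i + 1 <= #bytes then
            -- combine a surrogate pair into a single code point
            local low = string.unpack("<I2", bytes, i)
            i = i + 2
            unit = 0x10000 + (unit - 0xD800) * 0x400 + (low - 0xDC00)
        end
        out[#out + 1] = utf8.char(unit)
    end
    return true, table.concat(out)
end

ASCII and UTF-8 files are easier still: read the whole file in binary mode and, for UTF-8, strip a leading "\239\187\191" BOM if present.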

Beam/Dataflow read Parquet files and add file name/path to each record

I'm using the Apache Beam Python SDK and I'm trying to read data from Parquet files using apache_beam.io.parquetio, but I also want to add the filename (or path) to the data, since it contains data as well. I went over the suggested pattern here and read that parquetio is similar to fileio, but it doesn't seem to implement the functionality that lets you go over files and attach the file name to each record.
Anyone figured a good way to implement this?
Thanks!
If the number of files is not tremendous, you can get all the files before you read them through the IO.
import glob
import apache_beam as beam

filelist = glob.glob('/tmp/*.parquet')

p = beam.Pipeline()

class PairWithFile(beam.DoFn):
    def __init__(self, filename):
        self._filename = filename

    def process(self, e):
        yield (self._filename, e)

file_with_records = [
    (p
     | 'Read %s' % file >> beam.io.ReadFromParquet(file)
     | 'Pair %s' % file >> beam.ParDo(PairWithFile(file)))
    for file in filelist
] | beam.Flatten()
Then each element of your PCollection is a (filename, record) tuple.
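If the file list is large or only known at pipeline run time, an alternative sketch, assuming your Beam version's ReadAllFromParquet supports the with_filename flag:

import apache_beam as beam
from apache_beam.io.parquetio import ReadAllFromParquet

with beam.Pipeline() as p:
    file_with_records = (
        p
        | beam.Create(['/tmp/*.parquet'])          # file patterns as data
        | ReadAllFromParquet(with_filename=True))  # yields (path, record)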

Read into Dask from Minio raises issue with reading / converting binary string JSON into utf8

I'm trying to read JSON-LD into Dask from Minio. The pipeline works, but the strings come from Minio as binary strings.
So
with oss.open('gleaner/summoned/repo/file.jsonld', 'rb') as f:
print(f.read())
results in
b'\n{\n "#context": "http://schema.org/",\n "#type": "Dataset",\n ...
I can simply convert this with
with oss.open('gleaner/summoned/repo/file.jsonld', 'rb') as f:
print(f.read().decode("utf-8"))
and now everything is as I expect it.
However, I am working with Dask, and when reading into a bag with
dgraphs = db.read_text('s3://bucket/prefa/prefb/*.jsonld',
                       storage_options={
                           "key": key,
                           "secret": secret,
                           "client_kwargs": {"endpoint_url": "https://example.org"}
                       }).map(json.loads)
I cannot get the content coming from Minio to become strings rather than binary strings. I suspect I need these converted before they hit the json.loads map.
I assume I can inject the "decode" in here somehow as well, but I can't resolve how.
Thanks
Thanks
As the name implies, read_text opens the remote file in text mode, equivalent to open(..., 'rt'). The signature of read_text includes the various decoding arguments, such as encoding, with UTF-8 as the default. You should not need to do anything else, but please post a specific error if you are having trouble, ideally with example file contents.
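For illustration, a sketch of the same read with the decoding arguments spelled out (keyword names as in dask.bag.read_text; encoding='utf-8' is already the default):

import json
import dask.bag as db

dgraphs = db.read_text(
    's3://bucket/prefa/prefb/*.jsonld',
    encoding='utf-8',   # default: lines arrive as str, not bytes
    errors='strict',    # or 'ignore'/'replace' for dirty input
    storage_options={
        "key": key,
        "secret": secret,
        "client_kwargs": {"endpoint_url": "https://example.org"},
    },
).map(json.loads)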
If your data isn't delimited by lines, read_text might not be right for you, and you can do something like
@dask.delayed
def read_a_file(fn):
    # or preferably open in text mode and json.load from the file
    with oss.open(fn, 'rb') as f:
        return json.loads(f.read().decode("utf-8"))

output = [read_a_file(f) for f in filenames]
and then you can create a bag or dataframe from this, as required.
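To build a bag from those delayed values, each delayed result should be a list (one partition per file); a minimal sketch under that assumption, where read_a_file_as_partition is a hypothetical name and oss and filenames are the objects from the question:

import json
import dask
import dask.bag as db

@dask.delayed
def read_a_file_as_partition(fn):
    # one file becomes one single-element bag partition
    with oss.open(fn, 'rb') as f:
        return [json.loads(f.read().decode('utf-8'))]

bag = db.from_delayed([read_a_file_as_partition(fn) for fn in filenames])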

python xlrd: convert xls to csv using tempfiles. Tempfile is empty

I am downloading an xls file from the internet. It is in .xls format but I need 'Sheet1' to be in csv format. I use xlrd to make the conversion but seem to have run into an issue where the file I write to is empty?
import urllib2
import tempfile
import csv
import xlrd

url_2_fetch = ____
u = urllib2.urlopen(url_2_fetch)
wb = xlrd.open_workbook(file_contents=u.read())
sh = wb.sheet_by_name('Sheet1')

csv_temp_file = tempfile.TemporaryFile()
with open('csv_temp_file', 'wb') as f:
    writer = csv.writer(f)
    for rownum in xrange(sh.nrows):
        writer.writerow(sh.row_values(rownum))
That seemed to have worked. But now I want to inspect the values by doing the following:
with open('csv_temp_file', 'rb') as z:
    reader = csv.reader(z)
    for row in reader:
        print row
But I get nothing:
>>> with open('csv_temp_file', 'rb') as z:
...     reader = csv.reader(z)
...     for row in reader:
...         print row
...
>>>
I am using a tempfile because I want to do more parsing of the content and then use SQLAlchemy to store the contents of the csv post more parsing to a mySQL database.
I appreciate the help. Thank you.
This is completely wrong:
csv_temp_file = tempfile.TemporaryFile()
with open('csv_temp_file', 'wb') as f:
    writer = csv.writer(f)
The tempfile.TemporaryFile() call returns "a file-like object that can be used as a temporary storage area. The file will be destroyed as soon as it is closed (including an implicit close when the object is garbage collected)."
So your variable csv_temp_file contains a file object, already open, that you can read and write to, and will be deleted as soon as you call .close() on it, overwrite the variable, or cleanly exit the program.
So far so good. But then you proceed to open another file with open('csv_temp_file', 'wb') that is not a temporary file, is created in the script's current directory with the fixed name 'csv_temp_file', is overwritten every time this script is run, can cause security holes, strange bugs and race conditions, and is not related to the variable csv_temp_file in any way.
You should trash the with open statement and use the csv_temp_file variable you already have. Call .seek(0) on it before reading it back with the csv reader, and call .close() on it when you are done; the temporary file will be deleted at that point.
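Concretely, a sketch of the corrected flow (Python 2, reusing sh from the question):

import csv
import tempfile

csv_temp_file = tempfile.TemporaryFile()
writer = csv.writer(csv_temp_file)     # write to the file object itself
for rownum in xrange(sh.nrows):
    writer.writerow(sh.row_values(rownum))

csv_temp_file.seek(0)                  # rewind before reading back
for row in csv.reader(csv_temp_file):
    print row

csv_temp_file.close()                  # the temporary file is deleted here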

How to open Excel file written with incorrect character encoding in VBA

I read an Excel 2003 file with a text editor to see some markup language.
When I open the file in Excel it displays incorrect characters. On inspection of the file I see that the encoding is Windows 1252 or some such. If I manually replace this with UTF-8, my file opens fine. Ok, so far so good, I can correct the thing manually.
Now the trick is that this file is generated automatically, and I need to process it automatically (no human interaction) with limited tools on my desktop (no Perl or other scripting language).
Is there any simple way to open this XL file in VBA with the correct encoding (and ignore the encoding specified in the file)?
Note: Workbook.ReloadAs does not work for me; it bails out with an error (and requires manual action, as the file is already open).
Or is the only way to correct the file to go through some hoops? Either: text in, check line for encoding string, replace if required, write each line to new file...; or export to csv, then import from csv again with specific encoding, save as xls?
Any hints appreciated.
EDIT:
ADODB did not work for me (Excel says "user-defined type not defined").
I solved my problem with a workaround:
name2 = Replace(name, ".xls", ".txt")
Set wb = Workbooks.Open(name, True, True) ' open read-only
Set ws = wb.Worksheets(1)
ws.SaveAs FileName:=name2, FileFormat:=xlCSV
wb.Close False ' close workbook without saving changes
Set wb = Nothing ' free memory
Workbooks.OpenText FileName:=name2, _
    Origin:=65001, _
    DataType:=xlDelimited, _
    Comma:=True
Well, I think you can do it from another workbook. Add a reference to ActiveX Data Objects, then add this sub:
Sub Encode(ByVal sPath$, Optional SetChar$ = "UTF-8")
    Dim stream As ADODB.Stream
    Set stream = New ADODB.Stream
    With stream
        .Open
        .LoadFromFile sPath ' Loads a File
        .Charset = SetChar ' sets stream encoding (UTF-8)
        .SaveToFile sPath, adSaveCreateOverWrite
        .Close
    End With
    Set stream = Nothing
    Workbooks.Open sPath
End Sub
Then call this sub with the path to the file with the wrong encoding.
