How to debug msgpack serialisation issue in Google Cloud Dataflow job? - google-cloud-dataflow

I have a Google Cloud Dataflow job that should extract named entities from text using a specific spaCy model, neural coref.
Running the extraction without Beam works, but when I run it with the DirectRunner the job fails with a serialisation error from msgpack. I am not sure how to proceed in debugging this problem.
My requirements are quite barebones:
apache-beam[gcp]==2.4
spacy==2.0.12
ujson==1.35
The issue might be related to how spacy and beam interact, as the stacktrace shows spacy printing out some of its methods, which it shouldn't be doing.
Weird spacy log behaviour from stacktrace:
T4: <class 'entity.extract_entities.EntityExtraction'>
# T4
D2: <dict object at 0x1126c0398>
T4: <class 'spacy.lang.en.English'>
# T4
D2: <dict object at 0x1126b54b0>
D2: <dict object at 0x1126d1168>
F2: <function is_alpha at 0x11266d320>
# F2
F2: <function is_ascii at 0x112327c08>
# F2
F2: <function is_digit at 0x11266d398>
# F2
F2: <function is_lower at 0x11266d410>
# F2
F2: <function is_punct at 0x112327b90>
# F2
F2: <function is_space at 0x11266d488>
# F2
F2: <function is_title at 0x11266d500>
# F2
F2: <function is_upper at 0x11266d578>
# F2
F2: <function like_url at 0x11266d050>
# F2
F2: <function like_num at 0x110d55140>
# F2
F2: <function like_email at 0x112327f50>
# F2
Fu: <functools.partial object at 0x11266c628>
F2: <function _create_ftype at 0x1070af500>
# F2
T1: <type 'functools.partial'>
F2: <function _load_type at 0x1070af398>
# F2
# T1
F2: <function is_stop at 0x11266d5f0>
# F2
D2: <dict object at 0x1126b7168>
T4: <type 'set'>
# T4
# D2
# Fu
F2: <function is_oov at 0x11266d668>
# F2
F2: <function is_bracket at 0x112327cf8>
# F2
F2: <function is_quote at 0x112327d70>
# F2
F2: <function is_left_punct at 0x112327de8>
# F2
F2: <function is_right_punct at 0x112327e60>
# F2
F2: <function is_currency at 0x112327ed8>
# F2
Fu: <functools.partial object at 0x110d49ba8>
F2: <function _get_attr_unless_lookup at 0x1106e26e0>
# F2
F2: <function lower at 0x11266d140>
# F2
D2: <dict object at 0x112317c58>
# D2
D2: <dict object at 0x110e38168>
# D2
D2: <dict object at 0x112669c58>
# D2
# Fu
F2: <function word_shape at 0x11266d0c8>
# F2
F2: <function prefix at 0x11266d1b8>
# F2
F2: <function suffix at 0x11266d230>
# F2
F2: <function get_prob at 0x11266d6e0>
# F2
F2: <function cluster at 0x11266d2a8>
# F2
F2: <function _return_en at 0x11266f0c8>
# F2
# D2
B2: <built-in function unpickle_vocab>
# B2
T4: <type 'spacy.strings.StringStore'>
# T4
My current hypothesis is that there is some problem with my setup.py, but I am not sure what is causing the issue.
The full stacktrace is:
/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/msgpack_numpy.py:183: DeprecationWarning: encoding is deprecated, Use raw=False instead.
return _unpackb(packed, **kwargs)
/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/msgpack_numpy.py:132: DeprecationWarning: encoding is deprecated.
use_bin_type=use_bin_type)
T4: <class 'entity.extract_entities.EntityExtraction'>
# T4
D2: <dict object at 0x1126c0398>
T4: <class 'spacy.lang.en.English'>
# T4
D2: <dict object at 0x1126b54b0>
D2: <dict object at 0x1126d1168>
F2: <function is_alpha at 0x11266d320>
# F2
F2: <function is_ascii at 0x112327c08>
# F2
F2: <function is_digit at 0x11266d398>
# F2
F2: <function is_lower at 0x11266d410>
# F2
F2: <function is_punct at 0x112327b90>
# F2
F2: <function is_space at 0x11266d488>
# F2
F2: <function is_title at 0x11266d500>
# F2
F2: <function is_upper at 0x11266d578>
# F2
F2: <function like_url at 0x11266d050>
# F2
F2: <function like_num at 0x110d55140>
# F2
F2: <function like_email at 0x112327f50>
# F2
Fu: <functools.partial object at 0x11266c628>
F2: <function _create_ftype at 0x1070af500>
# F2
T1: <type 'functools.partial'>
F2: <function _load_type at 0x1070af398>
# F2
# T1
F2: <function is_stop at 0x11266d5f0>
# F2
D2: <dict object at 0x1126b7168>
T4: <type 'set'>
# T4
# D2
# Fu
F2: <function is_oov at 0x11266d668>
# F2
F2: <function is_bracket at 0x112327cf8>
# F2
F2: <function is_quote at 0x112327d70>
# F2
F2: <function is_left_punct at 0x112327de8>
# F2
F2: <function is_right_punct at 0x112327e60>
# F2
F2: <function is_currency at 0x112327ed8>
# F2
Fu: <functools.partial object at 0x110d49ba8>
F2: <function _get_attr_unless_lookup at 0x1106e26e0>
# F2
F2: <function lower at 0x11266d140>
# F2
D2: <dict object at 0x112317c58>
# D2
D2: <dict object at 0x110e38168>
# D2
D2: <dict object at 0x112669c58>
# D2
# Fu
F2: <function word_shape at 0x11266d0c8>
# F2
F2: <function prefix at 0x11266d1b8>
# F2
F2: <function suffix at 0x11266d230>
# F2
F2: <function get_prob at 0x11266d6e0>
# F2
F2: <function cluster at 0x11266d2a8>
# F2
F2: <function _return_en at 0x11266f0c8>
# F2
# D2
B2: <built-in function unpickle_vocab>
# B2
T4: <type 'spacy.strings.StringStore'>
# T4
Traceback (most recent call last):
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/Users/chris/coref_entity_extraction/main.py", line 29, in <module>
run()
File "/Users/chris/coref_entity_extraction/main.py", line 24, in run
entities = records | 'ExtractEntities' >> beam.ParDo(EntityExtraction())
File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/apache_beam/transforms/core.py", line 784, in __init__
super(ParDo, self).__init__(fn, *args, **kwargs)
File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 638, in __init__
self.fn = pickler.loads(pickler.dumps(self.fn))
File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/apache_beam/internal/pickler.py", line 204, in dumps
s = dill.dumps(o)
File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/dill/dill.py", line 259, in dumps
dump(obj, file, protocol, byref, fmode, recurse)#, strictio)
File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/dill/dill.py", line 252, in dump
pik.dump(obj)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 425, in save_reduce
save(state)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/apache_beam/internal/pickler.py", line 172, in new_save_module_dict
return old_save_module_dict(pickler, obj)
File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/dill/dill.py", line 841, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 692, in _batch_setitems
save(v)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 425, in save_reduce
save(state)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/apache_beam/internal/pickler.py", line 172, in new_save_module_dict
return old_save_module_dict(pickler, obj)
File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/dill/dill.py", line 841, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 687, in _batch_setitems
save(v)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 401, in save_reduce
save(args)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 568, in save_tuple
save(element)
File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "vectors.pyx", line 108, in spacy.vectors.Vectors.__reduce__
File "vectors.pyx", line 409, in spacy.vectors.Vectors.to_bytes
File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/spacy/util.py", line 485, in to_bytes
serialized[key] = getter()
File "vectors.pyx", line 404, in spacy.vectors.Vectors.to_bytes.serialize_weights
File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/msgpack_numpy.py", line 165, in packb
return Packer(**kwargs).pack(o)
File "msgpack/_packer.pyx", line 282, in msgpack._cmsgpack.Packer.pack
File "msgpack/_packer.pyx", line 288, in msgpack._cmsgpack.Packer.pack
File "msgpack/_packer.pyx", line 285, in msgpack._cmsgpack.Packer.pack
File "msgpack/_packer.pyx", line 232, in msgpack._cmsgpack.Packer._pack
File "msgpack/_packer.pyx", line 279, in msgpack._cmsgpack.Packer._pack
TypeError: can not serialize 'buffer' object
I have no idea how to debug this issue with beam. To reproduce the whole issue I have set up a repo with instructions on how to set everything up: https://github.com/swartchris8/coref_barebones

Are you able to run the same code from a regular Python program (not from a Beam DoFn)?
If it does run there, check whether you are storing any non-serializable state in a Beam DoFn (or any other function that will be serialized by Beam). Such state prevents Beam runners from serializing these functions (to send them to workers) and hence should be avoided.
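One common way to keep non-serializable state out of a pickled object is to store only small configuration on it and create the heavy object lazily, after deserialization. The following is a minimal sketch of that idea using the standard pickle module. LazyModelHolder and _load_model are hypothetical names, the lock is only a convenient stand-in for an unpicklable model, and in a real pipeline the loading step would presumably live in the DoFn itself (for example in start_bundle) rather than in a plain class.

```python
import pickle
import threading


class LazyModelHolder(object):
    """Hypothetical stand-in for a DoFn that needs a heavy, unpicklable model."""

    def __init__(self, model_name):
        self.model_name = model_name  # small, picklable configuration only
        self._model = None            # the heavy object is created on first use

    def _load_model(self):
        # Stand-in for loading a real model; the lock is included only
        # because it is an object that pickle refuses to serialize.
        return {"name": self.model_name, "lock": threading.Lock()}

    def process(self, text):
        if self._model is None:
            self._model = self._load_model()
        return (self._model["name"], text.upper())

    def __getstate__(self):
        # Drop the loaded model so that only the configuration is serialized.
        state = dict(self.__dict__)
        state["_model"] = None
        return state


holder = LazyModelHolder("en")
holder.process("warm up")                    # the model is now loaded here
clone = pickle.loads(pickle.dumps(holder))   # still picklable
result = clone.process("hello")              # the clone reloads the model itself
```

The design choice is that the serialized form never contains the model, so the serializer never has to walk into it.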

In the end I got rid of the above issue by changing the installed package versions. I do think debugging the beam setup process is quite painful, though; my approach was just to manually try different package permutations.
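Before trying package permutations, it can help to narrow down which attribute of the object refuses to serialize. Below is a minimal sketch of such a check. It uses the standard pickle module as a stand-in (the traceback above shows Beam calling dill, which may behave differently), and find_unpicklable and Example are hypothetical names introduced for illustration.

```python
import pickle
import threading


def find_unpicklable(obj):
    """Return {attribute_name: error_message} for attributes that fail to pickle."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickling can raise several exception types
            failures[name] = "%s: %s" % (type(exc).__name__, exc)
    return failures


class Example(object):
    """Hypothetical object with one picklable and one unpicklable attribute."""

    def __init__(self):
        self.name = "ok"
        self.lock = threading.Lock()  # locks cannot be pickled


failures = find_unpicklable(Example())
for attribute, error in failures.items():
    print(attribute, "->", error)
```

Running the same helper on an instance of the real DoFn should point at the attribute that drags the spaCy objects into the serialized state.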

Related

Is there an R function equivalent to wrapping assignment with `()`?

I just discovered that running (y <- 1:4) prints the result in the console, but ...
(y <- 1:4)
# [1] 1 2 3 4
library(dplyr) # for the pipe operator %>%
y <- 1:4 %>% funct_parenthesis() # DOES IT EXIST ?
# [1] 1 2 3 4
print works:
library(dplyr) # for the pipe operator %>%
y <- 1:4 %>% print
# [1] 1 2 3 4

CSV::MalformedCSVError: New line must be <"\n\r">

Trying to parse this file with Ruby CSV.
https://www.sec.gov/files/data/broker-dealers/company-information-about-active-broker-dealers/bd070219.txt
However, I am getting an error.
CSV.open(file_name, "r", { :col_sep => "\t", :row_sep => "\n\r" }).each do |row|
puts row
end
CSV::MalformedCSVError: New line must be <"\n\r"> not <"\r"> in line 1.
Windows row_sep is "\r\n", not "\n\r". However this CSV is malformed. Looking at it using a hex editor it appears to be using "\r\r\n".
It's tab-delimited.
In addition it is not using proper quoting, line 247 has 600 "B" STREET STE. 2204, so you need to turn off quote characters.
quote_char: nil, col_sep: "\t", row_sep: "\r\r\n"
There's also an extra tab at the end: each line ends with \t\r\r\n. You can look at it as using a row_sep of "\r\n" with an extra \r field.
quote_char: nil, col_sep: "\t", row_sep: "\r\n"
Or you can view it as having a row_sep of \t\r\r\n and no extra field.
quote_char: nil, col_sep: "\t", row_sep: "\t\r\r\n"
Either way, it's a mess.
I used a hex editor to look at the file as text and raw data side by side. This let me see what's truly at the end of the line.
87654321 0011 2233 4455 6677 8899 aabb ccdd eeff 0123456789abcdef
00000000: 3030 3030 3030 3139 3034 0941 4252 4148 0000001904.ABRAH
00000010: 414d 2053 4543 5552 4954 4945 5320 434f AM SECURITIES CO
00000020: 5250 4f52 4154 494f 4e09 3030 3832 3934 RPORATION.008294
00000030: 3532 0933 3732 3420 3437 5448 2053 5452 52.3724 47TH STR
00000040: 4545 5420 4354 2e20 4e57 0920 0947 4947 EET CT. NW. .GIG
00000050: 2048 4152 424f 5209 5741 0939 3833 3335 HARBOR.WA.98335
00000060: 090d 0d0a 3030 3030 3030 3233 3033 0950 ....0000002303.P
          ^^^^^^^^^
Hex 09 0d 0d 0a is \t\r\r\n.
Alternatively, you can print the lines with p and any invisible characters will be revealed.
f = File.open(file_name)
p f.readline
"0000001904\tABRAHAM SECURITIES CORPORATION\t00829452\t3724 47TH STREET CT. NW\t \tGIG HARBOR\tWA\t98335\t\r\r\n"
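The same check can be scripted without a hex editor. Here is a minimal sketch in Python (not the Ruby used in the question) that shows the trailing bytes of the sample line above and splits on the \t\r\r\n terminator. The parse_rows helper is a hypothetical name, and the sample is just the single line printed above rather than the whole file.

```python
# The line printed by `p f.readline` above, as raw bytes.
sample = (
    b"0000001904\tABRAHAM SECURITIES CORPORATION\t00829452\t"
    b"3724 47TH STREET CT. NW\t \tGIG HARBOR\tWA\t98335\t\r\r\n"
)

# Hex of the last four bytes: tab, CR, CR, LF.
tail_hex = sample[-4:].hex()


def parse_rows(data, row_sep=b"\t\r\r\n", col_sep=b"\t"):
    """Split raw bytes into rows and fields, treating row_sep as the terminator."""
    rows = []
    for record in data.split(row_sep):
        if record:  # skip the empty chunk after the final terminator
            rows.append([field.decode("ascii") for field in record.split(col_sep)])
    return rows


rows = parse_rows(sample)
print(tail_hex)
print(rows[0])
```

Treating the whole \t\r\r\n sequence as the row terminator avoids the spurious empty field at the end of each row.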
Use :row_sep => :auto instead of :row_sep => "\n\r":
CSV.open(file_name, "r", { :col_sep => "\t", :row_sep => :auto }).each do |row|
puts row
end

SWI prolog, char_type, ascii / alnum, why so many chars? how to fix it?

I just wanted to check which chars SWI-Prolog treats as 'alnum'.
My query was:
findall(X,char_type(X,alnum),Lalnum),length(Lalnum,N).
and the SWI's answer:
Lalnum = ['0', '1', '2', '3', '4', '5', '6', '7', '8'|...],
N = 816459.
I was very surprised - why so many?
Then I've decided to check pure 'ascii' set - after all, according to the doc page:
http://www.swi-prolog.org/pldoc/doc_for?object=char_type/2
there are only 128 of them (7 bit char set).
My obvious question was:
findall(X,char_type(X,ascii),Lascii),length(Lascii,N).
and the SWI's answer:
Lascii = ['\000\', '\001\', '\002\', '\003\', '\004\',
'\005\', '\006\', '\a', '\b'|...],
N = 2176.
I was surprised even more than before...
What is wrong? Where is the problem?
With my question? With my SWI-prolog installation? With my system?
It is:
SWI Prolog 7.7.13, with ascii encoding:
current_prolog_flag(encoding,X).
X = ascii.
Win 8.1 64bit, with code page 852.
And how to fix it?
Thank you in advance
EDIT:
probably I've found the answer to my second question: 'how to fix it'.
It seems, that additional clause:
sort(Lascii,SortedLascii)
removes repetitions and leaves the basic set of 128 chars alone.
but I still do not understand why the first clause generates so many results...???
The reason for so many characters is Unicode. It'll return all relevant characters depending on your current locale.
Including Unicode:
Letters only:
?- findall(C, char_type(C, alpha), L), length(L, Len).
L = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'|...],
Len = 2568.
Alphanumeric characters:
?- findall(C, char_type(C, alnum), L), length(L, Len).
L = ['0', '1', '2', '3', '4', '5', '6', '7', '8'|...],
Len = 2578.
Now ASCII only:
Letters only:
?- findall(C, (char_type(C, alpha), char_type(C, ascii)), L), length(L, Len).
L = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'|...],
Len = 52.
Alphanumerics:
?- findall(C, (char_type(C, alnum), char_type(C, ascii)), L), length(L, Len).
L = ['0', '1', '2', '3', '4', '5', '6', '7', '8'|...],
Len = 62.
What's causing the confusion?
Because the number of returned items is too high, the output is truncated and the omitted items are replaced with an ellipsis. More details here:
https://www.swi-prolog.org/FAQ/AllOutput.html
To change this behavior and see a complete output use the following config option:
set_prolog_flag(
answer_write_options,
[
quoted(true),
portray(true),
spacing(next_argument)
]
).
This way you'll see all Unicode characters and won't be confused any more.
Note that the only difference from default is absence of max_depth(10).

ZeroMQ dealer socket doesn't work under Elixir (using Erlang's chumak)

I want to communicate between Elixir and Python. I don't want to use NIFs and stuff - I prefer loosely coupled using zeroMQ as this will allow me to use other languages than Python later. I am using the chumak library which is a native implementation of zeromq in Erlang, and seems well maintained. I have used it successfully in the past for pub sub.
Apart from pub-sub, I'm finding that req-rep and req-router sockets work fine. However dealer-router does not. This is really important because only dealer and router give you true async in zeromq.
Here is the python code for the router side:
import zmq
context = zmq.Context()
rout = context.socket(zmq.ROUTER)
rout.bind("tcp://192.168.1.192:8760")
Here is the Elixir req code which works fine...
iex(1)> {ok, sock1} = :chumak.socket(:req, 'reqid')
{:ok, #PID<0.162.0>}
iex(2)> {ok, _peer} = :chumak.connect(sock1, :tcp, '192.168.1.192', 8760)
{:ok, #PID<0.164.0>}
iex(3)> :chumak.send(sock1, 'hello from req socket')
:ok
.... because I get it on the Python side:
In [5]: xx = rout.recv_multipart()
In [6]: xx
Out[6]: ['reqid', '', 'hello from req socket']
However, here is what I get if I try a dealer socket on the Elixir side:
iex(4)> {ok, sock2} = :chumak.socket(:dealer, 'dealid')
{:ok, #PID<0.170.0>}
iex(5)> {ok, _peer} = :chumak.connect(sock2, :tcp, '192.168.1.192', 8760)
{:ok, #PID<0.172.0>}
iex(6)> :chumak.send(sock2, 'hello from dealer socket')
{:error, :not_implemented_yet}
iex(7)> :chumak.send_multipart(sock2, ['a', 'b', 'hello from dealer socket'])
22:13:38.705 [error] GenServer #PID<0.172.0> terminating
** (FunctionClauseError) no function clause matching in :chumak_protocol.encode_more_message/3
(chumak) /home/tbrowne/code/elixir/chutest/deps/chumak/src/chumak_protocol.erl:676: :chumak_protocol.encode_more_message('a', :null, %{})
(stdlib) lists.erl:1354: :lists.mapfoldl/3
(chumak) /home/tbrowne/code/elixir/chutest/deps/chumak/src/chumak_protocol.erl:664: :chumak_protocol.encode_message_multipart/3
(chumak) /home/tbrowne/code/elixir/chutest/deps/chumak/src/chumak_peer.erl:159: :chumak_peer.handle_cast/2
(stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:686: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_cast", {:send, ['a', 'b', 'hello from dealer socket'], {#PID<0.160.0>, #Reference<0.79795089.2401763329.172383>}}}
State: {:state, :ready, '192.168.1.192', 8760, :client, [], :dealer, 'dealid', [], {3, 0}, #Port<0.4968>, {:decoder, :ready, 0, nil, nil, {:some, 3}, {:some, 0}, %{}, :null, false}, #PID<0.170.0>, {[], []}, [], false, false, false, :null, %{}}
22:13:38.710 [info] [:unhandled_handle_info, {:module, :chumak_socket}, {:msg, {:EXIT, #PID<0.172.0>, {:function_clause, [{:chumak_protocol, :encode_more_message, ['a', :null, %{}], [file: '/home/tbrowne/code/elixir/chutest/deps/chumak/src/chumak_protocol.erl', line: 676]}, {:lists, :mapfoldl, 3, [file: 'lists.erl', line: 1354]}, {:chumak_protocol, :encode_message_multipart, 3, [file: '/home/tbrowne/code/elixir/chutest/deps/chumak/src/chumak_protocol.erl', line: 664]}, {:chumak_peer, :handle_cast, 2, [file: '/home/tbrowne/code/elixir/chutest/deps/chumak/src/chumak_peer.erl', line: 159]}, {:gen_server, :try_dispatch, 4, [file: 'gen_server.erl', line: 616]}, {:gen_server, :handle_msg, 6, [file: 'gen_server.erl', line: 686]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}]}}}]
As you can see I get this huge error on the :chumak.send_multipart, while :chumak.send doesn't even work. What's going on here?
The dealer socket works fine by the way from the Python side:
import zmq
context = zmq.Context()
deal = context.socket(zmq.DEALER)
deal.setsockopt_string(zmq.IDENTITY, u"Thomas")
deal.connect("tcp://192.168.1.192:8760")
deal.send("hello from python deal")
Now on router side:
In [5]: xx = rout.recv_multipart()
In [6]: xx
Out[6]: ['reqid', '', 'hello from req socket']
In [7]: dd = rout.recv_multipart()
In [8]: dd
Out[8]: ['Thomas', 'hello from python deal']
So I'm wondering if I have a syntax, or type error, in my Elixir chumak dealer socket, or if it's simply a bug. I have tried this on both amd64 and armv7l architectures and the problem is identical.
All the elixir code is based on the Erlang version in the chumak example for dealer-router.
My mix.exs deps looks like this:
[
{:chumak, "~> 1.2"},
{:msgpack, "~> 0.7.0"}
]
The only obvious thing I see is your use of send_multipart. Its signature in the source:
-spec send_multipart(SocketPid::pid(), [Data::binary()]) -> ok.
You are calling it with single-quoted literals, which in Elixir are charlists rather than binaries:
:chumak.send_multipart(sock2, ['a', 'b', 'hello from dealer socket'])
iex(2)> is_binary('a')
false
iex(3)> is_binary('hello from dealer socket')
false
Double-quoted strings such as "a" are binaries, so try passing those instead. Otherwise, I cannot see much of a difference between your code and the example code that is in chumak's repo.

How to Parse with Commas in CSV file in Ruby

I am parsing a CSV file with Ruby and am having trouble because the delimiter is a comma, but my data also contains commas.
Portions of the data that contain commas are surrounded by "", but I am not sure how to make CSV ignore commas that are contained within quotation marks.
Example CSV Data (File.csv)
NCB 14591 BLK 13 LOT W IRR," 84.07 FT OF 25, ALL OF 26,",TWENTY-THREE SAC HOLDING COR
Example Code:
require 'csv'
CSV.foreach("File.csv", encoding:'iso-8859-1:utf-8', :quote_char => "\x00").each do |x|
puts x[1]
end
Current Output: " 84.07 FT OF 25
Expected Output: 84.07 FT OF 25, ALL OF 26,
Link to the gist to view the example file and code.
https://gist.github.com/markscoin/0d6c2d346d70fd627203317c5fe3097c
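Use the default quote character '"' rather than "\x00" (the force_quotes option only affects CSV output, so it is harmless here but not required):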
require 'csv'
CSV.foreach("data.csv", encoding:'iso-8859-1:utf-8', quote_char: '"', force_quotes: true).each do |x|
puts x[1]
end
Result:
84.07 FT OF 25, ALL OF 26,
The illegal quoting error is when a line has quotes, but they don't wrap the entire column, so for instance if you had a CSV that looks like:
NCB 14591 BLK 13 LOT W IRR," 84.07 FT OF 25, ALL OF 26,",TWENTY-THREE SAC HOLDING COR
NCB 14592 BLK 14 LOT W IRR,84.07 FT OF "25",TWENTY-FOUR SAC HOLDING COR
You could parse each line individually and change the quote character only for the lines that use bad quoting:
require 'csv'

def parse_file(file_name)
  File.foreach(file_name) do |line|
    parse_line(line) do |x|
      puts x.inspect
    end
  end
end

def parse_line(line)
  options = { encoding: 'iso-8859-1:utf-8' }
  begin
    # splat the hash so it is also accepted as keyword options on newer Rubies
    yield CSV.parse_line(line, **options)
  rescue CSV::MalformedCSVError
    # this line is misusing quotes, change the quote character and try again
    options.merge! quote_char: "\x00"
    retry
  end
end

parse_file('./File.csv')
and running this gives you:
["NCB 14591 BLK 13 LOT W IRR", " 84.07 FT OF 25, ALL OF 26,", "TWENTY-THREE SAC HOLDING COR"]
["NCB 14592 BLK 14 LOT W IRR", "84.07 FT OF \"25\"", "TWENTY-FOUR SAC HOLDING COR"]
However, if you have a mix of bad quoting and good quoting in a single row, this falls apart again. Ideally you just want to clean up the CSV so that it is valid.
