python xlrd: convert xls to csv using tempfiles - the tempfile is empty when parsing

I am downloading an xls file from the internet. It is in .xls format, but I need 'Sheet1' to be in csv format. I use xlrd to make the conversion, but I seem to have run into an issue where the file I write to is empty.
import urllib2
import tempfile
import csv
import xlrd
url_2_fetch = ____
u = urllib2.urlopen(url_2_fetch)
wb = xlrd.open_workbook(file_contents=u.read())
sh = wb.sheet_by_name('Sheet1')
csv_temp_file = tempfile.TemporaryFile()
with open('csv_temp_file', 'wb') as f:
    writer = csv.writer(f)
    for rownum in xrange(sh.nrows):
        writer.writerow(sh.row_values(rownum))
That seemed to have worked. But now I want to inspect the values by doing the following:
with open('csv_temp_file', 'rb') as z:
    reader = csv.reader(z)
    for row in reader:
        print row
But I get nothing:
>>> with open('csv_temp_file', 'rb') as z:
...     reader = csv.reader(z)
...     for row in reader:
...         print row
...
>>>
I am using a tempfile because I want to do more parsing of the content and then use SQLAlchemy to store the contents of the csv, after that further parsing, in a MySQL database.
I appreciate the help. Thank you.

This is completely wrong:
csv_temp_file = tempfile.TemporaryFile()
with open('csv_temp_file', 'wb') as f:
    writer = csv.writer(f)
The tempfile.TemporaryFile() call returns "a file-like object that can be used as a temporary storage area. The file will be destroyed as soon as it is closed (including an implicit close when the object is garbage collected)."
So your variable csv_temp_file contains a file object, already open, that you can read and write to, and will be deleted as soon as you call .close() on it, overwrite the variable, or cleanly exit the program.
So far so good. But then you proceed to open a second, unrelated file with open('csv_temp_file', 'wb'). That one is not a temporary file: it is created in the script's current directory with the literal name 'csv_temp_file', it is overwritten every time the script runs, it invites strange bugs, security holes and race conditions, and it has nothing to do with the variable csv_temp_file.
You should trash the with open statement and use the csv_temp_file variable you already have. You can try calling .seek(0) on it before using it again with the csv reader; it should work. Call .close() on it when you are done with it and the temporary file will be deleted.
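A minimal sketch of the corrected flow, keeping the question's Python 2 / xlrd setup and assuming url_2_fetch is already set as in the question:
import urllib2
import tempfile
import csv
import xlrd

u = urllib2.urlopen(url_2_fetch)
wb = xlrd.open_workbook(file_contents=u.read())
sh = wb.sheet_by_name('Sheet1')

# write straight into the already-open temporary file object
csv_temp_file = tempfile.TemporaryFile()
writer = csv.writer(csv_temp_file)
for rownum in xrange(sh.nrows):
    writer.writerow(sh.row_values(rownum))

# rewind the same file object before reading it back
csv_temp_file.seek(0)
reader = csv.reader(csv_temp_file)
for row in reader:
    print row

csv_temp_file.close()  # the temporary file is deleted here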

Related

Using io.tmpfile() with a shell command, run via io.popen, in Lua?

I'm using Lua in Scite on Windows, but hopefully this is a general Lua question.
Let's say I want to write some temporary string content to a temporary file in Lua - a file that will eventually be read by another program - and I tried using io.tmpfile():
mytmpfile = assert( io.tmpfile() )
mytmpfile:write( MYTMPTEXT )
mytmpfile:seek("set", 0) -- back to start
print("mytmpfile" .. mytmpfile .. "<<<")
mytmpfile:close()
I like io.tmpfile() because it is noted in https://www.lua.org/pil/21.3.html :
The tmpfile function returns a handle for a temporary file, open in read/write mode. That file is automatically removed (deleted) when your program ends.
However, when I try to print mytmpfile, I get:
C:\Users\ME/sciteLuaFunctions.lua:956: attempt to concatenate a FILE* value (global 'mytmpfile')
>Lua: error occurred while processing command
I got the explanation for that here, in Re: path for io.tmpfile()?:
how do I get the path used to generate the temp file created by io.tmpfile()
You can't. The whole point of tmpfile is to give you a file handle without giving you the file name, to avoid race conditions.
And indeed, on some OSes, the file has no name.
So, it will not be possible for me to use the filename of the tmpfile in a command line that should be run by the OS, as in:
f = io.popen("python myprog.py " .. mytmpfile)
So my questions are:
Would it be somehow possible to specify this tmpfile file handle as the input argument for the externally run program/script, say in io.popen - instead of using the (non-existent) tmpfile filename?
If above is not possible, what is the next best option (in terms of not having to maintain it, i.e. not having to remember to delete the file) for opening a temporary file in Lua?
You can get a temp filename with os.tmpname.
local n = os.tmpname()
local f = io.open(n, 'w+b')
f:write(....)
f:close()
os.remove(n)
If your purpose is sending some data to a Python script, you can also use 'w' mode in popen:
--lua
local f = io.popen(prog, 'w')
f:write(....)
#python
import sys
data = sys.stdin.readline()
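For completeness, a hedged sketch of the Python side of that popen approach (myprog.py is the script name used earlier in the question; here it reads everything from stdin rather than a single line):
# myprog.py - launched from Lua with io.popen("python myprog.py", 'w')
import sys

# everything the Lua side sent via f:write(...) arrives on stdin
data = sys.stdin.read()
sys.stderr.write("received %d bytes\n" % len(data))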

Why does the Lua function io.write() not work? It only displays the results in the terminal, rather than writing them to a file

I am learning the Lua IO library. I'm having trouble with io.write(). In Programming Design in Lua, there is a piece of code that iterates through a file line by line and prefixes each line with a serial number.
This is the file I'm working on:
test file: "iotest.txt"
This is my code
io.input("iotest.txt")
-- io.output("iotest.txt")
local count = 0
for line in io.lines() do
    count = count + 1
    io.write(string.format("%6d ", count), line, "\n")
end
This is what the terminal displays (screenshot: the results in the terminal), but the result is never written to the file, whether I add io.output("iotest.txt") or not. The file itself shows no change after running the code (screenshot: the result after the code runs).
Just add io.flush() after your write operations to save the data to the file.
io.input("iotest.txt")
io.output("iotestout.txt")
local count = 0
for line in io.lines() do
    count = count + 1
    io.write(string.format("%6d ", count), line, "\n")
end
io.flush()
io.close()
Refer to Lua 5.4 Reference Manual : 6.8 - Input and Output Facilities
io.flush() will save any written data to the output file that you set with io.output().
See koyaanisqatsi's answer for the optional use of file handles. This becomes especially useful if you're working on multiple files at a time and gives you more control on how to interact with the file.
That said you should also have different files for input and output. You'll agree that it doesn't make sense to read and write from and to the same file alternatingly.
For writing to a file you need a file handle.
This handle comes from: io.open()
See: https://www.lua.org/manual/5.4/manual.html#6.8
A file handle has methods that act on itself; that is what the : syntax on a file handle calls. So io.write() writes to stdout, while file:write() writes to a file.
Example function that can dump a defined function to a file...
fdump = function(func, path)
    assert(type(func) == "function")
    assert(type(path) == "string")
    -- Get the file handle (file)
    local file, err = io.open(path, "wb")
    assert(file, err)
    local chunk = string.dump(func, true)
    file:write(chunk)
    file:flush()
    file:close()
    return 'DONE'
end
Here are the methods, taken from io.stdin
close = function: 0x566032b0
seek = function: 0x566045f0
flush = function: 0x56603d10
setvbuf = function: 0x56604240
write = function: 0x56603e70
lines = function: 0x566040c0
read = function: 0x56603c90
This makes it possible to use a handle directly, like...
( Lua console: lua -i )
> do io.stdout:write('Input: ') local result=io.stdin:read() return result end
Input: d
d
You are trying to open the same file for reading and writing at the same time. You cannot do that.
There are two possible solutions:
1. Read from file X, iterate through it and write the result to another file Y.
2. Read the complete file X into memory, close file X, then delete file X, open the same filename for writing and write to it while iterating through the original file (in memory).
Otherwise, your approach is correct although file operations in Lua are more often done using io.open() and file handles instead of io.write() and io.read().

Nifi: How to concatenate flowfile to already existing tables in a directory?

This is a question about Nifi.
I made Nifi pipeline to convert flowfile with xml format to csv format.
Now, I would like to concatenate or union the converted csv flowfile to existing tables by filename (which stands for table name as well).
Simply put, my processor flow is following.
1. GetFile (from a particular directory) -> 2. Convert xml to csv -> 3. Update the flowfile with the table name -> 4. PutFile (to a different directory)
But, at the end of the flow, PutFile processor throws an error, saying "file with the same name already exists".
I have no idea how a flowfile can be appended to an existing csv table.
Any advice, tips, ideas are appreciated.
Thank you in advance.
There is no built-in support for appending to a file; however, you could use ExecuteGroovyScript to do it:
def ff = session.get()
if (!ff) return
ff.read().withStream { s ->
    String path = "./out_folder/${ff.filename}"
    // sync on file path to avoid conflict on same file writing (hope)
    synchronized (path) {
        new File(path).append(s)
    }
}
REL_SUCCESS << ff
If you need to work with text (reader) content rather than byte (stream) content, the following example shows how to skip one header line from the flowfile if the destination file already exists:
def ff = session.get()
if (!ff) return
ff.read().withReader("UTF-8") { r ->
    String path = "./.data/${ff.filename}"
    // sync on file path to avoid conflict on same file writing (hope)
    synchronized (path) {
        def fout = new File(path)
        if (fout.exists()) r.readLine() // skip 1 line (header) only if out file already exists
        fout.append(r) // append the rest of the reader content to the file
    }
}
REL_SUCCESS << ff

Apache Beam / GCP Dataflow encoding issue

i am "playing" with apache beam/dataflow in datalab.
I am trying to read a csv file from gcs.
when i create the pcollection using:
lines = p | 'ReadMyFile' >> beam.io.ReadFromText('gs://' + BUCKET_NAME + '/' + input_file, coder='StrUtf8Coder')
I get the following error:
LookupError: unknown encoding: "THE","NAME","OF","COLUMNS"
It seems the column names are being interpreted as the encoding?
I do not understand what's wrong.
If I do not specify the "coder" I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 1045: invalid continuation byte
Outside Apache Beam I am able to handle this error by reading the file from GCS:
blob = storage.Blob(gs_path, bucket)
data = blob.download_as_string()
data.decode('utf-8', 'ignore')
I read that Apache Beam only supports UTF-8, and the file does not contain only UTF-8.
Should I download and then convert to pcollection?
Any suggestion?
A possible hack is to create a class that inherits from the Coder class (apache_beam.coders.coders.Coder)
from apache_beam.coders.coders import Coder

class ISOCoder(Coder):
    """A coder used for reading and writing strings as ISO-8859-1."""

    def encode(self, value):
        return value.encode('iso-8859-1')

    def decode(self, value):
        return value.decode('iso-8859-1')

    def is_deterministic(self):
        return True
and pass it as an argument to the ReadFromText IO transform (apache_beam.io.textio.ReadFromText) provided by Beam, like this:
from apache_beam.io import ReadFromText

with beam.Pipeline(options=pipeline_options) as p:
    new_pcollection = (p | 'Read From GCS' >>
                       beam.io.ReadFromText('input_file', coder=ISOCoder()))
The logic behind this is detailed here:
https://medium.com/@khushboo_16578/cloud-dataflow-and-iso-8859-1-2bb8763cc7c8
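If you would rather keep the question's decode('utf-8', 'ignore') workaround inside the pipeline instead of switching to ISO-8859-1, the same Coder pattern can be adapted. A sketch, not tested on Dataflow; the class name is illustrative:
from apache_beam.coders.coders import Coder

class LenientUtf8Coder(Coder):
    """Decodes UTF-8 and silently drops invalid bytes, mirroring the
    question's data.decode('utf-8', 'ignore') workaround."""

    def encode(self, value):
        return value.encode('utf-8')

    def decode(self, value):
        return value.decode('utf-8', 'ignore')

    def is_deterministic(self):
        return True
Pass it to ReadFromText exactly as with ISOCoder above.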
I would suggest changing the encoding of the actual file. If you save the file with "Save As" you can select UTF-8 encoding, both for Excel CSVs and for regular .txt. Once you do that, make sure you add a line of code like:
class DoWork(beam.DoFn):
    def process(self, text):
        text = text.encode('utf-8')
        # Do other stuff
This isn't how I would like to do it because it isn't code-centric, but it has worked for me before. Unfortunately, I don't have a code-centric solution.

Reading large csv files in a Rails app takes up a lot of memory - strategy to reduce memory consumption?

I have a Rails app which allows users to upload csv files and schedule the reading of multiple csv files with the help of the delayed_job gem. The problem is that the app reads each file in its entirety into memory and then writes to the database. If just one file is being read it's fine, but when multiple files are read the RAM on the server fills up and causes the app to hang.
I am trying to find a solution for this problem.
One solution I researched is to break the csv file into smaller parts, save them on the server, and read the smaller files; see this link.
example: split -b 40k myfile segment
Not my preferred solution. Are there any other approaches to solve this where I don't have to break up the file? Solutions must be Ruby code.
Thanks,
You can make use of CSV.foreach to read your CSV file one row at a time instead of loading it all into memory:
path = Rails.root.join('data/uploads/.../upload.csv') # or, whatever
CSV.foreach(path) do |row|
  # process row[i] here
end
If it's run in a background job, you could additionally call GC.start every n rows.
How it works
CSV.foreach operates on an IO stream, as you can see here:
def IO.foreach(path, options = Hash.new, &block)
  # ...
  open(path, options) do |csv|
    csv.each(&block)
  end
end
The csv.each part is a call to IO#each, which reads the file line by line (an rb_io_getline_1 invocation) and leaves each line read to be garbage collected:
static VALUE
rb_io_each_line(int argc, VALUE *argv, VALUE io)
{
    // ...
    while (!NIL_P(str = rb_io_getline_1(rs, limit, io))) {
        rb_yield(str);
    }
    // ...
}
