Double reading large CSV file [closed] - parsing

I have a large CSV that will not completely fit in memory, and I need to do a lot of work on it. I'm new to lazy sequences and don't know how to solve this. Right now I read the whole file into memory and then parse it, which I know is wrong.
Here's what I'm trying to do:
Read the header row and do things based on that. It's used throughout the program.
Read all the rows and gather summary data on each column.
Use the summary data to transform the original data and write a new file.
Is there a way to read in the header row and use it throughout the program without running into the "holding onto the head" issue with lazy sequences, which would keep the whole file in memory?
I found this related thread: using clojure-csv.core to parse a huge csv file

Clojure takes care of clearing local bindings, so once a binding will no longer be used, it is nulled to make it eligible for GC. So your code could look something like:
(require '[clojure.java.io :as io]     ; assuming clojure.data.csv for read-csv
         '[clojure.data.csv :as csv])

(defn gather-summary [file]
  (with-open [rdr (io/reader file)]
    (let [lines (csv/read-csv rdr)
          header (first lines)]
      (reduce (fn [so-far row]
                ;; placeholder summary logic; replace with real per-column stats
                (if header
                  (inc so-far)
                  (dec so-far)))
              0
              (rest lines)))))

(defn modify [summary file]
  ;; similar to gather-summary: stream the rows again and write the new file
  )

(defn process [file]
  (let [summary (gather-summary file)]
    (modify summary file)))
header doesn't hold onto the head because it is just the first element, which holds no reference to the rest of the lines.
lines is not used after the (rest lines) call, so Clojure will clear that binding.
reduce consumes the sequence one step at a time, so Clojure also takes care of not holding onto the head in that case.
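For comparison, the same two-pass, constant-memory pattern can be sketched in Python (illustrative only; the file path and the summary contents are placeholders):
import csv

def gather_summary(path):
    with open(path, newline='') as f:
        rows = csv.reader(f)
        header = next(rows)        # consume the header once
        count = 0
        for row in rows:           # stream the rest; nothing retains earlier rows
            count += 1
        return {'header': header, 'rows': count}

def process(path):
    summary = gather_summary(path)  # first pass: gather summary data
    # a second pass would re-open the file and write the transformed rows
    return summary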

Related

Please! I need anyone that can decode “Luraph Obfuscator” [closed]

I paid an untrusted developer for a script, and, as I suspected, he scammed me. He did send me code, but he obfuscated the script.
https://pastebin.com/Y9rn2Gdr
Every instruction is separated into functions, so the code can't be directly deobfuscated without specific details about its functionality.
This code consists of:
A string that contains the source of the script
Some bytes of the string represent an offset of that character in the ASCII table, while others represent functions and loop paradigms like for and while (note that these are separated into different functions within the interpreter)
An iterator function (the interpreter) that goes through every character in the string and calls other functions to find the correct action to perform based on the character.
The code outside the string is an interpreter. For deobfuscating the interpreter I suggest the following:
Take care with variable names: every variable in the interpreter has to be defined before it is used, so you can tell from context what each variable is for.
Solve the #{4093, 2039, 2140, 1294} tables by simply calculating the length (just as the # operator does); for that last table the result is 4.
You need a pretty printer that will apply indentation and format to the code, making it more readable
A pseudocode of the reader looks like this (I assume this is also nested within other functions of the interpreter):
-- ReadBytes is the main function that holds the interpreter and other functions.
-- string_gsub, string_sub, string_char and string_rep are assumed to be local
-- aliases of the corresponding string library functions.
local function ReadBytes(currentCharacter)
    local repeatOffset
    currentCharacter =
        string_gsub(
            string_sub(currentCharacter, 5),
            "..",
            function(digit)
                if string_sub(digit, 2) == 'H' then
                    -- an "nH" marker: remember to repeat the next byte n times
                    repeatOffset = tonumber(string_sub(digit, 1, 1))
                    return ""
                else
                    -- a plain hex pair decodes to a single character
                    local char = string_char(tonumber(digit, 16))
                    if repeatOffset then
                        local repeatOutput = string_rep(char, repeatOffset)
                        repeatOffset = nil
                        return repeatOutput
                    else
                        return char
                    end
                end
            end
        )
    . . . -- Other nested functions
end
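The same decoding step can be sketched in Python for clarity (the function name read_bytes and the exact marker layout are inferred from the pseudocode above, so treat this as an approximation):
import re

def read_bytes(current):
    out = []
    repeat = None
    # skip the first four characters, then walk the rest two characters at a
    # time, mirroring string_sub(currentCharacter, 5) and the ".." pattern
    for pair in re.findall('..', current[4:]):
        if pair[1] == 'H':
            repeat = int(pair[0])            # "nH" marker: repeat next byte n times
        elif repeat is not None:
            out.append(chr(int(pair, 16)) * repeat)
            repeat = None
        else:
            out.append(chr(int(pair, 16)))   # plain hex pair -> one character
    return ''.join(out)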
I have trouble understanding the functionality of the encoded string; however, judging from this question, this seems to be a ROBLOX script. Is that correct?
If that's the case, I recommend debugging the code within the ROBLOX environment to understand its core functionality and rewriting a readable alternative that works just like the original.
You can also deobfuscate the interpreter to understand how it works, then capture the interpreter's actions to see its workflow, and then write a Lua script that works exactly like the original and does not require the interpreter.

Why do we check if empty before we move? [closed]

I see a lot of code like this:
if WA > SPACES
   move WA to WA2
end-if
What is the advantage of this check? Is it more efficient to check whether the field is empty than to just move it anyway?
Additional information:
WA and WA2 can be simple structures (without fillers) but also just simple attributes. They are not redefined and are typed as chars or structures of chars. They can be either low-values (semantically NULL) or have alphanumeric content.
Nobody can tell you what the actual reason is except for the people who coded it, but here is a very probable reason:
Usually this is accompanied by an ELSE that covers what happens if the value is less than spaces, but in this case I would assume that whatever happens to this data later relies on that field NOT being LOW-VALUES or some funky non-displayable control character.
If I had to guess, I would assume that WA2 is initialized to spaces, so doing this check before the move ensures that nothing lower than spaces is moved to that variable. Remember, less than spaces does not mean empty; it just means that the hex values of that string are less than X'40' (for example, if the string were full of low-values, it would be all X'00'). So I would guess that it's more about ensuring that the data is valid than about efficiency.
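To make the byte-level comparison concrete, here is a small illustration in Python using the cp037 EBCDIC codec (the field contents are made up):
# EBCDIC (code page 037) space is X'40', so "WA > SPACES" compares the
# field's bytes against X'40...40'.
spaces = ' '.encode('cp037') * 4     # b'\x40\x40\x40\x40'
low_values = b'\x00' * 4             # LOW-VALUES: every byte X'00'
text = 'ABCD'.encode('cp037')        # ordinary alphanumeric content

print(low_values > spaces)           # False: low-values fail the check
print(text > spaces)                 # True: real data passes and gets moved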

Is one-hot encoding free of the dummy trap? [closed]

There is a thing called the dummy variable trap. When we encode a categorical column with 3 categories, let's say a, b, and c, with a one-hot encoder we get 3 columns (a, b, and c), but when we use get_dummies we can get 2 columns instead (a and b), which is safe from the dummy trap. Is one-hot encoding exposed to the dummy trap, or does it take care of it? Am I right? Which one is safe from the dummy trap? Or is it OK to use both without removing columns? I am using the dataset for many algorithms.
Looking for help. Thanks in advance.
OneHotEncoder cannot process string values directly. If your nominal features are strings, then you need to first map them into integers.
pandas.get_dummies is kind of the opposite. By default, it only converts string columns into one-hot representation, unless columns are specified.
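A minimal sketch of both approaches (illustrative data; this assumes an older scikit-learn where OneHotEncoder needs integer input, as described above). Note that get_dummies also produces all k columns by default and avoids the dummy trap only when drop_first=True is passed:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

s = pd.Series(['a', 'b', 'c', 'a'])

# drop_first=True drops one of the k columns, which avoids the dummy trap
print(pd.get_dummies(s, drop_first=True))             # columns: b, c

# OneHotEncoder keeps all k columns; map the strings to integers first
ints = LabelEncoder().fit_transform(s).reshape(-1, 1)
print(OneHotEncoder().fit_transform(ints).toarray())  # columns: a, b, c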

How to find the n most frequent words in a PDF file on Ubuntu? [closed]

I have various research papers (nearly 150) which are PDF files. I have to find the n most frequent words in these files.
These PDF files also contain figures and mathematical formulas. I know how to do it for a single text file with only words. I want to write a script which parses all 150 PDF files and then returns a list of the n most frequent words in these files.
I want a method to parse complicated PDF files (with words, figures and formulas).
Then I want to write a script which parses all files in a specific location on my PC and returns a list of the n most frequent words in all the PDF files combined.
1) Parse the PDF files with CAM::PDF.
2) Use split() in Perl on whitespace (spaces or tabs), for each PDF and each line inside it, to count every word:
$words{$_}++ for split /\s+/, $line;
3) At the end, sort %words by its numeric values (or iterate over and test each value) and take the top n entries.
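An alternative sketch in Python rather than Perl, assuming pdftotext from Ubuntu's poppler-utils package is installed (the papers directory is hypothetical):
import glob
import re
import subprocess
from collections import Counter

def top_words(directory, n):
    counts = Counter()
    for pdf in glob.glob(directory + '/*.pdf'):
        # 'pdftotext file -' writes the extracted text to stdout
        text = subprocess.run(['pdftotext', pdf, '-'],
                              capture_output=True, text=True).stdout
        counts.update(re.findall(r'[a-z]+', text.lower()))
    return counts.most_common(n)

print(top_words('papers', 10))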

Output routes and avoid the routes which the user inputs in Prolog [closed]

Output Prolog list paths, avoiding certain routes which the user inputs.
Hi, I'm working on a project. A building contains zones, and each zone has an exit. We want to evacuate people through the zones to the exits. The user inputs two parameters: the first one is the "infected zone", the other is the "zone of the people we want to evacuate".
The output should be all the safe routes from the "zone of the people we want to evacuate" to the exits, avoiding the infected zone.
for example:
user input (z11, z12): it means z11 is infected, and the people we want to evacuate are in z12.
output:
z12->z22->exit3
z12->z21->exit2
z12->elevators
the facts are:
path(z11,z12).
path(z12,z11).
path(z12,z22).
path(z12,z21).
path(z22,z12).
path(z22,z21).
path(z21,z22).
path(z11,exit1).
path(z12,elevators).
path(z21,exit2).
path(z22,exit3).
Please help me write the code.
It's inconvenient that you've chosen to name your predicate path/2 since we'd probably want to call the thing that generates a path to the exit with that name. So first I'd rename all your facts from path/2 to connected/2. Then you're going to want to annotate the exits:
exit(exit1).
exit(exit2).
exit(elevators).
Otherwise you'd have to hard-code them somewhere else.
A simple thing to do would be to solve the general path question and then check to ensure the path doesn't contain an infected site. That would look like this:
path(Start, Path) :- path(Start, Path, []).

path(Start, [Exit], Seen) :-
    exit(Exit),
    connected(Start, Exit),
    \+ memberchk(Exit, Seen).
path(Start, [Next|Rest], Seen) :-
    connected(Start, Next),
    \+ memberchk(Next, Seen),
    path(Next, Rest, [Next|Seen]).

safe_path(Start, Avoid, Path) :-
    path(Start, Path),
    \+ memberchk(Avoid, Path).
This easily generalizes to handle sets of avoid zones:
safe_path(Start, AvoidList, Path) :-
    path(Start, Path),
    forall(member(Avoid, AvoidList), \+ memberchk(Avoid, Path)).
The bulk of what's interesting and fun to do in Prolog is accomplished with a generate/test paradigm. The simplest and most direct formulation is usually one in which you generate too much (too generally, you might say) and put all the restrictions in the test. Generally speaking, you achieve better performance by making the generator more intelligent about generating possibilities, moving code from the "test" part into the "generate" part of "generate and test."
Usually the first problem you face is generating an infinite tree. This is particularly true with graphs. The memberchk/2 in path/3 with the Seen list serves to prevent looping back and is necessary to make the set of paths finite. Using exit/1 in the base case of path/3 also helps performance because we're not generating intermediate paths. It's nice that with your particular situation you can get away with this.
Doing the avoidance at the end is winnowing out chaff last. The generation doesn't know to avoid these nodes so all of the poisoned paths will get generated and removed by the test. If performance isn't sufficient this way, you can move that code into path/2 directly, doing a similar kind of check to the one done with the Seen list.
