Capture multiple blog posts and concatenate - parsing

I'm not sure what language fits this best (or if there's already a program for this), but here's what I basically want to do: given a URL, go to that page, capture the text between certain HTML tags (just once per page), then click the "Next" button and move on to the next page, repeating until finished. Then export the whole thing as a .pdf or something similar (a .txt would even work). It'd be nice if the program could print a horizontal rule between posts, but that's not required.
I only need this to work once and, in fact, here's the blog I want to copy the posts from: http://www.trailjournals.com/entry.cfm?id=336394 (I basically just don't want to spend the time clicking through all of them).
I know some JavaScript, some basic regex, and some HTML, along with a couple of others that aren't really applicable here (and I'm a quick learner), so I'm here to learn, not just asking for someone to do something for me.
Thanks!

There's probably a better way to do it, but since I'm an engineering student (and Matlab is currently the programming language I'm best at), I decided to see if I could do it in Matlab. And it worked.
Granted, some things could probably have been done better (I don't really know regex that well, so I used a lot of "findstr" instead).
% Requires cprintf (from the MATLAB File Exchange) for the colored messages.
clear;
clc;

% Truncate any previous output file, then reopen it for appending.
fid = fopen('journals.txt', 'w');
fclose(fid);
fid = fopen('journals.txt', 'a');

id = input('Enter the starting id number: ', 's');

while true
    clc;
    url = strcat('http://www.trailjournals.com/entry.cfm?id=', id)
    strContents = urlread(url);

    % The entry text starts just after the first </TABLE> tag and ends
    % just before the last <p> tag on the page.
    f = findstr('</TABLE>', strContents);
    f = f(1) + 13;
    l = findstr('<p>', strContents);
    l = l(end) - 5;

    % Some entries are wrapped in a <blockquote> instead.
    if f > l
        f = findstr('<blockquote>', strContents);
        f = f(1) + 14;
    end

    p = strContents(f:l)
    if isempty(p)
        cprintf('red', 'EMPTY ENTRY!\n');
        return;
    end

    % Optional manual correction if an extraction looks wrong:
    % disp(p);
    % disp('------------');
    % ques = input('Does this look good? (y/n): ', 's');
    % disp('------------');
    %
    % while ques == 'n'
    %     firstword = input('Enter the first word: ', 's');
    %     lastword = input('Enter the last word: ', 's');
    %     f = findstr(firstword, strContents);
    %     l = findstr(lastword, strContents);
    %     p = strContents(f:l + length(lastword));
    %     disp(p);
    %     disp('------------');
    %     ques = input('Does this look good? (y/n): ', 's');
    %     disp('------------');
    % end

    % Use an explicit '%s' format so any % signs in the entry are written
    % literally instead of being treated as format specifiers.
    fprintf(fid, '%s', p);
    fprintf(fid, '\n');
    fprintf(fid, '\r\n\r\n-------------------------------------------\r\n\r\n');

    % The id of the next entry is the six digits just before '">Next</a>'.
    next = findstr('">Next</a>', strContents);
    if isempty(next)
        break;  % no Next link, so this was the last entry
    end
    next = next(1);
    id = strContents(next-6:next-1);
end

fclose('all');
cprintf('Green', 'The process has been completed\n');
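For comparison, the same scrape-and-follow-Next loop is only a dozen or so lines of Python. This is a rough, untested sketch that makes the same assumptions about the page markup as the Matlab code above (entry body between the first </TABLE> and the last <p>, next entry id in the digits just before '">Next</a>'):

import re
import urllib.request

BASE = 'http://www.trailjournals.com/entry.cfm?id='

def scrape(start_id, out_path='journals.txt'):
    # untested sketch; assumes the markup patterns used in the Matlab version
    entry_id = start_id
    with open(out_path, 'w') as out:
        while entry_id:
            html = urllib.request.urlopen(BASE + entry_id).read().decode('latin-1')
            # body sits between the first </TABLE> and the last <p>
            start = html.find('</TABLE>') + len('</TABLE>')
            end = html.rfind('<p>')
            out.write(html[start:end].strip())
            out.write('\n\n-------------------------------------------\n\n')
            # the next entry id is the run of digits just before '">Next</a>'
            m = re.search(r'(\d+)">Next</a>', html)
            entry_id = m.group(1) if m else None

scrape('336394')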


Google sheets, summing over a formula?

Apologies if this is a little unclear, or this question has been asked already. It's a little difficult to explain, but I've bolded my question - it's essentially about shortening formulas.
I'm running a payment plan waterfall model. My formula works, but it's, er, you know...
IF($K2=Q$1,Assumptions!$E$56*($N2+$O2),IF(AND($K2<Q$1,$M2>Q$1),Assumptions!$E$56*$P2/$L2,0)) + IF(edate($K2,1)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,1)<Q$1,edate($M2,1)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,2)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,2)<Q$1,edate($M2,2)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,3)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,3)<Q$1,edate($M2,3)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,4)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,4)<Q$1,edate($M2,4)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,5)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,5)<Q$1,edate($M2,5)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,6)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,6)<Q$1,edate($M2,6)>Q$1),Assumptions!$F$56*$P2/$L2,0))
...pretty long.
Essentially what's going on is: the assumption is that when we launch a product, we sell say 80% in the first month, and 2.5% every subsequent month until we reach 100%.
I'd like the 80% and the 2.5% to be variables (Assumptions!$E$56 and Assumptions!$F$56 here).
Obviously a little long. But notice that after the first IF clause, the subsequent ones are actually identical; the only difference is the number inside edate(__,2), edate(__,3)...
So my question is - can this formula be tidied up into some sort of for loop? Python would make it pretty simple to increment the variable in edate(__,i) and sum over i = 1:6.
Sure, it can. Looping is usually emulated with SEQUENCE(N), which makes a vertical array of the numbers 1 to N, somewhat like your Python range. Then you can do stuff to it inside an ArrayFormula.
In your case, you end up with two parts: your initial term using $E$56, and all the looped terms using $F$56. I see 6 of those, so I will use SEQUENCE(6):
=IF(
$K2=Q$1,
Assumptions!$E$56*($N2+$O2),
IF(
AND(
$K2<Q$1,
$M2>Q$1
),
Assumptions!$E$56*$P2/$L2,
0
)
) + ArrayFormula(SUM(
IF(
edate($K2,SEQUENCE(6))=Q$1,
Assumptions!$F$56*($N2+$O2),
IF(
(edate($K2,SEQUENCE(6))<Q$1)*
(edate($M2,SEQUENCE(6))>Q$1),
Assumptions!$F$56*$P2/$L2,
0
)
)
))
And if you want, you can give your Assumption values names using named ranges.
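Since you mentioned Python: purely for intuition, here is a hypothetical translation of the formula into a Python loop. The parameter names are made up to mirror the cell references, and relativedelta stands in for EDATE; this isn't code for the actual sheet, just a picture of the structure that SEQUENCE(6) plus SUM emulates.

from dateutil.relativedelta import relativedelta  # EDATE equivalent

def waterfall(q, k, m, n, o, p, l, e56, f56):
    # q ~ Q$1, k ~ $K2, m ~ $M2, n/o/p/l ~ $N2/$O2/$P2/$L2,
    # e56 ~ Assumptions!$E$56 (the 80%), f56 ~ Assumptions!$F$56 (the 2.5%)
    def term(shift, rate):
        k_s = k + relativedelta(months=shift)  # EDATE($K2, shift)
        m_s = m + relativedelta(months=shift)  # EDATE($M2, shift)
        if k_s == q:
            return rate * (n + o)
        if k_s < q and m_s > q:
            return rate * p / l
        return 0.0
    # the initial $E$56 term, plus the six looped $F$56 terms
    return term(0, e56) + sum(term(i, f56) for i in range(1, 7))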

spaCy: optimizing tokenization

I'm currently trying to tokenize a text file where each line is the body text of a tweet:
"According to data reported to FINRA, short volume percent for $SALT clocked in at 39.19% on 12-29-17 http://www.volumebot.com/?s=SALT"
"#Good2go #krueb The chart I posted definitely supports ng going lower. Gobstopper' 2.12, might even be conservative."
"#Crypt0Fortune Its not dumping as bad as it used to...."
"$XVG.X LOL. Someone just triggered a cascade of stop-loss orders and scooped up morons' coins. Oldest trick in the stock trader's book."
The file is 59,397 lines long (a day's worth of data) and I'm using spaCy for pre-processing/tokenization. It's currently taking around 8.5 minutes, and I was wondering whether there is any way to make the following code quicker, as 8.5 minutes seems awfully long for this process:
import time
from datetime import timedelta
from os import listdir
from os.path import isfile, join

import spacy

nlp = spacy.load('en')

def token_loop(path):
    store = []
    files = [f for f in listdir(path) if isfile(join(path, f))]
    start_time = time.monotonic()
    for filename in files:
        with open(join(path, filename)) as f:
            for line in f:
                tokens = nlp(line.lower())
                tokens = [token.lemma_ for token in tokens
                          if not token.orth_.isspace() and token.is_alpha
                          and not token.is_stop and len(token.orth_) != 1]
                store.append(tokens)
    end_time = time.monotonic()
    print("Time taken to tokenize:", timedelta(seconds=end_time - start_time))
    return store
Although it says files, it's currently only looping over 1 file.
Just to note, I only need this to tokenize the content; I don't need any extra tagging etc.
It sounds like you haven't optimised the pipeline yet. You'll get a significant speed up from disabling the pipeline components you don't need, like so:
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])
This should get you down to about the two-minute mark, or better, on its own.
If you need a further speed up, you can look at multi-threading using nlp.pipe. Docs for multi-threading are here:
https://spacy.io/usage/processing-pipelines#section-multithreading
You can use nlp.pipe(all_lines) instead of nlp(line) for faster processing.
See spaCy's documentation: https://spacy.io/usage/processing-pipelines
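Putting the two suggestions together, a minimal sketch; the file name and batch_size are placeholders, and the token filtering matches the original loop:

import spacy

nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])

def tokenize_lines(lines):
    store = []
    # nlp.pipe streams the lines through the pipeline in batches instead of
    # paying the per-call overhead of nlp(line) for every tweet
    for doc in nlp.pipe(lines, batch_size=1000):
        store.append([t.lemma_ for t in doc
                      if t.is_alpha and not t.is_stop and len(t.orth_) > 1])
    return store

with open('tweets.txt') as f:  # placeholder file name
    tokens = tokenize_lines(line.lower() for line in f)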

Project Euler #3 Ruby Solution - What is wrong with my code?

This is my code:
def is_prime(i)
  j = 2
  while j < i do
    if i % j == 0
      return false
    end
    j += 1
  end
  true
end

i = 600851475143 / 2
while i >= 0 do
  if (600851475143 % i == 0) && is_prime(i)
    largest_prime = i
    break
  end
  i -= 1
end
puts largest_prime
Why is it not returning anything? Is it too large a calculation to go through all the numbers? Is there a simple way of doing it without using the Ruby prime library (which defeats the purpose)?
All the solutions I found online were too advanced for me; does anyone have a solution that a beginner would be able to understand?
"premature optimization is (the root of all) evil". :)
You're going right away for the factor that is (1) biggest and (2) prime. How about first finding all the factors, prime or not, and then taking the last (biggest) of them that is prime? Once we've solved that, we can start optimizing.
A factor a of a number n is such that there exists some b with a * b = n (we assume a <= b to avoid duplication). But that means a*a <= a*b = n, i.e. a never exceeds the square root of n.
So, for each b = n/2, n/2-1, ..., the potential corresponding factor is known automatically as a = n / b; there's no need to test a for divisibility at all ... and perhaps you can figure out which of the as don't have to be tested for primality either.
Lastly, if p is the smallest prime factor of n, then the prime factors of n are p and all the prime factors of n / p. Right?
Now you can complete the task.
update: you can find more discussion and a pseudocode of sorts here. Also, search for "600851475143" here on Stack Overflow.
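To make the repeated-division idea concrete, here is a minimal sketch (in Python rather than Ruby, purely as illustration): divide out each factor as it's found, starting from the smallest, and the largest prime factor is what survives.

def largest_prime_factor(n):
    d, last = 2, 1
    while d * d <= n:
        while n % d == 0:
            last = d      # d is prime here: all smaller factors are gone
            n //= d
        d += 1
    # whatever remains above 1 is a prime bigger than any divisor found so far
    return n if n > 1 else last

print(largest_prime_factor(600851475143))  # prints 6857 almost instantly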
I'll address not so much the answer, but how YOU can pursue the answer.
The most elegant troubleshooting approach is to use a debugger to get insight as to what is actually happening: How do I debug Ruby scripts?
That said, I rarely use a debugger -- I just stick in puts here and there to see what's going on.
Start with adding puts "testing #{i}" as the first line inside the loop. While the screen I/O will be a million times slower than a silent calculation, it will at least give you confidence that it's doing what you think it's doing, and perhaps some insight into how long the whole problem will take. Or it may reveal an error, such as the counter not changing, incrementing in the wrong direction, overshooting the break conditional, etc. Basic sanity check stuff.
If that doesn't set off a lightbulb, go deeper and puts inside the if statement. No revelations yet? Next puts inside is_prime(), then inside is_prime()'s loop. You get the idea.
Also, there's no reason in the world to start with 600851475143 during development! 17, 51, 100 and 1024 will work just as well. (And don't forget edge cases like 0, 1, 2, -1 and such, just for fun.) These will all complete before your finger is off the enter key -- or demonstrate that your algorithm truly never returns and send you back to the drawing board.
Use these two approaches and I'm sure you'll find your answers in a minute or two. Good luck!
Do you know you can solve this with one line of code in Ruby?
require 'prime'
Prime.prime_division(600851475143).flatten.max
=> 6857

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (e.g. questions with more than one response). In the end I want to parse these space-delimited entries into their own columns and create headers for them so I know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6, , , , 3, 1, 1, 0, 0, 0
I imagine this is a relatively common issue to deal with, but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible response. However, if munging your dataset into a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
processed_q2
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of responses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you are trying to do, or what your reasons are for coding it like this. Thus my advice is more general, so just feel free to clarify and I will try to give a more concrete response.
1) You say that you are coding the survey on your own, which is great because it means you have influence on your .csv file. I would NEVER use different kinds of separation in the same .csv file. Just do the naming from the very beginning, like you suggested in your second block.
Otherwise you might get into trouble with checkboxes, for example. Let's say someone checks 3 out of 5 possible answers, and the next person only checks 1 (i.e. "don't know"). Now it will be much harder to create a spreadsheet (data.frame) type of results view, as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to do a panel survey (i.e. a longitudinal study asking the same participants over and over again). That (among many others) would be a good reason to think about saving your data to a MySQL database instead of a .csv. RMySQL can connect directly to the database and access its tables and, more importantly, its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.
To avoid re-inventing the wheel you might want to check out LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for running surveys. Besides that, the TYPO3 CMS extensions pbsurvey and ke_questionnaire should work well too (I've only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
Now you have a data.frame of 0s, so what you need to do is put 1s in the appropriate positions (this is a bit cumbersome); a function will do the job...
# x is one multi-select cell, e.g. "3 5 99"; change spaces to commas and
# evaluate the result as a numeric vector
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assign 1s according to the indexes in z
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you may want to reconsider how you create the data.frame with the _M variables, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix of logicals and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame and m is the matrix of logicals.
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
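If the logical-matrix idiom is unfamiliar, here is the same idea in Python/NumPy terms (toy data, purely illustrative):

import numpy as np

# rows = respondents, cols = answer alternatives; mask marks the selections
ind = np.zeros((2, 5), dtype=int)
mask = np.array([[False, False, True,  True,  True],    # a "3 5 99"-style row
                 [True,  True,  False, False, False]])  # a "1 2"-style row
ind[mask] = 1   # same effect as dtf[m] <- 1 in R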
Thanks for all the responses. I agree with most of you that this format is kind of silly, but it is what I have to work with (the survey is coded and going into use next week). Here is what I came up with based on your answers. I am sure it is not the most elegant or efficient way to do it, but I think it should work.
colnums <- grep("_M", colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
    vec <- as.vector(dat[, i])                    # turn the column into a vector
    b <- lapply(strsplit(vec, " "), as.numeric)   # split up and make numeric
    vals <- sort(unique(unlist(b)))               # which values were used
    newcolnames <- paste(colnames(dat)[i], "_", vals, sep = "")
    e <- matrix(nrow = responses, ncol = length(vals))  # indicator matrix
    colnames(e) <- newcolnames
    # look through the responses and put indicators in the correct places
    for (j in 1:responses) {
        e[j, ] <- ifelse(vals %in% b[[j]], 1, 0)
    }
    dat <- cbind(dat, e)
}
Suggestions for improvement are welcome.
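One more aside for anyone not tied to R: pandas has this exact transformation built in as str.get_dummies, which splits each cell on a separator and returns indicator columns. A sketch using the example data from the question (note that empty responses come out as 0s rather than NAs):

import pandas as pd

df = pd.DataFrame({'Q1': [6, 6],
                   'Q2_M': ['1 2 88', ''],
                   'Q3': [3, 3],
                   'Q4_M': ['3 5 99', '1 2']})

for col in [c for c in df.columns if c.endswith('_M')]:
    # one indicator column per distinct response, e.g. Q2_M_1, Q2_M_2, Q2_M_88
    dummies = df[col].str.get_dummies(sep=' ').add_prefix(col + '_')
    df = df.join(dummies).drop(columns=col)

print(df)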

Erlang: What is most-wrong with this trie implementation?

Over the holidays, my family loves to play Boggle. Problem is, I'm terrible at Boggle. So I did what any good programmer would do: wrote a program to play for me.
At the core of the algorithm is a simple prefix trie, where each node is a dict of references to the next letters.
This is the trie:add implementation:
add([], Trie) ->
    dict:store(stop, true, Trie);
add([Ch|Rest], Trie) ->
    % setdefault(Key, Default, Dict) ->
    %     case dict:find(Key, Dict) of
    %         { ok, Val } -> { Dict, Val };
    %         error -> { Dict, Default }
    %     end.
    { NewTrie, SubTrie } = setdefault(Ch, dict:new(), Trie),
    NewSubTrie = add(Rest, SubTrie),
    dict:store(Ch, NewSubTrie, NewTrie).
And you can see the rest, along with an example of how it's used (at the bottom), here:
http://gist.github.com/263513
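(For readers who don't know Erlang's dict module, the same insertion logic looks like this in Python's mutable-dict terms; dict.setdefault plays the role of the setdefault helper sketched in the comment above.)

def add(word, trie):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})  # descend, creating nodes as needed
    node['stop'] = True                 # mark the end of a complete word
    return trie

trie = {}
for w in ('zed', 'zebra'):
    add(w, trie)
# trie == {'z': {'e': {'d': {'stop': True},
#                      'b': {'r': {'a': {'stop': True}}}}}}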
Now, this being my first serious program in Erlang, I know there are probably a bunch of things wrong with it… But my immediate concern is that it uses 800 megabytes of RAM.
So, what am I doing most-wrong? And how might I make it a bit less-wrong?
You could implement this functionality by simply storing the words in an ets table:
% create table; add words
> ets:new(words, [named_table, set]).
> ets:insert(words, [{"zed"}]).
> ets:insert(words, [{"zebra"}]).
% check if word exists
> ets:lookup(words, "zed").
[{"zed"}]
% check if "ze" has a continuation among the words
> ets:match(words, {"ze" ++ '$1'}).
[["d"],["bra"]]
If a trie is a must, but you can live with a non-functional approach, then you can try digraphs, as Paul already suggested.
If you want to stay functional, you might save some bytes of memory by using structures using less memory, for example proplists, or records, such as -record(node, {a,b,....,x,y,z}).
I don't remember how much memory a dict takes, but let's estimate. You have 2.5e6 characters and 2e5 words. If your trie had no sharing at all, that would take 2.7e6 associations in the dicts (one for each character and each 'stop' symbol). A simple purely-functional dict representation would take maybe 4 words per association -- it could be less, but I'm trying to get an upper bound. On a 64-bit machine, that'd take 8*4*2.7 million bytes, or 86 megabytes. That's only a tenth of your 800M, so something's surely wrong here.
Update: dict.erl represents dicts with a hashtable; this implies lots of overhead when you have a lot of very small dicts, as you do. I'd try changing your code to use the proplists module, which ought to match my calculations above.
An alternative way to solve the problem is to go through the word list and see whether each word can be constructed from the dice. That way you need very little RAM, and it might be more fun to code (optimization and concurrency).
Look into DAWGs. They're much more compact than tries.
I don't know about your algorithm, but if you're storing that much data, maybe you should look into using Erlang's built-in digraph library to represent your trie, instead of so many dicts.
http://www.erlang.org/doc/man/digraph.html
If all words are in English, and the case doesn't matter, all characters can be encoded by numbers from 1 to 26 (and in fact, in Erlang they are numbers from 97 to 122), reserving 0 for stop. So you can use the array module as well.
