Using Context-Free Grammar To Parse Options Spread Order Strings? - parsing

I need to create a tool that reads in an options spread order in string format and spits it out in human readable format.
Examples:
Input:
BUY +6 VERTICAL LUV 100 (Weeklys) 28 AUG 20 37.5/36.5 PUT #.49 LMT
Output:
VERTICAL
BUY +6 LUV 28 AUG 20 (Weeklys) 37.5 PUT
SELL -6 LUV 28 AUG 20 (Weeklys) 36.5 PUT
.49 DEBIT LMT
Input:
BUY +1 DIAGONAL AMGN 100 (Weeklys) 4 SEP 20/28 AUG 20 245/240 CALL #.07 LMT
Output:
DIAGONAL
BUY +1 AMGN 4 SEP 20 (Weeklys) 245 CALL
SELL +1 AMGN 28 AUG 20 (Weeklys) 240 CALL
-.07 CREDIT LMT
On the surface a context-free grammar appears to be a good solution to express the various syntax (diagonal spreads are more complicated). But having almost no experience with context-free grammars I am not sure how I would carry the numbers over and also how I would for instance add the SELL orders which are not explicitly mentioned in the original order string. The SELL leg is assumed due to it being a vertical spread for example.
Hope this makes sense even if you are not an option trader ;-) The basic idea here is that translating the original string requires a bit of intelligence and is not just a matter of generating different text.
Any insights and pointers would be welcome.

It's a little hard to tell from only 2 examples, but my guess is, using a context-free grammar (especially if you have almost no experience with them) is probably overkill. The grammar itself would probably be simple enough, but you would need to either add 'actions' to transform the recognized input into the desired output, or have the parser build a syntax-tree and then write code to extract the data from the tree and generate the desired output.
It would be simpler to use regular expressions with capturing. For instance, here's some python3 code that pretty much handles your 2 examples:
import sys, re
for line in sys.stdin:
mo = re.fullmatch(r'BUY \+(\d+) (VERTICAL|DIAGONAL) (\S+) 100 \(Weeklys\) (\d+ \w+ \d+)(?:/(\d+ \w+ \d+))? ([\d.]+)/([\d.]+) (PUT|CALL) #(.\d+) LMT\n', line)
(n_units, vert_or_diag, name, date1, date2, price1, price2, put_or_call, limit) = mo.groups()
if vert_or_diag == 'VERTICAL':
assert date2 is None
date2 = date1
print()
print(vert_or_diag)
print(f"BUY +{n_units} {name} {date1} (Weeklys) {price1} {put_or_call}")
print(f"SELL -{n_units} {name} {date2} (Weeklys) {price2} {put_or_call}")
print(f"{limit} DEBIT LMT")
It's not perfect, because the problem isn't perfectly specified (e.g., it's unclear what causes the human readable format to have a positive DEBIT vs a negative CREDIT). And the space of inputs is no doubt larger than the regex currently handles.
The point is just to show that, based on the examples given, regular expressions could be a compact solution to the general problem.

Related

why pytz.country_timezones('cn') in centos system have different result?

Two computer install centos 6.5, kernel is 3.10.44, have different result.
one result is [u'Asia/Shanghai', u'Asia/Urumqi'], and the other is ['Asia/Shanghai', 'Asia/Harbin', 'Asia/Chongqing', 'Asia/Urumqi', 'Asia/Kashgar'].
Is there any config that make the first result same as the second result?
I have following python code:
def get_date():
date = datetime.utcnow()
from_zone = pytz.timezone("UTC")
to_zone = pytz.timezone("Asia/Urumqi")
date = from_zone.localize(date)
date = date.astimezone(to_zone)
return date
def get_curr_time_stamp():
date = get_date()
stamp = time.mktime(date.timetuple())
return stamp
cur_time = get_curr_time_stamp()
print "1", time.strftime("%Y %m %d %H:%M:%S", time.localtime(time.time()))
print "2", time.strftime("%Y %m %d %H:%M:%S", time.localtime(cur_time))
When use this code to get time, the result of one computer(have 2 results) is:
1 2016 04 20 08:53:18
2 2016 04 20 06:53:18
and the other(have 5 results) is:
1 2016 04 20 08:53:18
2 2016 04 20 08:53:18
I don't know why?
You probably just have an outdated version of pytz on the system returning five time zones (or perhaps on both systems). You can find the latest releases here. It's important to stay on top of time zone updates, as the various governments of the world change their time zones often.
Like most systems, pytz gets its data from the tz database. The five time zones for China were reduced to two in version 2014f (corresponding to pytz 2014.6). From the release notes:
China's five zones have been simplified to two, since the post-1970
differences in the other three seem to have been imaginary. The
zones Asia/Harbin, Asia/Chongqing, and Asia/Kashgar have been
removed; backwards-compatibility links still work, albeit with
different behaviors for time stamps before May 1980. Asia/Urumqi's
1980 transition to UTC+8 has been removed, so that it is now at
UTC+6 and not UTC+8. (Thanks to Luther Ma and to Alois Treindl;
Treindl sent helpful translations of two papers by Guo Qingsheng.)
Also, you may wish to read Wikipedia's Time in China article, which explains that the Asia/Urumqui entry is for "Ürümqi Time", which is used unofficially in some parts of the Xinjiang region. This zone is not recognized by the Chinese government, and is considered a politically charged issue. As such, many systems choose to omit the Urumqi time zone, despite it being in listed in the tz database.

Greedy algorithm to finish a task with time constraint

This is a question from my midterm today and I wonder how to solve this. All i know is to prove the greedy algorithm using induction.
Question:
You are working on a programming project. There are n Java classes C1, C2, ..., Cn (the bossy architect says so). The architect also says that these classes have to be implemented in order (you are not allowed to implement C2 before you have completed C1 and so on).
Each of the Java classes takes at most 8 hours to implement. You work exactly 8 hours a day, and you should not leave a Java class unfinished at the end of the day.
To complete the project as soon as possible, a strategy is to implement as many classes as you can everyday. Prove that this greedy strategy is indeed the optimal one.
(Hint: let ti be the total number of classes completed in the first i days using the above strategy. The strategy always stays ahead if ti is no less than the total number of classes completed in the first i days using any other strategy)
This problem is similar to the classic task scheduling case where the waiting time in the system must be minimized.
Let C1, C2, ..., Cn your projected classes and c[1], c[2], ..., c[n] their required implementation time. Let's suppose you implement C1, C2, ... Cn in this order. Therefore, the total time (waiting + implementation) for each class Ck will be:
c[1] + c[2] + ... + c[k]
Therefore, we have the total time:
T = n·c[1] + (n – 1)·c[2] + ... + 2*c*[n – 1] + c[n] = sum(k = 1 to n) of (n – k + 1)·c[k]
(Sorry about the presentation — superscripts, subscripts, and math equations aren't supported...)
Let's suppose the implementation times in our permutation are not sorted by ascending order. We can therefore find two integers a and b such that a < b with c[a] > c[b]. If we switch them in the computation of T, we have:
T' = (n – a + 1)·c[b] + (n – b + 1)·c[a] + sum(k = 1 to n except a, b) of (n – k + 1)·c[k]
We finally compute T – T':
T – T' = (n – a + 1)(c[a] – c[b]) + (n – b + 1)(c[b] – c[a]) = (b – a)(c[a] – c[b])
Following our initial hypothesis (a < b and c[a] > c[b]), we have b – a > 0 and c[a] – c[b] > 0 as well, hence T – T' > 0.
This proves that we decrease the total waiting time by switching any pair of tasks so that the shorter one is done first.
Your problem statement is the same, except that before starting implementing a new class, you have to check whether you should start it now (if there is enough time left on the present day) or tomorrow. But the principle proven here holds when it comes to minimizing the total "waiting" time.
This is not a programming question for SO. The problem is not asking for a coding solution, rather its a proof that greedy is optimal. Which can be done with a proof by contradiction (no doubt taught in the class before the midterm).
What you want to do is to calculate the total time taken by greedy (there's only one solution) and disprove that any swaps in day would lead to a better solution. You probably also have to add something that mentions how swapping will allow u to permute the order to the optimal solution, if it exists.
I was going to write some formulae, but i realize Jeff Morin already has the equations, just going in the opposite direction. I think starting from the greedy solution might be easier to explain, since 'in order' is pretty much defined by the problem and you can only shift the work +- which day its done on.
The problem statement is incomplete. There is no indication that any class will take less than 8 hours. Since you can't leave any class unfinished, then you must start each class at the start of the day to be sure to have at least 8 hours to work on it. So if C2 really takes 3 hours and C3 really takes 5 hours, then a greedy algorithm would allow both classes to be done the same day. But after C2 takes 3 hours, you have to wait to day 3 to start C3 to be sure that you have enough time to finish since you don't know how long C3 will take.
So the restrictions really end up dictating that the effort will take n days, 1 day per class. So the implementation algorithm is strictly sequential, not greedy.
Edit Restrictions stated in problem.
(1) There are n Java classes C1, C2, ..., Cn
(2) these classes have to be implemented in order (you are not allowed to implement C2 before you have completed C1 and so on).
(3) Each of the Java classes takes at most 8 hours to implement
But there is no estimate for any class taking less than 8 hours.
(4) You work exactly 8 hours a day
(5) You should not leave a Java class unfinished at the end of the day.
The gist of this (3,4,& 5) is let's assume that I work on class 1 for 5 minutes. I now have 7 hours 55 minutes left. Can I start on Class 2? No because it might take up to 8 hours and I must finish before the end of my 8-hour day. So I must wait to day 2 to start class 2 and so on. Thus the implementation is strictly sequential and will take n days to complete, 1 day per class.
In order to use the Greedy algorithm you'd need additional information.
(6) You also know that each class has a known number of hours needed to code the class - h1, h2, h3, ..., hn. So class 1 takes h1 hours, class 2 takes h2 hours and so on. (From item 3 no class takes more than 8 hours)

variations in huffman encoding codewords

I'm trying to solve some huffman coding problems, but I always get different values for the codewords (values not lengths).
for example, if the codeword of character 'c' was 100, in my solution it is 101.
Here is an example:
Character Frequency codeword my solution
A 22 00 10
B 12 100 010
C 24 01 11
D 6 1010 0110
E 27 11 00
F 9 1011 0111
Both solutions have the same length for codewords, and there is no codeword that is prefix of another codeword.
Does this make my solution valid ? or it has to be only 2 solutions, the optimal one and flipping the bits of the optimal one ?
There are 96 possible ways to assign the 0's and 1's to that set of lengths, and all would be perfectly valid, optimal, prefix codes. You have shown two of them.
There exist conventions to define "canonical" Huffman codes which resolve the ambiguity. The value of defining canonical codes is in the transmission of the code from the compressor to the decompressor. As long as both sides know and agree on how to unambiguously assign the 0's and 1's, then only the code length for each symbol needs to be transmitted -- not the codes themselves.
The deflate format starts with zero for the shortest code, and increments up. Within each code length, the codes are ordered by the symbol values, i.e. sorting by symbol. So for your code that canonical Huffman code would be:
A - 00
C - 01
E - 10
B - 110
D - 1110
F - 1111
So there the two bit codes are assigned in the symbol order A, C, E, and similarly, the four bit codes are assigned in the order D, F. Shorter codes are assigned before longer codes.
There is a different and interesting ambiguity that arises in finding the code lengths. Depending on the order of combination of equal frequency nodes, i.e. when you have a choice of more than two lowest frequency nodes, you can actually end up with different sets of code lengths that are exactly equally optimal. Even though the code lengths are different, when you multiply the lengths by the frequencies and add them up, you get exactly the same number of bits for the two different codes.
There again, the different codes are all optimal and equally valid. There are ways to resolve that ambiguity as well at the time the nodes to combine are chosen, where the benefit can be minimizing the depth of the tree. That can reduce the table size for table-driven Huffman decoding.
For example, consider the frequencies A: 2, B: 2, C: 1, D: 1. You first combine C and D to get 2. Then you have A, B, and C+D all with frequency 2. Now you can choose to combine either A and B, or C+D with A or B. This gives two different sets of bit lengths. If you combine A and B, you get lengths: A-2, B-2, C-2, and D-2. If you combine C+D with B, you get A-1, B-2, C-3, D-3. Both are optimal codes, since 2x2 + 2x2 + 1x2 + 1x2 = 2x1 + 2x2 + 1x3 + 1x3 = 12, so both codes use 12 bits to represent those symbols that many times.
The problem is, that there is no problem.
You huffman tree is valid, it also gives the exactly same results after encoding and decoding. Just think if you would build a huffman tree by hand, there are always more ways to combine items with equal (or least difference) value. E.g. if you have A B C (everyone frequency 1), you can at first combine A and B, and the result with C, or at first B and C, and the result with a.
You see, there are more correct ways.
Edit: Even with only one possible way to combine the items by frequency, you can get different results because you can assign 1 for the left or for the right branch, so you would get different (correct) results.

Can a SHA-1 hash be all-zeroes?

Is there any input that SHA-1 will compute to a hex value of fourty-zeros, i.e. "0000000000000000000000000000000000000000"?
Yes, it's just incredibly unlikely. I.e. one in 2^160, or 0.00000000000000000000000000000000000000000000006842277657836021%.
Also, becuase SHA1 is cryptographically strong, it would also be computationally unfeasible (at least with current computer technology -- all bets are off for emergent technologies such as quantum computing) to find out what data would result in an all-zero hash until it occurred in practice. If you really must use the "0" hash as a sentinel be sure to include an appropriate assertion (that you did not just hash input data to your "zero" hash sentinel) that survives into production. It is a failure condition your code will permanently need to check for. WARNING: Your code will permanently be broken if it does.
Depending on your situation (if your logic can cope with handling the empty string as a special case in order to forbid it from input) you could use the SHA1 hash ('da39a3ee5e6b4b0d3255bfef95601890afd80709') of the empty string. Also possible is using the hash for any string not in your input domain such as sha1('a') if your input has numeric-only as an invariant. If the input is preprocessed to add any regular decoration then a hash of something without the decoration would work as well (eg: sha1('abc') if your inputs like 'foo' are decorated with quotes to something like '"foo"').
I don't think so.
There is no easy way to show why it's not possible. If there was, then this would itself be the basis of an algorithm to find collisions.
Longer analysis:
The preprocessing makes sure that there is always at least one 1 bit in the input.
The loop over w[i] will leave the original stream alone, so there is at least one 1 bit in the input (words 0 to 15). Even with clever design of the bit patterns, at least some of the values from 0 to 15 must be non-zero since the loop doesn't affect them.
Note: leftrotate is circular, so no 1 bits will get lost.
In the main loop, it's easy to see that the factor k is never zero, so temp can't be zero for the reason that all operands on the right hand side are zero (k never is).
This leaves us with the question whether you can create a bit pattern for which (a leftrotate 5) + f + e + k + w[i] returns 0 by overflowing the sum. For this, we need to find values for w[i] such that w[i] = 0 - ((a leftrotate 5) + f + e + k)
This is possible for the first 16 values of w[i] since you have full control over them. But the words 16 to 79 are again created by xoring the first 16 values.
So the next step could be to unroll the loops and create a system of linear equations. I'll leave that as an exercise to the reader ;-) The system is interesting since we have a loop that creates additional equations until we end up with a stable result.
Basically, the algorithm was chosen in such a way that you can create individual 0 words by selecting input patterns but these effects are countered by xoring the input patterns to create the 64 other inputs.
Just an example: To make temp 0, we have
a = h0 = 0x67452301
f = (b and c) or ((not b) and d)
= (h1 and h2) or ((not h1) and h3)
= (0xEFCDAB89 & 0x98BADCFE) | (~0x98BADCFE & 0x10325476)
= 0x98badcfe
e = 0xC3D2E1F0
k = 0x5A827999
which gives us w[0] = 0x9fb498b3, etc. This value is then used in the words 16, 19, 22, 24-25, 27-28, 30-79.
Word 1, similarly, is used in words 1, 17, 20, 23, 25-26, 28-29, 31-79.
As you can see, there is a lot of overlap. If you calculate the input value that would give you a 0 result, that value influences at last 32 other input values.
The post by Aaron is incorrect. It is getting hung up on the internals of the SHA1 computation while ignoring what happens at the end of the round function.
Specifically, see the pseudo-code from Wikipedia. At the end of the round, the following computation is done:
h0 = h0 + a
h1 = h1 + b
h2 = h2 + c
h3 = h3 + d
h4 = h4 + e
So an all 0 output can happen if h0 == -a, h1 == -b, h2 == -c, h3 == -d, and h4 == -e going into this last section, where the computations are mod 2^32.
To answer your question: nobody knows whether there exists an input that produces all zero outputs, but cryptographers expect that there are based upon the simple argument provided by daf.
Without any knowledge of SHA-1 internals, I don't see why any particular value should be impossible (unless explicitly stated in the description of the algorithm). An all-zero value is no more or less probable than any other specific value.
Contrary to all of the current answers here, nobody knows that. There's a big difference between a probability estimation and a proof.
But you can safely assume it won't happen. In fact, you can safely assume that just about ANY value won't be the result (assuming it wasn't obtained through some SHA-1-like procedures). You can assume this as long as SHA-1 is secure (it actually isn't anymore, at least theoretically).
People doesn't seem realize just how improbable it is (if all humanity focused all of it's current resources on finding a zero hash by bruteforcing, it would take about xxx... ages of the current universe to crack it).
If you know the function is safe, it's not wrong to assume it won't happen. That may change in the future, so assume some malicious inputs could give that value (e.g. don't erase user's HDD if you find a zero hash).
If anyone still thinks it's not "clean" or something, I can tell you that nothing is guaranteed in the real world, because of quantum mechanics. You assume you can't walk through a solid wall just because of an insanely low probability.
[I'm done with this site... My first answer here, I tried to write a nice answer, but all I see is a bunch of downvoting morons who are wrong and can't even tell the reason why are they doing it. Your community really disappointed me. I'll still use this site, but only passively]
Contrary to all answers here, the answer is simply No.
The hash value always contains bits set to 1.

How do I compute equinox/solstice moments?

What algorithms or formulas are available for computing the equinoxes and solstices? I found one of these a few years ago and implemented it, but the precision was not great: the time of day seemed to be assumed at 00:00, 06:00, 12:00, and 18:00 UTC depending on which equinox or solstice was computed. Wikipedia gives these computed out to the minute, so something more exact must be possible. Libraries for my favorite programming language also come out to those hardcoded times, so I assume they are using the same or a similar algorithm as the one I implemented.
I also once tried using a library that gave me the solar longitude and implementing a search routine to zero in on the exact moments of 0, 90, 180, and 270 degrees; this worked down to the second but did not agree with the times in Wikipedia, so I assume there was something wrong with this approach. I am, however, pleasantly surprised to discover that Maimonides (medieval Jewish scholar) proposed an algorithm using the exact same idea a millenium ago.
A great source for the (complex!) underlying formulas and algorithms is Astronomical Algorithms by Jean Meeus.
Using the PyMeeus implementation of those algorithms, and the code below, you can get the following values for the 2018 winter solstice (where "winter" refers to the northern hemisphere).
winter solstice for 2018 in Terrestrial Time is at:
(2018, 12, 21, 22, 23, 52.493725419044495)
winter solstice for 2018 in UTC, if last leap second was (2016, 12):
(2018, 12, 21, 22, 22, 43.30972542127711)
winter solstice for 2018 in local time, if last leap second was (2016, 12)
and local time offset is -7.00 hours:
(2018, 12, 21, 15, 22, 43.30973883232218)
i.e. 2018-12-21T15:22:43.309725-07:00
Of course, the answer is not accurate down to microseconds, but I also wanted to show how to do high-precision conversions with arrow.
Code:
from pymeeus.Sun import Sun
from pymeeus.Epoch import Epoch
year = 2018 # datetime.datetime.now().year
target="winter"
# Get terrestrial time of given solstice for given year
solstice_epoch = Sun.get_equinox_solstice(year, target=target)
print("%s solstice for %d in Terrestrial Time is at:\n %s" %
(target, year, solstice_epoch.get_full_date()))
print("%s solstice for %d in UTC, if last leap second was %s:\n %s" %
(target, year, Epoch.get_last_leap_second()[:2], solstice_epoch.get_full_date(utc=True)))
solstice_local = (solstice_epoch + Epoch.utc2local()/(24*60*60))
print("%s solstice for %d in local time, if last leap second was %s\n"
" and local time offset is %.2f hours:\n %s" %
(target, year, Epoch.get_last_leap_second()[:2],
Epoch.utc2local() / 3600., solstice_local.get_full_date(utc=True)))
Using the very cool more ISO and TZ aware module Arrow: better dates and times for Python, that can be printed more nicely:
import arrow
import math
slutc = solstice_epoch.get_full_date(utc=True)
frac, whole = math.modf(slutc[5])
print("i.e. %s" % arrow.get(*slutc[:5], int(whole), round(frac * 1e6)).to('local'))
I'm not sure if this is an accurate enough solution for you, but I found a NASA website that has some code snippets for calculating the vernal equinox as well as some other astronomical-type information. I've also found some references to a book called Astronomical Algorithms which may have the answers you need if the info somehow isn't available online.
I know you're looking for something that'll paste into an answer here, but I have to mention SPICE, a toolkit produced by NAIF at JPL, funded by NASA. It might be overkill for Farmer's Almanac stuff, but you mentioned interest in precision and this toolkit is routinely used in planetary science.
I have implemented Jean Meeus' (the author of the Astronomical Algorithms referenced above) equinox and solstice algorithm in C and Java, if you're interested.

Resources