Interval Partitioning by Finish Time? - greedy

Given n lectures, each with a start time and a finish time, the problem is to assign all the lectures to rooms such that no two lectures occur at the same time in the same room. It is easy to design a greedy algorithm that sorts lectures by start time to minimize the number of rooms used. What if we process lectures by their finish time? Does the greedy algorithm still work? I tried a few examples and have not found a counterexample yet.
Greedy algorithm by start time:
1. Sort all the lectures by start time in ascending order.
2. Keep all the rooms in use in a priority queue. The priority of a room is the finish time of the last lecture scheduled in that room.
3. Assign each lecture to the first available room in the queue. A room is available if the finish time of the last lecture scheduled in it is less than the start time of the current lecture (no overlap).
4. If none of the rooms in the queue is available, add a new room and place the lecture in the new room.
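For concreteness, here is a minimal Python sketch of the start-time greedy, with a min-heap keyed by each room's last finish time standing in for the priority queue (my own sketch, not from the original post):

import heapq

def min_rooms(lectures):
    # lectures: list of (start, finish) pairs.
    # Returns the number of rooms used. The sort key is the start time;
    # swap it for the finish time to experiment with the variant asked about below.
    rooms = []  # min-heap of the finish time of the last lecture in each room
    for start, finish in sorted(lectures, key=lambda lec: lec[0]):
        if rooms and rooms[0] <= start:
            # The earliest-finishing room is free (back-to-back allowed): reuse it.
            heapq.heapreplace(rooms, finish)
        else:
            # Every room is still busy: open a new room.
            heapq.heappush(rooms, finish)
    return len(rooms)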
The above greedy algorithm works. What if we change step 1 to sort lectures by finish time and keep the other steps the same? Does it still work?
If we sort lectures by finish time in descending order, that is, we process the last-finishing lecture first, that is just a mirror image of the above greedy algorithm, and I believe it definitely works.
But my question is: if we sort lectures by finish time in ascending order and apply the greedy algorithm, does it still work?

Related

Create a model that predicts an event based on other time series events and properties of an object

I have the following data:
Identifier of a person
Days in location (starts at 1 and runs until event)
Age of person in months at that time (so this increases as the days in location increase too).
Smoker (boolean), doesn't change over time in our case
Sex, doesn't change over time
Fall (boolean): an event that may never happen, or may happen multiple times during the complete period for a certain person
Number of wounds: this can go from 0 to 8; a wound usually doesn't heal immediately, so it tends to stay open for a certain period of time
Event we want to predict (boolean), only the last row of a person will have value true for this
I have this data for 1500 people (in total 1500000 records so on average about 1000 records per person). For some people the event I want to predict takes place after a couple of days, for some after 10 years. For everybody in the dataset the event will take place, so the last record for a certain identifier will always have the event we want to predict as 1.
I'm new to this, and all the documentation I have found so far doesn't demonstrate time series for multiple persons or objects. When I split the data in Machine Learning Studio, for example, I want to keep records of the same person over time together.
Would it be possible to feed the system after the model is trained with new records and for each day that passes it would give the estimate of the event taking place in the next 5 days?
Edit: sample data of 2 persons: http://pastebin.com/KU4bjKwJ
This sounds very similar to this sample:
https://gallery.cortanaintelligence.com/Experiment/df7c518dcba7407fb855377339d6589f
Unfortunately there is going to be a bit of R code involved. Yes, you should be able to retrain the model with new data.
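On the splitting concern from the question (keeping all of one person's rows together), here is a small Python sketch of a grouped split as an alternative outside of Studio; the file and column names are illustrative only:

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("person_days.csv")   # hypothetical file: one row per person per day
X = df.drop(columns=["event"])
y = df["event"]

# Split so that every record of a given person lands in the same fold.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=df["identifier"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]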

Maintaining a Relative Order of Flex Tasks in a SQL DB

I have a task scheduling app that allows people to create two types of tasks:
•Strict: tasks with a set start time and duration
•Flex: tasks that have a duration, but no specific start time
It's also important to understand how flex tasks operate. Flex tasks continuously reschedule themselves throughout your day into the nearest open time you have. For example, if the only task on your schedule today is a flex task like "Go workout - duration: 60 mins" and you open the app at 4pm, it will have "Go workout" scheduled from 4-5pm for you. If you don't click the checkbox indicating you completed the task and open the app again at 5pm, "Go workout" will be rescheduled to 5-6pm, so that the stuff you are meaning to get done is constantly in your face, trying to fit itself into the gaps of your life.
When a user views their schedule here are the steps I go through:
•Grab an array of all strict tasks
•Grab an array of all flex tasks
•Loop through each strict task and figure out how big a time gap there is between the current task's end time and the next task's start time.
•If a gap exists, loop through the flex tasks and see if any of them will fit in the time gap in question. If a flex task is small enough to fit in the gap, add it to the strictTasksArray between the current task and the next task (see the sketch after this list).
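As a rough Python sketch of that gap-filling pass (function and variable names are illustrative, not from the real app):

from datetime import datetime, timedelta

def build_schedule(strict_tasks, flex_tasks, now):
    # strict_tasks: list of (name, start, end) with datetime start/end, sorted by start.
    # flex_tasks:   list of (name, duration) with timedelta durations,
    #               already in the user's preferred relative order.
    # Returns a combined list of (name, start, end).
    remaining = list(flex_tasks)
    result = []
    cursor = now
    for name, start, end in strict_tasks:
        # Fill the gap before this strict task, preferring earlier flex tasks
        # but falling back to any later one that still fits.
        while True:
            fit = next(((n, d) for n, d in remaining if cursor + d <= start), None)
            if fit is None:
                break
            remaining.remove(fit)
            result.append((fit[0], cursor, cursor + fit[1]))
            cursor += fit[1]
        result.append((name, start, end))
        cursor = max(cursor, end)
    # Whatever is left spills over after the last strict task.
    for n, d in remaining:
        result.append((n, cursor, cursor + d))
        cursor += d
    return result

# Example: one strict meeting, two flex tasks kept in relative order.
print(build_schedule(
    [("Meeting", datetime(2024, 1, 1, 17, 0), datetime(2024, 1, 1, 18, 0))],
    [("Go workout", timedelta(minutes=60)), ("Pay bills", timedelta(minutes=15))],
    now=datetime(2024, 1, 1, 16, 0),
))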
This works great as long as there is no need for any kind of ordering when it comes to flex tasks. But now we have added the ability for users to drag and drop flex tasks into a specific relative order. For example, if I have tasks A, B, C, D
and I drag tasks D and B to the front so that the order is now D, B, A, C, the app needs to save that ordering so that if you close and reopen it, it will still remember to try to fit task D in first, followed by B, A and C. I'm having big trouble thinking of an efficient way to do this, considering the ordering is relative and not strict. Any ideas how to save relative ordering in a SQLite DB without having to update every task's DB record every time a user drags/drops a task and changes the relative ordering?
If you have ever coded in Basic, you might remember numbering code lines. It was advisable to number in increments of 10, so that if you later had to insert a line or two you wouldn't have to re-number all the code; you could just assign a new number in between those of the previous and the next lines.
So, in your situation I would create a numeric field for Rank and for each new flex task assign Rank = max(Rank) + 1024 (for example). Afterwards, if the tasks are rearranged, I would update just the "moved" task's Rank with the average Rank of its new previous and next neighbours. That way any Rank change is an update to one row only. Of course, if Rank is an int and I run out of integers in between two tasks, I would have to update them all, but that should be a rare occasion and I would just re-rank them in new increments of 1024.
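As a rough illustration of that gap-based ranking, here is a small Python/SQLite sketch (my own; the schema is made up, and the column is called sort_rank to sidestep the SQL keyword):

import sqlite3

GAP = 1024  # initial spacing between ranks, as suggested above

conn = sqlite3.connect("tasks.db")  # hypothetical file name
conn.execute("""CREATE TABLE IF NOT EXISTS flex_task (
                    id INTEGER PRIMARY KEY,
                    name TEXT,
                    sort_rank INTEGER NOT NULL)""")

def add_task(name):
    # New tasks go to the end: sort_rank = max(sort_rank) + GAP.
    top = conn.execute("SELECT COALESCE(MAX(sort_rank), 0) FROM flex_task").fetchone()[0]
    conn.execute("INSERT INTO flex_task (name, sort_rank) VALUES (?, ?)", (name, top + GAP))
    conn.commit()

def move_task(task_id, prev_rank, next_rank):
    # Drag/drop: update only the moved row, giving it the average of its new
    # neighbours' ranks (pass prev_rank=0 when moving to the very front).
    new_rank = (prev_rank + next_rank) // 2
    if new_rank == prev_rank:
        # No integer left between the neighbours: re-rank everything once,
        # then the caller retries with fresh neighbour ranks.
        renumber()
        return False
    conn.execute("UPDATE flex_task SET sort_rank = ? WHERE id = ?", (new_rank, task_id))
    conn.commit()
    return True

def renumber():
    # Rare case: restore gaps of GAP between consecutive tasks.
    rows = conn.execute("SELECT id FROM flex_task ORDER BY sort_rank").fetchall()
    for i, (tid,) in enumerate(rows, start=1):
        conn.execute("UPDATE flex_task SET sort_rank = ? WHERE id = ?", (i * GAP, tid))
    conn.commit()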
Sounds like you'd need either a priority or an order_number column to set the order in which the tasks come in. Just make it an int and weight the tasks accordingly. If you need the DBMS to keep them in order in a query, you'd have to sort:
SELECT task_id, task_group_id, task_name, completed, priority
FROM tasks
WHERE user = ? AND task_group_id = ? AND completed = 0
ORDER BY priority ASC
You can use a foreign key to a task_group table to group certain tasks together if they're multipart, and then build a query to find all the ones that are either complete or incomplete. The weighting assigned would still be correct, because the tasks don't refer to each other by ID.

Handling change of grain for a snapshot fact table in a star-schema

The question
How do you handle a change in grain (from weekly measurement to daily measurement) for a snapshot fact table?
Background info
For a star-schema design I want to incorporate the results of a survey as a fact (e.g. in week 2 of 2015, 80% of the respondents responded 'yes'; in week 3, 76%; etc.).
This survey is conducted each week, and I only have access to the result of the survey (% of people saying yes this week) and not to the individual responses.
Based on (my interpretation of) Christopher Adamson's "Star Schema: The Complete Reference", I believe I should use a snapshot fact table for this kind of measurement.
The date dimension for this fact should be on the week-level, and be a conformed rollup of a more fine-grained date dimension for other facts in other stars that take place on a daily basis.
Here comes trouble
Now someone decides they want to conduct these surveys daily instead of weekly. What is the best way to handle this? Some of the options I'm currently considering:
change the week dimension to a daily one, and fake the old facts as if they happened on the last day of the week.
change the week dimension to a daily one, and add 7 facts for each weekly one.
create a new star, with the daily fact and dimension and treat the old one as an aggregate.
I'd appreciate any input. Please tell me if my logic is off, or my question is not clear :)
I'm not convinced that this is a snapshot. Each survey response represents a "transaction".
With an appropriate date dimension you can calculate the Yes/No percentages, rolled up by week.
Further, this would enable you to show results like "Surveys issued on a Sunday night get more responses", or "People who respond on Friday are more likely to answer 'Yes'". (contrived examples)
Following clarification, this does look like a periodic snapshot. The example of a bank account balance is often used to describe a similar scenario.
A key feature of a periodic snapshot is that every combination of every dimension should be present. If your grain is monthly, then every month you record the fact, even if it has not changed from the previous month.
I think that is the key to your problem. Knowing that your grain may change from weekly to daily, make your grain daily. It does mean you'll be repeating the weekly value on every day of the week, but that is a true representation of your knowledge of the fact; on Wednesday you only knew that its value was the same as Monday.
If you design your ETL right, you won't need to make any changes when the daily updates begin.
Your second option is the one I'd choose in your place.
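If it helps, here is a rough pandas sketch of that ETL step (the data and column names are made up): each weekly survey fact is expanded into seven daily rows carrying the same value, so the fact table is at daily grain from the start.

import pandas as pd

# Hypothetical weekly survey facts, one row per week.
weekly = pd.DataFrame({
    "week_start": pd.to_datetime(["2015-01-05", "2015-01-12"]),
    "pct_yes": [0.80, 0.76],
})

rows = []
for _, fact in weekly.iterrows():
    for offset in range(7):
        # Repeat the weekly value on every day of that week.
        rows.append({"date": fact["week_start"] + pd.Timedelta(days=offset),
                     "pct_yes": fact["pct_yes"]})
daily = pd.DataFrame(rows)
print(daily)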

Greedy Algorithm implementation

So I have some questions concerning the solution to the problem of scheduling n activities that may overlap using the least number of classrooms possible. The solution is below:
Find the smallest number of classrooms to schedule a set of activities S in. To do this efficiently, move through the activities according to starting and finishing times. Maintain two lists of classrooms: rooms that are busy at time t and rooms that are free at time t. When t is the starting time for some activity, schedule this activity in a free room and move the room to the busy list. Similarly, move the room to the free list when the activity stops. Initially start with zero rooms. If there are no rooms in the free list, create a new room.
The algorithm can be implemented by sorting the activities. At each start or finish time we can schedule the activities and move the rooms between the lists in constant time. The total time is thus dominated by sorting and is therefore O(n lg n).
My questions are
1) First, how do you move through the activities by both starting and finishing time at the same time?
2) I don't quite understand how it's possible to move the rooms between lists in constant time. If you want to move rooms from the busy list to the free list, don't you have to iterate over all the rooms in the busy list and see which ones have end times that have already passed?
3) Are there any 'state' variables that we need to keep track of while doing this to make it work?
The way the algorithm works, you need to create a list containing an element for each start time and an element for each end time (so 2n elements in total if there are n activities). Sort this list. When an end time and a start time are equal, sort the end time first -- this will cause back-to-back bookings for halls to work.
If you use linked lists for holding the free and booked halls, you can have the elements you created in step 1 hold pointers back to an activity structure, and this structure can hold a pointer to the list element containing the hall that this activity is assigned to. This will be NULL initially, and will take on a value when that hall is used for that activity. Then when that activity ends, its hall can be looked up in constant time by following two pointers from the activity-end element (first to the activity object, and from there to the hall element).
That should be clear from the above description, hopefully.
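In case it helps, here is a compact Python sketch of that event-sweep idea (mine, not from the original answer; a plain list of free hall indices stands in for the linked free list, which keeps every move constant-time):

def assign_halls(activities):
    # activities: list of (start, finish) pairs with start < finish.
    # Returns (hall_of, rooms) where hall_of[i] is the hall index assigned
    # to activity i and rooms is the total number of halls used.
    events = []
    for i, (s, f) in enumerate(activities):
        events.append((s, 1, i))  # 1 = start; sorts after 0 = end on equal times,
        events.append((f, 0, i))  # so back-to-back bookings can reuse a hall
    events.sort()

    free = []                          # indices of currently free halls
    hall_of = [None] * len(activities)
    rooms = 0
    for _, kind, i in events:
        if kind == 1:                  # activity i starts
            if free:
                hall_of[i] = free.pop()   # reuse a free hall
            else:
                hall_of[i] = rooms        # open a new hall
                rooms += 1
        else:                          # activity i ends: its hall becomes free
            free.append(hall_of[i])
    return hall_of, rooms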

Scalable time decay for web application

My goal here is to generate a system similar to that of the front page of reddit.
I have things, and for the sake of simplicity these things have votes. The best system I've come up with uses time decay: with a half-life of 7 days, if a vote is worth 20 points today, then in seven days it is worth 10 points, and in 14 days it will only be worth 5 points.
The problem is, that while this produces results I am very happy with, it doesn't scale. Every vote requires me to effectively recompute the value of every other vote.
So, I thought I might be able to reverse the idea. A vote today is worth 1 point. A vote seven days from now is worth 2 points, and 14 days from now is worth 4 points and so on. This works well because for each vote, I only have to update one row. The problem is that by the end of the year, I need a datatype that can hold fantastically huge numbers.
So, I tried using a linear growth which produced terrible rankings. I tried polynomial growth (squaring and cubing the number of days since site launch and submission) and it produced slightly better results. However, as I get slightly better results, I'm quickly re-approaching unmaintainable numbers.
So, I come to you, Stack Overflow. Who's got a genius idea, or a link to an idea, on how to model this system so it scales well for a web application?
I've been trying to do this as well. I found what looks like a solution, but unfortunately, I forgot how to do math, so I'm having trouble understanding it.
The idea is to store the log of your score and sort by that, so the numbers won't overflow.
This doc describes the math.
https://docs.google.com/View?id=dg7jwgdn_8cd9bprdr
And the comment where I found it is here:
http://blog.notdot.net/2009/12/Most-popular-metrics-in-App-Engine#comment-25910828
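To make the log idea concrete, here is a small sketch of my own (not from the linked doc) using the log-sum-exp trick. With a 7-day doubling period, a vote cast t days after launch is worth 2**(t/7) raw points, but only log2 of the running total is ever stored, so nothing overflows:

import math

DOUBLING_DAYS = 7.0   # a vote's weight doubles every 7 days after launch

def add_vote(stored_log2, days_since_launch):
    # Add one vote worth 2**(t/7) raw points, keeping only log2 of the total:
    # log2(2**a + 2**b) = max(a, b) + log2(1 + 2**(min(a, b) - max(a, b)))
    vote_log2 = days_since_launch / DOUBLING_DAYS
    if stored_log2 is None:                  # first vote on this item
        return vote_log2
    hi, lo = max(stored_log2, vote_log2), min(stored_log2, vote_log2)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))

# Rank items by their stored log2 score, descending.
score = None
for day in (0, 1, 2, 700):
    score = add_vote(score, day)
print(score)   # about 100, even though the raw total is roughly 2**100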
Okay, I thought of one solution that does that on every vote. The catch is that it requires a linked list with atomic pop/push on both sides to store votes (e.g. a Redis list, but you probably don't want it in RAM).
It also requires that the decay interval is constant (e.g. 1 hour).
It goes like this:
1. On every vote, update the score and push the next decay time of this vote to the tail of the list.
2. Then pop the first vote from the head of the list.
3. If it's not old enough to decay, push it back to the head.
4. Otherwise, subtract the required amount from the total score and push the updated information to the tail.
5. Repeat from step 2 until you hit a fresh enough vote (step 3).
You'll still have to check the heads in the background to clear the posts that no one votes on anymore, of course.
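Here is a rough Python sketch of that vote queue, assuming redis-py, a constant 1-hour decay interval and the 7-day half-life from the question; the key names are made up and the pop/modify/push sequence is not made atomic here:

import time
import redis  # assumes redis-py; any list with atomic push/pop on both ends would do

INTERVAL = 3600                        # constant decay interval: 1 hour
HALF_LIFE = 7 * 24 * 3600              # 7-day half-life
KEEP = 0.5 ** (INTERVAL / HALF_LIFE)   # fraction of a vote kept per interval

r = redis.Redis()

def add_vote(item_id, points=20.0):
    # Step 1: update the score and push this vote's next decay time to the tail.
    r.incrbyfloat(f"score:{item_id}", points)
    r.rpush(f"votes:{item_id}", f"{time.time() + INTERVAL}:{points}")
    decay_head(item_id)

def decay_head(item_id):
    # Steps 2-5: pop from the head until we hit a vote that is not due yet.
    while True:
        head = r.lpop(f"votes:{item_id}")
        if head is None:
            return
        due, points = (float(x) for x in head.decode().split(":"))
        if due > time.time():
            r.lpush(f"votes:{item_id}", head)             # not old enough: put it back
            return
        kept = points * KEEP
        r.incrbyfloat(f"score:{item_id}", kept - points)  # subtract the decayed part
        if kept > 0.01:                                   # drop nearly worthless votes
            r.rpush(f"votes:{item_id}", f"{due + INTERVAL}:{kept}")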
It's late here so I'm hoping someone can check my math. I think this is equivalent to exponential decay.
MySQL's BIGINT has an unsigned maximum of 2^64 - 1.
For simplicity, let's use 1 day as our time interval. Let n be the number of days since the site launched.
Create an integer variable, let's call it X, and start it at 0.
If an add operation would bring a score over 2^64, first update every score by dividing it by 2^n, then set X equal to n.
On every vote, add 2^(n-X) to the score.
So, mentally, this makes better sense to me in base 10. As we add things up, our number gets longer and longer. We stop caring about the numbers in the lower digit places because the values we're incrementing scores by have a lot of digits, which means the lower digits stop counting for very much. So if they don't count, why not just slide the decimal place over to a point we care about and truncate the digits below it? To do this, we need to slide the decimal place over on the amount we're adding each time as well.
I can't help but feel like there's something wrong with this.
Here are two possible pseudo-queries that you could use. I know that they don't really address scalability, but I think they do give you a way to compute the decayed score on demand, without storing or destroying anything:
SELECT article.title AS title, SUM(vp.point) AS points
FROM article
LEFT JOIN (SELECT 1 / (DATEDIFF(NOW(), vote.created_at) + 1) AS point,  -- +1 avoids dividing by zero for today's votes
                  article_id
           FROM vote) AS vp
  ON vp.article_id = article.id
GROUP BY article.title
or (not in a join, which will be a bit faster I think, but harder to hydrate),
SELECT SUM(1 / (DATEDIFF(NOW(), created_at) + 1)) AS points, article_id
FROM vote
WHERE article_id IN (...)
GROUP BY article_id
The benefit of these queries is that they can be run at any time with the same data and they will always return the same answers. They don't destroy any data.
If you need to, you can also run the queries in a background job and they will still give the same result.
