Parse credit card statement - parsing

I'd love to develop my custom software to monitor my monthly expenses, but I'm struggling on the first step. Rather than having to input any expense into a big spreadsheet every now and then, I'd like to parse my banking credit card statement, to get as much information as possible.
From the most important to the least, here's what I'd like to know about each expense: price, category (food / travel / home...), date, store name, location.
However, I don't know how to guess the category / shop corresponding to a given statement line. If I understand correctly, there is no real convention for the text, though most often it's the name of the owning company, which might not be the name of the store.
05/06 CB SAPPORO 10 -> 10$, Restaurant, 05/06, _name_of_the_restaurant_
Tbh I'm surprised that I couldn't find anything online; it seems like something many people might want to do. Is there any kind of standard database or existing library to do that? How do the services which do that do it?
Any help / redirection would be greatly appreciated :)
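For what it's worth, the first step (splitting a line like the example above and guessing a category from a keyword table) could be sketched roughly like this. The assumed line layout, the KNOWN_KEYWORDS table and its single entry are made up for illustration; real statements vary by bank and there is no standard format behind this.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical keyword -> category table; in practice it would be built up by
# hand or from previously labelled statement lines.
KNOWN_KEYWORDS = {
    "SAPPORO": "Restaurant",
}

# Assumed layout: "DD/MM <payment type> <free-text label> <amount>".
LINE_RE = re.compile(r"^(\d{2}/\d{2})\s+(\S+)\s+(.+?)\s+(\d+(?:[.,]\d{1,2})?)$")


class Expense(NamedTuple):
    date: str
    label: str
    amount: float
    category: Optional[str]


def parse_line(line: str) -> Optional[Expense]:
    """Parse one statement line; return None if it does not fit the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    date, _payment_type, label, raw_amount = match.groups()
    amount = float(raw_amount.replace(",", "."))
    category = None
    for keyword, guess in KNOWN_KEYWORDS.items():
        if keyword in label.upper():
            category = guess
            break
    return Expense(date, label, amount, category)
```

Lines that do not match come back as None, so they can be queued for manual labelling and fed back into the keyword table.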

Related

Inventory Management: How do I handle sold inventory units in the database?

i sell liquor. so i have an inventory of bottles. so far i have an "InventoryUnit" model which references product and line_item.
should every single bottle be stored as an individual InventoryUnit object in my database?
what's the best practice to decrease my inventory? if i sell a bottle, do i destroy an InventoryUnit? or should i just add a status-column that can be "sold" or "in-stock"?
i'm worried for performance, can Postgres handle hundreds of thousands of InventoryUnit objects?
i'd really appreciate some help on this one. sorry, i'm a frontend-guy so i really suck at database-modelling…
One. should every single bottle be stored as an individual InventoryUnit object in my database?
If you can sell them individually, then yes; else track them by the case/box.
Two. what's the best practice to decrease my inventory? if i sell a bottle, do i destroy an InventoryUnit? or should i just add a status-column that can be "sold" or "in-stock"?
Use the concepts of Locations and Movements (a movement should be its own entity). OpenERP for example uses "virtual locations" similar to this.
Bottle smashes? Move it from its inventory location to the "damaged" location
Bottle went missing? Move it from inventory to the "ether" location
Found a random bottle? Move it from "ether" to inventory
Sold a bottle? Move it from inventory to "sold"
Bought a bottle? Move it from purchased to inventory
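The moves listed above can be sketched as an append-only ledger from which stock is derived. This is only an illustration: the Movement class, the SKU string and the location names used as plain strings are assumptions, not OpenERP's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Movement:
    sku: str
    source: str       # location the units leave
    destination: str  # location the units arrive at
    quantity: int = 1


def stock_by_location(movements: List[Movement], sku: str) -> Counter:
    """Derive the current stock per location by replaying every movement."""
    stock = Counter()
    for m in movements:
        if m.sku != sku:
            continue
        stock[m.source] -= m.quantity
        stock[m.destination] += m.quantity
    return stock


ledger = [
    Movement("ciroc-1.5l", "purchased", "inventory", 12),  # bought a case
    Movement("ciroc-1.5l", "inventory", "sold"),            # sold a bottle
    Movement("ciroc-1.5l", "inventory", "damaged"),         # bottle smashed
    Movement("ciroc-1.5l", "inventory", "ether"),           # bottle went missing
    Movement("ciroc-1.5l", "ether", "inventory"),           # found it again
]
```

Nothing is ever deleted here; a sale is just one more row, which also keeps the history needed for returns.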
Three. i'm worried for performance, can Postgres handle hundreds of thousands of InventoryUnit objects?
Postgres can handle hundreds of billions of objects. Normalize properly. Use small data types. Use indexes.
Some other things to keep in mind:
You could sell something, and it's returned, and you put it back in inventory
You could buy something, but it ain't right, so you send it back to the seller
You could sell something that you don't own (on consignment, or not in inventory yet)
You might have something in inventory that is not currently for sale.
For accounting inventory, you also need to count the goods on inbound and outbound shipments that you're responsible for, based on the free-on-board (FOB) status.
You need to count raw goods (DIY winemaking stuff?) and works in progress if you make/assemble anything, as well as ordering costs, etc.
Consigned goods are not counted in accounting inventory.
You should track inventory at the lowest fungible level. In other words, when you are going to pick a single unit off a shelf, what's the most specific information that you need to know in order to get the right thing?
In your example, I couldn't just say, "Go get a bottle" or you may bring back wine instead of vodka. I also can't say, "Go get a bottle of vodka" because you might bring back Absolut when I want Ciroc. Finally, I can't say, "go get a bottle of Ciroc" because you might bring back the 1L size when I wanted the 1.5L size.
I could say "Go get the third bottle from the left side in the front row of the bottom case of 1.5L Ciroc," but that would be silly because all 1.5L bottles of Ciroc are the same. (Flavors aside ;) ).
The sweet spot becomes your stock keeping unit (SKU). Thankfully, almost every company in the world has solved this for you already. Just use the UPC number under the barcode as your SKU.
Based on this, your models would be something like...
InventoryOnHand
- id:int
- product_id:int
- quantity:int
Product
- id:int
- sku:string
- name:string
You would then increase and decrease the InventoryOnHand quantity as things go in and out of stock.
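A rough sketch of that increase/decrease step, assuming one InventoryOnHand row per product. The receive and remove method names are invented for the example and are not part of any framework.

```python
from dataclasses import dataclass


@dataclass
class InventoryOnHand:
    product_id: int
    quantity: int = 0

    def receive(self, units: int) -> None:
        """Units coming into stock (purchase, customer return, ...)."""
        if units <= 0:
            raise ValueError("units must be positive")
        self.quantity += units

    def remove(self, units: int) -> None:
        """Units leaving stock (sale, breakage, ...)."""
        if units <= 0:
            raise ValueError("units must be positive")
        if units > self.quantity:
            raise ValueError("not enough stock on hand")
        self.quantity -= units
```

In a real database the decrement would have to be done atomically (for example in a single UPDATE), otherwise two simultaneous sales could both read the same quantity.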
Ah that's a difficult one to answer. There's no right way although different models will have consequences depending on usage.
If you're tracking low volume but highly information intensive stock (say, airplanes and parts) you may want to define an entry for every item; if you're modeling a mass of identical products with close to no identity (as your case seems to be) I would go with focusing on the stock status. It all depends on how deeply you want to track the item's life cycle.
Ask yourself "is it worth tracking this instance of a box of crackers, or can I just track how much a pallet has affected my current stock?" Would a gas station create an entry for every liter of fuel?

Questions about implementing surrogate key in Ruby on Rails

For an upcoming project we need to have unique real world identifiers that are exposed to users for things like Account Numbers or Case Numbers (like a bug tracking ID). These will always be system generated and unchangeable. Right now we plan to run strictly on Heroku.
While (as my name would suggest) I am new to the wonderfulness that is Ruby on Rails, I have a long background in enterprise application development. I'm trying to bridge what I have done in the past with doing things the "RoR way".
Obviously RoR has wonderful primary key support. I have read dozens of posts here recommending to adapt business requirements to just use the out of the box id/key methodology.
So let me describe what I am trying to accomplish and please let me know if you have faced similar objectives and what approach you took.
1) Would like to have a human readable key with a consistent length. There is value in always having an Account ID or Transaction ID that is the same length (for form validation, training sales staff, etc.) Using Ruby's innate key generation one could just add buffer characters (e.g. 100000 instead of 1).
2) Compactness: My initial plan was to go with a base 36 unique key (e.g. 36 values [0..9],[a..z]). As part of our API/interface we plan on exposing certain non-confidential objects based on a shortform URL (e.g. xx.co/000001). I like the idea of being able to have a five character identifier in base 36 vs. 7+ in decimal.
So I can think of two possible approaches:
a) add my own field and develop my own unique key generator (or maybe someone will point me to one).
b) Pad leading digits (and I assume I can force the unique key generation to start at 1xxxxxxx rather than 0000001). Then use the to_s(36) method to convert it to and from base 36 for all interactions with humans. Maybe even store the actual ID value in the database in the base 36 format to avoid ongoing conversions, but always do the conversion before a query to avoid the need to have another index.
I'm leaning towards approach B, as it seems like it would be optimal from a DB performance standpoint and that it would require the least investment in non-value added overhead. Once again, any real world experience with these topics and thoughts on the best approach would be greatly appreciated.
Thanks in advance!
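To make the base 36 idea concrete, here is a small language-neutral sketch of fixed-width encoding and decoding. It is written in Python rather than with Ruby's to_s(36), and the function names and the default width of five are assumptions taken from the description above.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(number: int, width: int = 5) -> str:
    """Encode a non-negative integer as a zero-padded base 36 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    digits = ""
    while number:
        number, remainder = divmod(number, 36)
        digits = ALPHABET[remainder] + digits
    if len(digits) > width:
        raise ValueError("number does not fit in the requested width")
    return digits.rjust(width, "0")


def from_base36(text: str) -> int:
    """Decode a base 36 string (case-insensitive) back to an integer."""
    return int(text.lower(), 36)
```

Five base 36 characters cover 36**5 = 60,466,176 distinct values, which is the limit to keep in mind when fixing the length.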
I would never use the primary key in a Rails table for anything of business importance. There will come a day when someone on the business end will want to change it, and it'll end up being an enormous pain in the butt and will invalidate a bunch of URLs you and your users thought were canonical and will mess up all your foreign keys and blah blah blah. It's just a really bad idea and I would encourage you not to do it.
The Rails way to do this is have a new column, called something like number or bug_tracking_number or whatever strikes your fancy, and before_validation implement a callback that gives it a value. This is where you can let your creativity shine; something like this sounds like what you want:
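# Assign the human-facing number once, when the record is first created.
before_validation( :on => :create ) do
  # NOTE: count + 1 can collide under concurrent creates or after deletes,
  # so back this column with a unique index.
  self.number = CaseNumber.count + 1
end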
You can pad the number there, ensure its uniqueness, or do whatever else you want.

New(?) attempt to structure RESTful base URLs

We all love REST, especially when it comes to the development of APIs. Having done so for the last few years, I always stumble upon the same problem: nested resources. It seems we're living at the two ends of a scale. Let me introduce an example.
/galaxies/8/solarsystems/5/planets/1/continents/4/countries.json
Neato. Cases like that seem to happen everywhere, no matter in what shape they materialize. Now I'd like to be able to fetch all the countries in a solar system while still being able to fetch countries deeply scoped as shown above.
It seems I have two choices here. The first one, I flatten my nested structure and introduce a lot of GET parameters (that need to be well documented and understood by my API user) like so:
/countries.json?galaxy=8&solarsystem=5&planet=1&continent=4
I could flatten all my resources like that and gain a unique endpoint base URL for each one. Good point … unique endpoints per resource!
But what's the price? Something that does not feel natural, is not discoverable and does not behave like the tree structure below my resources. Conclusion: Bad idea, but well practiced.
On the other hand I could try to get rid of as many additional GET parameters as possible, creating endpoints like that:
/galaxies/8/solarsystems/5/countries.json
But I also needed:
/galaxies/8/solarsystems/5/planets/1/continents/4/countries.json
This seems to be the other end of the scale: the fewest additional GET parameters and more natural behavior, but still not what I would expect as an API user.
Most of the APIs I have worked with in the last year follow one paradigm or the other. It seems there is at least one bullet to bite. So why not do the following:
If there are resources that nest naturally, let's nest them exactly the way we'd expect them to be nested. What we achieve at first is a unique endpoint for every resource, as long as we stick to this:
/galaxies.json
/galaxies/8/solarsystems.json
/galaxies/8/solarsystems/5/planets.json
/galaxies/8/solarsystems/5/planets/1/continents.json
/galaxies/8/solarsystems/5/planets/1/continents/4/countries.json
OK, but how do we solve the initial problem? I wanted to fetch all the countries in a solar system while still being able to fetch countries fully scoped under galaxies, solar systems, planets and continents. Here's what feels natural to me:
/galaxies/8/solarsystems/5/planets/0/continents/0/countries.json # give me all countries in the solarsystem 5
/galaxies/8/solarsystems/0/planets/0/continents/0/countries.json # give me all countries in the galaxy 8
… and so on, and so on. Now you may argue "ok, but the zero there ….." and you are right. Does not look really nice. So why not change the two upper calls to something like that:
/galaxies/8/solarsystems/5/planets/all/continents/all/countries.json # give me all countries in the solarsystem 5
/galaxies/8/solarsystems/all/planets/all/continents/all/countries.json # give me all countries in the galaxy 8
Neat, eh? So what do we achieve? No additional GET parameters and still a stable base URL for each resource's endpoint. What's the price? Longer URLs at least, especially when testing by hand with tools like curl.
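To illustrate how a server could interpret such URLs, here is a minimal sketch that turns a fully nested path into the target resource plus a set of filters, treating "all" as "no constraint at this level". The function name and the returned shape are assumptions for the example, not part of any framework.

```python
from typing import Dict, Tuple


def parse_scoped_path(path: str) -> Tuple[str, Dict[str, int]]:
    """Split a nested path into (resource, filters), skipping 'all' segments."""
    trimmed = path.strip("/")
    if trimmed.endswith(".json"):
        trimmed = trimmed[: -len(".json")]
    segments = trimmed.split("/")
    if len(segments) % 2 == 0:
        raise ValueError("expected /parent/id/.../resource")
    resource = segments[-1]
    filters: Dict[str, int] = {}
    # Pair every parent collection name with the id that follows it.
    for name, value in zip(segments[0:-1:2], segments[1:-1:2]):
        if value != "all":
            filters[name] = int(value)
    return resource, filters
```

With this, the "all" variant and the fully scoped variant go through the same code path and differ only in how many filters come back.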
I wonder whether this could be a way to improve not only the maintainability but also the ease of use of APIs. If so, why doesn't anyone take an approach like that? I cannot imagine being the first one to have that idea, so there must be valid counterarguments against it. I don't see any. Do you?
I would really like to hear your opinions and arguments for or against an approach like that. Maybe there are ideas for improvement … it would be great to hear from you. In my opinion this could lead to much better structured APIs, so hopefully someone will read this and reply.
Regards.
Jan
It would all depend upon how the data is presented. Would the user really need to know the galaxy # to find a specific country? If so, then what you propose makes sense. However, it seems to me that what you are proposing, while structured and presented well, doesn't allow clients to search for a child element unless the parent is a known quantity.
In your example, if I had a specific id for a continent, I would need to know the planet, solar system and galaxy as well. In order to find the specific continent, I would need to get all for each possible parent until I found the continent.
Presenting structured data in this manner is fine. Using this structure when you only have a piece of the data may be a bit cumbersome. It all depends upon what you are trying to accomplish.
Nested resource URLs are usually bad. The approach I generally take is to use unique IDs.
Design your DB so that it is only going to have one continent with ID 4. Then, instead of the horrible /galaxies/8/solarsystems/5/planets/1/continents/4/countries.json, all you need is the simple /continents/4/countries.json. Clear, sufficient, and memorable.
The :shallow routing option in Rails does this automatically.
For "all countries in a solar system", I'd use /solar_systems/5/countries.json -- that is, don't try to shoehorn it into the generic URL scheme. (And note the underscore.)

Help me understand mnesia (NoSQL) modeling

In my quest to understand Mnesia, I still struggle with thinking in relational terms. So I will put my struggles up here and ask for the best way to solve them.
one-to-many-relations
Say I have a bunch of people,
-record(contact, {name, phone}).
Now, I know that I can define phone to always be saved as a list, so people can have multiple phone numbers, and I suppose that's the way to do it (is it? How would I then look this up the other way around, say, finding a name to a number?).
many-to-many-relations
now let's suppose I have multiple groups I can put people in. The group names don't have any significance, they are just names; the concept is "unix system groups" or "labels". Naively, I would model this membership as a proplist, like
{groups, [{friends, bool()}, {family, bool()}, {work, bool()}]} %% and so on...
as a field within the "contact" record from above, for example. What is the best way to model this within mnesia if I want to be able to look up all members of a group quickly by group name, and also want to be able to look up all groups an individual is registered in? I could also just model this as a list containing only the group identifiers, of course. For use with mnesia, what is the best way to model this?
I apologize if this question is dumb. There's plenty of documentation on mnesia, but it's lacking (IMO) some good examples for the overall use.
For the first example, consider this record:
-record(contact, {name, phone = []}). %% phone holds a list of phone numbers
contact is a record with two fields, name and phone, where phone is a list of phone numbers. As user425720 said, it could make sense to store these as something other than strings if, for example, you have extreme requirements for a small storage footprint.
Now here comes the part that is hard to "get" with key-value stores: you need to also store the inverse relationship. In other words, you need something similar to the following:
-record(phone, {phonenumber, contactname}).
If you have a layer in your application to abstract away database handling, you could make it always add/change the phone records when adding/changing a contact.
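The idea of such an abstraction layer can be sketched like this, with two plain dictionaries standing in for the contact and phone tables. This is not Mnesia code; ContactStore and its method names are invented for the example.

```python
from typing import Dict, List, Optional


class ContactStore:
    """Keeps the forward (name -> phones) and inverse (phone -> name) maps in sync."""

    def __init__(self) -> None:
        self._phones_by_name: Dict[str, List[str]] = {}
        self._name_by_phone: Dict[str, str] = {}

    def put_contact(self, name: str, phones: List[str]) -> None:
        # Drop stale inverse entries before writing the new ones.
        for old in self._phones_by_name.get(name, []):
            self._name_by_phone.pop(old, None)
        self._phones_by_name[name] = list(phones)
        for phone in phones:
            self._name_by_phone[phone] = name

    def phones_of(self, name: str) -> List[str]:
        return self._phones_by_name.get(name, [])

    def name_for(self, phone: str) -> Optional[str]:
        return self._name_by_phone.get(phone)
```

In Mnesia both writes would presumably go inside one transaction so the two tables cannot drift apart.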
--
For the second example, consider these two records:
-record(contact, {uuid, name, group_ids = []}). %% list of group uuids
-record(group, {uuid, name, contact_ids = []}). %% list of contact uuids
The easiest way is to just store ids pointing to the related records. As Mnesia has no concept of referential integrity, this can become out of sync if you for example delete a group without removing that group from all users.
If you need to store the type of group on the contact record, you could use the following:
-record(contact, {name, groups = []}). %% e.g. [{family, [GroupId1, GroupId2]}, {work, [..]}]
--
Your second problem could also be solved by using an intermediate record, which you can think of as "membership".
-record(contact, {uuid, name, ...}).
-record(group, {uuid, name, ...}).
-record(membership, {contact_uuid, group_uuid}). %% must use 'bag' table type
There can be any number of "membership" records: one for every group a user belongs to.
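As a rough illustration of how the membership records answer both questions (all groups of a contact, all members of a group), here is a sketch with a set of pairs standing in for the bag table. It is not Mnesia code, and the class and method names are assumptions.

```python
from typing import Set, Tuple


class Memberships:
    """Stand-in for a 'bag' table of (contact_uuid, group_uuid) pairs."""

    def __init__(self) -> None:
        self._pairs: Set[Tuple[str, str]] = set()

    def add(self, contact_uuid: str, group_uuid: str) -> None:
        self._pairs.add((contact_uuid, group_uuid))

    def remove(self, contact_uuid: str, group_uuid: str) -> None:
        self._pairs.discard((contact_uuid, group_uuid))

    def groups_of(self, contact_uuid: str) -> Set[str]:
        return {g for c, g in self._pairs if c == contact_uuid}

    def members_of(self, group_uuid: str) -> Set[str]:
        return {c for c, g in self._pairs if g == group_uuid}
```

Both lookups scan every pair here; in a real table, the lookup by group would want an index on the second field to stay fast.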
First of all, you ask for key-value store design patterns. Perfectly fine.
Before I try to answer your question, let's make it clear what Mnesia is. It is a k-v DB which is included in OTP. Because it is native, it is very comfortable to use from Erlang. But be careful. This is an old database with very ancient assumptions (e.g. data distribution with linear hashing). So go ahead, learn and play with it, but for production take your time and browse the NoSQL shop to find the best fit for your needs.
#telephone example. Do not store stuff as strings (list()) - it is very heavy for GC. I would make a couple of fields like phone_1 :: <<binary>>, phone_2 :: <<binary>>, phone_extra :: [<<binary>>] and build an index on the most frequently queried field. Also, mnesia indices are tricky - when a node crashes and comes back up, they need to rebuild themselves (it can take an awfully long time).
#family example. It is quite hard with a flat namespace. You may play with more complex keys. Maybe create a separate table for TheGroup and keep the identifiers of its members? Or each member could hold the ids of the groups he belongs to (hard to maintain). If you want to recognize friends, I would implement some sort of contract before presenting data (A is B's friend iff B is A's friend) - this approach would cope with eventual consistency and conflicts in data.

How would you design a hackable url

Imagine you had a group of product categories organized in a nice tree hierarchy and you wanted to provide hackable urls to browse these. You could do something like this
/catalog/categorya/categoryb/categoryc
You could then quite easily figure out which category you should list the products for (note that the full URL is needed since you could have categories with the same name but at different locations in the hierarchy)
Now what would be a good approach to add product information in that as well? To give you an example, you wanted to display the product Oblivion for this category
/catalog/games/consoles/playstation/adventure
It's tempting to just add the product at the end of the url
/catalog/games/consoles/playstation/adventure/oblivion
but the moment you do so you lose the ability to know whether it's a category or a product which is called oblivion. I personally feel that not being forced to add a suffix such as .html
/catalog/games/consoles/playstation/adventure/oblivion.html
would be the nicest solution. Another option is using some sort of prefix, such as
/catalog/games/consoles/playstation/adventure/product:oblivion
You could also add some sort of trigger like
/catalog/games/consoles/playstation/adventure/PRODUCT/oblivion
This is not as nice either, and you would (even though it's very unlikely to be a problem) prevent yourself from having a category called product.
So far a suffix solution looks like the most user-friendly approach that I can think of off the top of my head, but I'm not fond of having to use an extension.
What are your thoughts on this?
Deep paths irk me. They're hideous to share.
/product/1234/oblivion --> direct page
/product/oblivion --> /product/1234/oblivion if oblivion is a unique product,
--> Disambiguation page if oblivion is not a unique product.
/product/1234/notoblivion -> /product/1234/oblivion
/categories/79/adventure --> playstation adventure games
/categories/75/games --> console games page
/categories/76/games --> playstation games page
/categories/games --> Disambiguation Page.
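The product rules above can be sketched as a small resolver that either serves the page, redirects to the canonical URL, or hands over to a disambiguation page. The PRODUCTS table, its ids other than 1234, and the returned tuples are hypothetical and only there to make the example runnable.

```python
from typing import Dict, Tuple

# Hypothetical catalog: product id -> canonical slug.
PRODUCTS: Dict[int, str] = {1234: "oblivion", 5678: "oblivion", 42: "morrowind"}


def resolve_product(path: str) -> Tuple[str, object]:
    """Return ("ok" | "redirect" | "disambiguate" | "not_found", payload)."""
    parts = path.strip("/").split("/")
    if len(parts) == 3 and parts[0] == "product" and parts[1].isdigit():
        product_id = int(parts[1])
        slug = PRODUCTS.get(product_id)
        if slug is None:
            return ("not_found", path)
        canonical = f"/product/{product_id}/{slug}"
        # The id decides; a wrong slug is simply corrected by a redirect.
        return ("ok", canonical) if parts[2] == slug else ("redirect", canonical)
    if len(parts) == 2 and parts[0] == "product":
        matches = sorted(pid for pid, slug in PRODUCTS.items() if slug == parts[1])
        if len(matches) == 1:
            return ("redirect", f"/product/{matches[0]}/{parts[1]}")
        if matches:
            return ("disambiguate", matches)
    return ("not_found", path)
```

The slug is purely decorative in this scheme, so renaming a product never breaks an old link.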
Otherwise, the long urls, while seeming hackable, require you to get all node elements right to hack it.
Take php.net
php.net/str_replace --> goes to
http://nz2.php.net/manual/en/function.str-replace.php
And this model is so hackable people use it all the time blindly.
Note: The .html suffix is regarded by the W3C as functionally meaningless and redundant, and should be avoided in URLs.
http://www.w3.org/Provider/Style/URI
Let's dissect your URL in order to be more DRY (non-repetitive). Here is what you are starting with:
/catalog/games/consoles/playstation/adventure/oblivion
Really, the category adventure is redundant as the game can belong to multiple genres.
/catalog/games/consoles/playstation/oblivion
The next thing that strikes me is that consoles is also not needed. It probably isn't a good idea to differentiate between PCs and console machines as a subsection. They are all types of machines, and by doing this you are just adding another level of complexity.
/catalog/games/playstation/oblivion
Now you are at the point of making some decisions about your site. I would recommend removing the playstation category from the path, and also the games category, as a game can exist across multiple platforms. Your url should look like:
/catalog/oblivion
So how do you get a list of all the adventure games for the Playstation?
/catalog/tags/playstation+adventure
or perhaps
/catalog/tags/adventure/playstation
The order doesn't really matter. You also have to make sure that tags is a reserved name that no product can use.
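A minimal sketch of why the order does not matter: both URL styles can be reduced to the same set of tags before querying. The /catalog/tags/ prefix is taken from the examples above; the function name is an assumption.

```python
def tags_from_path(path: str) -> frozenset:
    """Reduce /catalog/tags/a+b or /catalog/tags/a/b to an unordered tag set."""
    prefix = "/catalog/tags/"
    if not path.startswith(prefix):
        raise ValueError("not a tag listing URL")
    # Treat "+" and "/" as equivalent separators.
    raw = path[len(prefix):].replace("+", "/")
    return frozenset(tag for tag in raw.split("/") if tag)
```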
Lastly, I am assuming that you cannot remove the root /catalog due to conflicts. However, if your site is tiny and doesn't have many other sections then reduce everything to the root level:
/oblivion
/tags/playstation/adventure
Oh, and if oblivion isn't a unique product, just construct a slug which includes its ID:
/1234-oblivion
Those all look fine (except for the one with the colon).
The key is what to do when they guess wrong -- don't send them to a 404 -- instead, take the words you don't know and send them to your search page results for that word -- even better if you can spell check there.
If you see the different pieces as targets then the product itself is just another target.
All targets should be accessible by target.html or only target.
catalog/games/consoles/playstation.html
catalog/games/consoles/playstation
catalog/games/consoles/playstation/adventure.html
catalog/games/consoles/playstation/adventure
catalog/games/consoles/playstation/adventure/oblivion.html
catalog/games/consoles/playstation/adventure/oblivion
And so on to make it consistent.
My 5 cents...
One problem is that your user's notion of a "group of product categories organized in a nice tree hierarchy" may not match yours.
Here's a Google Tech Talk by David Weinberger on "Everything is Miscellaneous", with some interesting ideas on categorizing stuff:
http://www.youtube.com/watch?v=x3wOhXsjPYM
@Lou Franco yeah, either method needs a sturdy fallback mechanism, and sending it to some sort of suggestion page or search engine would be good candidates.
@Stefan the problem with treating both as targets is how to distinguish them (like I described). The worst-case scenario is that you first hit your database to see if there is a category which satisfies the path, and if there isn't, you then check if there is a product which does. The problem is that for each product path you will end up making a useless call to the database to make sure it's not a category.
@some yeah, a delimiter could be a possible solution, but then a .html suffix is more user-friendly and more commonly known.
I like /videogames/consolename/genre/title and using the number of /'s to distinguish between category and product. The only thing I would be worried about is multi-genre (or hard to distinguish genre) titles. I highly recommend no extension on the title. You could also do something like videogames(.php)?c=x360;t=oblivion; and just guess the missing information; however, I like the / method as it looks neater. Why are you adding genre? It may be easier to use the first letter of the title or just to do videogame/console/title/
My humble experience, although not related to selling games, tells me:
editors often don't use the best names for these "slugs"; they don't choose them wisely.
many items belong (logically) to several categories, so why restrict them (technically) to a single category?
Better design item urls by ids, (i.e. /item/435/ )
ids are stable (generated by the db, not editable by the editor), so the url stands a much bigger chance at not being changed over time
they don't expose (or depend on) the organization of the objects in the database like the category/item_name style of urls does. What if you change the underlying design (object structure) to allow an item to belong to multiple categories? the category/item urls suddenly won't make sense anymore; you'll change your url design and old urls might not work anymore.
Labels are better than categories. That is to say, allowing an item to belong to several categories is a better approach than assigning one category to each item.
"the problem with treating both as targets is how to distinguish them (like I described). The worst-case scenario is that you first hit your database to see if there is a category which satisfies the path, and if there isn't, you then check if there is a product which does. The problem is that for each product path you will end up making a useless call to the database to make sure it's not a category."
So what? There's no real need to make a hard distinction between products and categories, least of all in the URI, except maybe a performance concern over an extra database call. If that's really such a big deal to you, consider these two suggestions:
Most page views will presumably be on products, not categories. So doing the check for a product first will minimize the frequency with which you need to double up on the database lookups.
Add code to your app to display the time taken to generate each page, then go out to the nearest internet cafe (not your internal LAN!) with a stopwatch. Bring up some pages from your site and time how long each takes to come up. Subtract the time taken to generate the page. Also compare the time taken to generate one-database-lookup pages vs. two-database-lookup pages. Then ask yourself, when it takes maybe 1-2 seconds total to establish a network connection, generate the content, and download the content, does it really matter whether you're spending an extra 0.05 second or less for an additional database lookup or not?
Optimize where it matters, like making URLs that will be human-friendly (as in Chris Lloyd's answer). Don't waste your time trying to shave off the last possible fraction of a percent.
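The product-first lookup suggested above could be sketched like this, with the two database calls passed in as plain functions. The names and the search fallback (borrowed from the "don't send them to a 404" suggestion elsewhere in this thread) are assumptions for illustration.

```python
from typing import Callable, Optional, Tuple

Lookup = Callable[[str], Optional[str]]


def resolve(path: str, find_product: Lookup, find_category: Lookup) -> Tuple[str, str]:
    """Try the product lookup first, then the category, then fall back to search."""
    product = find_product(path)
    if product is not None:
        return ("product", product)
    # Only category pages and misses pay for the second lookup.
    category = find_category(path)
    if category is not None:
        return ("category", category)
    # Unknown path: search for its last segment instead of returning a 404.
    return ("search", path.rstrip("/").rsplit("/", 1)[-1])
```

If most page views really are product pages, the second lookup only happens for the minority of requests.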
