File Path Name or URL analysis - machine-learning

I am looking for information on tools, methods, and techniques for analyzing file path names. I am not talking about file size, read/write times, or file types, but about analysis of the path or URL itself.
I am only aware of basic word-frequency text tools and methods, but I am wondering if there is something more advanced that people apply to paths to try and mine extra information out of them.
Thanks!
UPDATE:
Here is the most narrow example of what I would want. OK, so I have some full path names as strings like this:
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File1.doc
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File2.doc
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File3.doc
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File4.doc
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File5.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File1.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File2.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File3.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File4.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File5.doc
What I want to know is that the folder MapShedMaps appears "uniquely" 2 times. If I did a simple frequency count on the strings I would get 10 appearances. The issue is that I don't know at what level of the directory tree this matters, so I would like a unique count at each level of the directory, as I am describing.
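Here is a sketch of the kind of per-level "unique" count I mean, in Python (any language would do); it splits each path on the backslash and counts, for each directory depth, how many distinct parent paths a given folder name appears under:

```
from collections import defaultdict

paths = [
    r"F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File1.doc",
    r"F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File1.doc",
    # ... plus the rest of the sample paths above
]

# For every directory depth, map folder name -> set of distinct parent paths.
levels = defaultdict(lambda: defaultdict(set))
for path in paths:
    parts = path.split("\\")[:-1]          # drop the file name, keep drive + folders
    for depth, name in enumerate(parts):
        parent = "\\".join(parts[:depth])  # everything above this folder
        levels[depth][name].add(parent)

for depth in sorted(levels):
    for name, parents in levels[depth].items():
        print(f"level {depth}: {name} appears uniquely {len(parents)} time(s)")
```

Run over the paths above, this reports MapShedMaps as appearing "uniquely" 2 times at its level, even though it occurs in 10 strings.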

This is an extremely broad question, so it is difficult for me to give you an "answer" per se, but I will give you my first thoughts on it.
First,
the regular expression class in .NET is extremely useful for parsing large amounts of information. It is so powerful that it will easily confuse the impatient, but once mastered it can be used across text editors, .NET, and pretty much any other respectable language. It would allow you to search your strings and separate them into directories. This could be overkill depending on how you use it, but it's a thought. Here is a favorite link of mine for trying out regular expressions.
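As a tiny illustration (Python's re module here, but the equivalent works with .NET's Regex class), one expression is enough to separate a full path into drive, directories, and file name:

```
import re

path = r"F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File1.doc"

# Split on backslashes: drive first, file name last, folders in between.
drive, *folders, filename = re.split(r"\\", path)
print(drive)     # F:
print(folders)   # ['Task_Order_Projects', 'TO_01_NYS', 'Models', 'MapShedMaps']
print(filename)  # Random_File1.doc
```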
Second,
You will need a database; I prefer to use SQL. Look into how to connect to and create databases. In this database you can store all the fields extracted from each original path you enter, such as the parent directory, child directory, and common file types accessed. Just have a field for each of these, and through queries you can form a hypothesis about redundancy.
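A minimal sketch of that idea, using Python with SQLite just to keep it self-contained (the table and column names are placeholders; swap in SQL Server or whatever you have on hand):

```
import sqlite3

conn = sqlite3.connect("paths.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS path_parts (
        full_path   TEXT,
        depth       INTEGER,
        folder_name TEXT,
        parent_path TEXT
    )
""")

def store(full_path):
    parts = full_path.split("\\")[:-1]             # folders only, drop the file name
    for depth, name in enumerate(parts):
        parent = "\\".join(parts[:depth])
        conn.execute(
            "INSERT INTO path_parts VALUES (?, ?, ?, ?)",
            (full_path, depth, name, parent),
        )
    conn.commit()

store(r"F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File1.doc")

# "Unique" appearances of each folder name at each level:
for row in conn.execute("""
        SELECT depth, folder_name, COUNT(DISTINCT parent_path)
        FROM path_parts GROUP BY depth, folder_name"""):
    print(row)
```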
Third,
I don't know if it's easily accessible, but you might look into whether Windows stores a history of accessed files. It seems to have some inkling as to which files have been opened in the past, so there may be a resource in Windows that already stores much of the information you would otherwise be keeping in your database. If you could find a way to access that information, parse it with regular expressions, and feed it into your application's database, you could control the WORLD! j/k... You could get a pretty good prediction of user access patterns, though.
Fourth,
I always try to stick with what I have available. If .NET is sitting in front of you, hammer away at what you're trying to do; even if you hit a wall, at least you're making forward progress. With today's move toward object-oriented programming, you can usually convert data collected by one program into a format acceptable to another. You just have to dig a little.
Oh, and by the way, Coursera is currently offering a free class on machine learning and algorithms. You might want to check it out or reference it for prediction formulas.
Good Luck.

I wanted to post this as a comment, but SO kept editing the double \\ down to a single \, and it is important that there are two: \ is a special character in regex, and without another \ to escape it, regex will interpret it as the start of an escape sequence rather than a literal backslash.
Hey, I just wanted to let you know I've been playing with some regex. I know a pretty easy way to code this up in VB.NET, and I'll post that as my second answer, but I wanted you to check out capturing groups and back-references. If the part between parentheses matches, it captures that text and moves on to the next group, for instance:
F:\\(directory1)?(directory2)?(directory3)?
You could use these matches to find out how many directories each parent directory has under it. Are you following me? Here is a reference.
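Sketched in Python for brevity (the same groups behave identically in VB.NET's Regex.Match); the path is one of the samples from the question and the pattern is only an illustration:

```
import re

path = r"F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File1.doc"

# Each parenthesised group captures one directory level, if present.
match = re.match(r"F:\\([^\\]+)\\?([^\\]+)?\\?([^\\]+)?", path)
if match:
    print(match.groups())  # ('Task_Order_Projects', 'TO_01_NYS', 'Models')
```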

Related

reading/parsing common lisp files from lisp without all packages available or loading everything

I'm doing a project which involves parsing the histories of common lisp repos. I need to parse them into list-of-lists or something like that. Ideally, I'd like to preserve as much of the original source file syntax as possible, in some way. For example, in the case of the text #+sbcl <something>, which I think means "If our current lisp is sbcl, read <something>, otherwise skip it", I'd like to get something like (#+ 'sbcl <something>).
I originally wrote a LALR parser in Python, which sort of worked, but it's not ideal for many reasons. I'm having a lot of difficulty getting correct output, and I have tons of special cases to add.
I figured that what I should really do is use lisp itself, since it already has a lisp parser built in. If I could just read a file into sexps, I could dump it into something (cl-json would do) for further processing down the line.
Unfortunately, when I attempt to read https://github.com/fukamachi/woo/blob/master/src/woo.lisp, I get the error
There is no package with the name WOO.EV.TCP
which is of course coming from line 80 of that file, since that package is defined in src/ev/tcp.lisp, and we haven't read it.
Basically, is it possible to just read the file into sexps without caring whether the packages are defined or if they contain the relevant symbols? If so, how? I've tried looking at the hyperspec reader documentation, but I don't see anything that sounds relevant.
I'm out of practice with actually writing common lisp, but it seems potentially possible to hack around this by handling the undefined package condition by creating a blank package with that name, and handling the no-symbol-of-that-name-in-package condition by just interning a given symbol. I think. I don't know how to actually do this, I don't know if it would work, I don't know how many special cases would be involved. Offhand, the first condition is called no-such-package, but the second one (at least in sbcl) is called simple-error, so I don't even know how to determine whether this particular simple-error is the no-such-symbol-in-that-package error, let alone how to extract the relevant names from the condition, fix it, and restart. I'd really like to hear from a common lisp expert that this is the right thing to do here before I go down the road of trying to do it this way, because it will involve a lot of learning.
It also occurs to me that I could fix this by just sed-ing the file before reading it. E.g. turning woo.ev.tcp:start-listening-socket into, say, woo.ev.tcp===start-listening-socket. I don't particularly like this solution, and it's not clear that I wouldn't run into tons more ugly special cases, but it might work if there's no better answer.
I am almost sure there is no easy portable way to do this for a number of reasons.
(Just limiting things to the non-existent-package problem for now.)
First of all there is no portable access into the bit of the reader which decides that tokens are going to be symbols and then looks for package markers &c: that just happens according to the rules in section 2.3 of the standard. So you can't easily intervene in this.
Secondly, no condition the reader might signal portably carries enough information for you to handle it.
There are several possible ways out of this bit of the problem.
If you felt sufficiently heroic you might be able to teach the reader that all of the token-starting characters are in fact things you control and then write a token-reader that somehow deals with the whole package thing by returning some object which isn't a symbol. But to do that you need to deal with numbers, and if you think that's simple, well, it's not.
If you felt less heroic you could write a more primitive token-reader which just doesn't even try to deal with anything except grabbing all the characters needed, and returns some kind of object which wraps a string. This would avoid the whole number problem at the cost of losing a lot of information.
If you don't care about portability, find an implementation, understand how its reader does it, and muck around with it. There are more open source or source-available implementations than I can easily count (perhaps I am not very good at counting) so this is a pretty good approach. It's certainly what I'd do.
But this is only the start of the problems. The CL reader is hairy and, in its standard configuration (the configuration which is used for things like compile-file unless people have arranged otherwise) can run completely arbitrary code at read time, including code which modifies the reader itself, some of which may do so in an implementation-dependent way. And people use this: there's a reason Lisp is called the 'programmable programming language' and it's that people program it.
I've decided to solve this using sed (actually Python's re.sub, but who's counting?) because it'll work for my actual use case, and it was easy.
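Roughly, the substitution I have in mind looks like the sketch below (file names are placeholders, and the pattern is naive: it will also rewrite colons inside strings and comments, which was acceptable for my use case):

```
import re

source = open("src/woo.lisp").read()

# Rewrite package-qualified symbols such as woo.ev.tcp:start-listening-socket
# (or the :: form) into woo.ev.tcp===start-listening-socket so the reader
# never sees an unknown package.
rewritten = re.sub(
    r"([A-Za-z0-9.+*/<>=!?-]+)::?([A-Za-z0-9.+*/<>=!?-]+)",
    r"\1===\2",
    source,
)

open("woo-rewritten.lisp", "w").write(rewritten)
```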
For future readers: The various people saying this is impossible in general are probably right. The other questions posted by @Svante look like good, easy ways to solve part of the problem. Other parts of the problem might be solved more elegantly by replacing the reader macros for #., #+, #-, etc. with ones which just make a list, which sounds less heroic than the suggestions from @tfb, but I don't have time for that shit.

How to use neo4j effectively for serious, repeatable analysis over time

New to Neo4j, and I love the browser for exploratory work. But I'm unsure how best to use it to achieve, for lack of a better term, real work. Consider a sample project involving:
Importing 4 different CSV files
Creating appropriate relationships between nodes
Doing a variety of complex queries to derive data that I'll export for statistical analysis using another program.
I need to be able to replicate the project in the future, as well as adding new data, calculating different derived data, etc. I also need to be able to share the code so others can extend/verify it.
For non-relational data, I'd use something like R, Stata, or SAS. While each allows interactive exploration like the Neo4j browser, I'd never use that for serious analysis. Instead, I'd save a file or files of commands that I could modify and rerun whenever I needed to.
Neo4j's browser doesn't seem to support any of this functionality. Unless I am missing something, it doesn't even allow one to save a "session" along the lines of an IPython/Jupyter notebook. I know that there is a neo4j-shell, but especially since they have dropped it from the standard desktop installation (and gotten rid of the console), I feel like I must be doing something wrong--or at least contrary to the designers' intent--if I can't do serious work in the browser. Clearly, lots of people are.
Can anyone point me in the right direction? How does one best develop an extensive, replicable project over time with neo4j? Thank you.
You can take your pick of several officially supported language drivers to integrate Neo4j into basically any other project structure, including Jupyter notebooks. I'm not sure what exactly you mean by "serious work", or where you got the idea that people did lots of it in the browser, but you are definitely able to save the results of a query from the browser in a variety of formats (pictures of the bubbles, result rows in a CSV, a JSON response) if you prefer to work that way, or you can pipe data very efficiently into another language and manage it there. I don't see why they would re-create presentation and/or project management tools when there are already so many good ones out there.
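For example, here is a minimal sketch with the official Python driver (the URI, credentials, and Cypher query are placeholders) that runs a query from a script or a Jupyter cell and hands the rows to whatever analysis code you like; a saved file of such scripts is exactly the replicable, rerunnable record you are describing:

```
from neo4j import GraphDatabase

# Connection details are placeholders; point these at your own server.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (p:Person)-[:KNOWS]->(friend)
RETURN p.name AS name, count(friend) AS friends
"""

with driver.session() as session:
    rows = [record.data() for record in session.run(query)]

driver.close()

# `rows` is now a plain list of dicts you can write to CSV, load into pandas, etc.
print(rows)
```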

Best way to link specific labels in the text of a webpage to a specific client login

I have a website I am developing that will be deployed to several different clients. All of the functionality is the same, and the vast majority of the language used is the same. However, some of the clients are in different industries, so specific words and phrases within some pages need to be changed based on the company of the individual logged into the site. What is the best way to accomplish this?
In the past I have seen people use string database tables but that seems rather cumbersome. I thought about using localization but I don't want another developer to get confused because it isn't a change in spoken languages.
For this you can use something like a word list. I don't know whether a word list is a well-known concept or not, but let me try to explain it to you.
You can store the information that distinguishes each login from the others, based on the company, in one table in your database, and in a second table map each default (English) word to the corresponding word you want to use for that company.
Now, I am assuming that these words do not change very often, so what you can do is load them into a convenient in-memory data structure at application start.
All the text you want to process then goes through a word-list processor, which is basically program code that identifies the group the login belongs to and the words to be replaced. It replaces those words according to the appropriate group and returns the transformed text, which you can display in the UI.
The advantage here is that once the data is loaded into the in-memory data structure, you don't need to read the values from your DB.
Moreover, if the word lists change, or if you want to give users the ability to change the words according to their preferences, you can modify the in-memory data structure directly and sync it back to the DB asynchronously later.
Also, since the mapping lookup happens directly in memory, it's faster than DB calls.
And since it's program code, typically a method or something similar, it's entirely up to you which text to process and which to ignore.
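A bare-bones sketch of that processor in Python (the replacement table would really be loaded from the two database tables described above; the group names and words here are made up):

```
import re

# Loaded once at application start: group -> {default word -> replacement}.
WORD_LISTS = {
    "healthcare": {"customer": "patient", "order": "case"},
    "retail":     {"customer": "shopper"},
}

def localize(text, company_group):
    """Replace default words with the ones configured for this client group."""
    words = WORD_LISTS.get(company_group, {})
    for default, replacement in words.items():
        # \b keeps us from rewriting substrings inside longer words.
        text = re.sub(rf"\b{re.escape(default)}\b", replacement, text)
    return text

print(localize("Your order has been created for the customer.", "healthcare"))
# -> "Your case has been created for the patient."
```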
This is a technique we used in our application when we had a similar requirement. I hope this suggested solution helps!
Better alternatives and suggestions are always welcome, since we would also like to improve our own solution to this problem. Thanks.

Get size of folder and all contents for each user

I have seen several SaaS sites where people store data and the site tracks how much data each user has on its servers, displaying that information to the user.
Using Ruby on Rails, how do some of these sites do this? It seems like enough sites do it that there must be a standard way I'm just not aware of. Or does everyone pretty much implement their own solution?
If everyone has implemented their own solution, then is this a good approach:
`du -s <directory>`
and just parsing the output?
It all depends on how the site stores the actual data, and the metadata about the objects its users upload. It is entirely possible that the data isn't being stored on a traditional file system at all (it might be in S3, for example), in which case du wouldn't work.
So if you store the metadata in the database, including the size, you can get the sum of the upload sizes via a query and not have to hit the underlying filesystem at all.
du is a very good tool for this sort of thing; it's probably faster than anything you could roll by hand using Find, but you'll need to be careful when parsing the output.
It's not uncommon for directories to contain exotic characters in their names, including the obvious, like spaces, and the unusual, like newlines. That makes parsing the output of du somewhat unreliable if someone does this:
% mkdir "foo
1234 bar"
If that's not a big deal, forget about it. Otherwise you'll need to recurse and do the math manually, something that can take a while for the Ruby interpreter on large filesystems.
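If it does matter and you go the manual route, the idea is just a recursive walk that sums file sizes; here is a quick sketch in Python (the Ruby version using Find.find and File.size is a near-literal translation):

```
import os

def directory_size(root):
    """Total size in bytes of every regular file under `root`."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total

print(directory_size("/home/some_user/uploads"))  # the path is a placeholder
```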

Things you look for when trying to understand new software

I wonder what sort of things you look for when you start working on an existing, but new-to-you, system. Let's say that the system is quite big (whatever that means to you).
Some of the things that were identified are:
Where is a particular subroutine or procedure invoked?
What are the arguments, results and predicates of a particular function?
How does the flow of control reach a particular location?
Where is a particular variable set, used or queried?
Where is a particular variable declared?
Where is a particular data object accessed, i.e. created, read, updated or deleted?
What are the inputs and outputs of a particular module?
But if you look for something more specific, or any of the above questions is particularly important to you, please share it with us :)
I'm particularly interested in something that could be extracted via dynamic analysis/execution.
I like to use a "use case" approach:
First, I ask myself "what's this software's purpose?": I try to identify how users are going to interact with the application;
Once I have some "use cases", I try to understand which objects are most involved and how they interact with other objects.
Once I've done this, I draw a UML-style diagram that describes what I've just learned, for future reference. What happens next depends on the task I've been assigned, e.g. modifying the code, documenting the code, etc.
There is the question of what motivation I have for learning the new system:
Bug fix/minor enhancement - In this case, I may focus solely on the portion of the system that performs the specific function that needs to be altered. This is a way to break down a huge system, but it is also a way to identify whether the issue is something I can fix or something I have to hand off to the vendor of the off-the-shelf software we are using; e.g. a CRM, CMS, or ERP system may be a customized off-the-shelf product with many pieces to it.
Project work - This is the other case, where I'd probably try to build myself a view from 30,000 feet or so to learn what the high-level components are and which areas of the system the project impacts. An example is joining a company and working off an existing code base, without the luxury of the narrow focus of the previous case. Part of that view is looking for patterns in the code in terms of naming conventions, project structure, etc., as these may be useful once I start changing code in the system. I'd probably also do some tracing through the system to see where the uglier parts of the code are. By uglier I mean the parts that are kludge-like and may contain spaghetti code, because they were rushed when first written and are now being reworked heavily.
To my mind, another way to view this is whether I'm going to spend days or weeks wrapping my head around a system, as in the second case, or whether it should, optimistically, take only a few hours to get my footing and make the necessary changes.
