How to generate a unique (short) URL folder name on the fly, like Bit.ly

I'm creating an application which will create a large number of folders on a web server, with files inside of them.
I need the folder name to be unique. I can easily do this with a GUID, but I want something more user friendly. It doesn't need to be speakable by users, but it should be short and use standard characters (letters are best).
In short: I'm looking to do something like what Bit.ly does with its unique names:
www.mydomain.com/ABCDEF
Is there a good reference on how to do this? My platform will be .NET/C#, but I'm OK with any help, references, or links on the general concept, or any overall advice on how to solve this task.

Start at 1. Increment to 2, 3, 4, 5, 6, 7,
8, 9, a, b...
A, B, C...
X, Y, Z, 10, 11, 12, ... 1a, 1b,
You get the idea.
You have a synchronized global int/long "next id" and represent it in base 62 (numbers, lowercase, caps) or base 36 or something.
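A minimal Python sketch of the counter idea above, assuming the digit order 0-9, a-z, A-Z from the sequence shown. The names `ALPHABET`, `encode` and `decode` are illustrative, and storing and synchronizing the "next id" counter is left out.

```python
import string

# Digit order follows the sequence above: 0-9, then a-z, then A-Z (base 62).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode(n: int, alphabet: str = ALPHABET) -> str:
    """Represent a non-negative integer in base len(alphabet)."""
    if n < 0:
        raise ValueError("id must be non-negative")
    if n == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while n:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(digits))


def decode(name: str, alphabet: str = ALPHABET) -> int:
    """Inverse of encode: turn a folder name back into its id."""
    n = 0
    for ch in name:
        n = n * len(alphabet) + alphabet.index(ch)
    return n
```

Six base-62 characters cover 62^6 = 56,800,235,584 ids. On a case-insensitive file system the folders "a" and "A" would collide, so base 36 (passing only digits and lowercase letters as `alphabet`) may be the safer choice there.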

I'm assuming that you know how to use your web server's redirect capabilities. If you need help, just comment :).
The way I would do it would be generating a random integer (between the integer values of 'a' and 'z'); converting it into a char; appending it to a string; and repeating until we reach the needed length. If it generates a value already in the database, repeat the process. If it was unique, store it in the database with the name of the actual location and the name of the alias.
This is a bit hack-like because it assumes that 'a' through 'z' are actually in sequence in their integer values.
Best I could think of :(.
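The random approach above could look roughly like this in Python. It is only a sketch: the `taken` set stands in for the database lookup, and `random_name` and `unique_name` are made-up names. Picking from an explicit alphabet string avoids relying on 'a' through 'z' having consecutive integer values.

```python
import secrets
import string


def random_name(length: int = 6) -> str:
    """Pick `length` lowercase letters uniformly at random."""
    return "".join(secrets.choice(string.ascii_lowercase) for _ in range(length))


def unique_name(taken: set, length: int = 6, max_tries: int = 100) -> str:
    """Draw names until one is not in `taken`, then reserve it."""
    for _ in range(max_tries):
        name = random_name(length)
        if name not in taken:
            taken.add(name)  # stand-in for the database insert
            return name
    raise RuntimeError("could not find a free name; consider a longer length")
```

With a real database, a unique constraint on the alias column plus a retry on insert failure would close the gap between the check and the insert when several requests run at once.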

In Perl, without modules so you can translate more easily.
sub convert_to_base {
    my ($n, $b) = @_;
    my @digits;
    while ($n) {
        my $digit = $n % $b;
        unshift @digits, $digit;
        $n = ($n - $digit) / $b;
    }
    # Zero has no digits after the loop, so emit a single 0.
    unshift @digits, 0 if !@digits;
    return @digits;
}
# Whatever characters you want to use.
my @digit_set = ( '0'..'9', 'a'..'z', 'A'..'Z' );
# The id of the record in the database,
# or one more than the last id you generated.
my $id = 1;
my $converted =
    join '',
    map { $digit_set[$_] }
    convert_to_base($id, 0+@digit_set);

I needed something similar to what you're trying to accomplish. I retooled my code to generate folders, so try this. It's set up for a console app, but you can use it in a website also.
// Requires: using System; using System.Collections.Generic; using System.Text;
private static void genRandomFolders()
{
    string basepath = "C:\\Users\\{username here}\\Desktop\\";
    int count = 5;
    int length = 8;
    // Codes used so far in this run, compared case-insensitively.
    HashSet<string> codes = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    int total = 0;
    Random rnd = new Random();
    while (total < count)
    {
        string code = RandomString(rnd, length);
        string path = System.IO.Path.Combine(basepath, code);
        // Add returns false for a duplicate; also skip folders left over from earlier runs.
        if (!codes.Add(code) || System.IO.Directory.Exists(path))
            continue;
        //Create directory here
        System.IO.Directory.CreateDirectory(path);
        total++;
        if (total % 100 == 0)
            Console.WriteLine("Generated " + total.ToString() + " random folders...");
    }
    Console.WriteLine();
    Console.WriteLine("Generated " + total.ToString() + " total random folders.");
}
public static string RandomString(Random r, int len)
{
    //string str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"; //uppercase only
    //string str = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890"; //All
    string str = "abcdefghjkmnpqrstuvwxyz123456789"; //Lowercase and digits, without the look-alikes i, l, o and 0
    StringBuilder sb = new StringBuilder();
    while ((len--) > 0)
        sb.Append(str[r.Next(str.Length)]);
    return sb.ToString();
}

Related

Getting a specific number from a bigger number?

a = (any random number)
Is there a way I could get the third digit (for example, the 3 in 12345) without converting the number into a string?
If not, what would be a good way to get it by converting it into a string?
You can use the sub function
function getDigit(value,digitPlace)
return tonumber(tostring(value):sub(digitPlace,digitPlace))
end
This will get the third digit of a as a number:
a = 12345
print(getDigit(a,3))
You can get that using simple maths.
function getDigitInt(value, digit)
-- get rid of the sign
value = math.abs(value)
-- how many digits does the number have?
local numDigits = math.floor(math.log(value, 10)) + 1
-- does the requested digit exist?
if digit > numDigits or digit < 1 then
print("digit does not exist")
return
end
-- return the requested digit
return math.floor(value / 10^(numDigits - digit)) % 10
end
-- test
for i = 0, 8 do print(getDigitInt(1234567, i)) end
Add more error handling as needed. Also this can only handle integers of course. But I'm sure you will find out how to apply this idea to decimals as well.
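For comparison, here is the same arithmetic idea as a small Python sketch. It counts the digits by repeated integer division instead of a logarithm, which sidesteps floating point and the special case of zero. The name `get_digit` and the choice to raise `IndexError` are my own assumptions, not part of the answer above.

```python
def get_digit(value: int, place: int) -> int:
    """Return the digit at 1-based position `place`, counted from the left."""
    value = abs(value)  # get rid of the sign
    # Count the digits using integer arithmetic only.
    num_digits = 1
    while value // 10 ** num_digits > 0:
        num_digits += 1
    if not 1 <= place <= num_digits:
        raise IndexError("digit does not exist")
    # Drop the digits to the right of `place`, then keep the last one.
    return value // 10 ** (num_digits - place) % 10
```

Calling `get_digit(12345, 3)` drops the two rightmost digits (giving 123) and keeps the last remaining one.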
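You can convert the number into an array of digits and read any position easily, like below (digitPlace is zero-based, counted from the left, and the number is assumed to be non-negative):
public int GetDigitsPlace(int number, int digitPlace) {
    string t = number.ToString();
    int[] nArr = new int[t.Length];
    for (int i = 0; i < nArr.Length; i++) {
        // t[i] is a char; int.Parse expects a string, so subtract '0' instead.
        nArr[i] = t[i] - '0';
    }
    return nArr[digitPlace];
}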

How to find the number of documents (and fraction) per topic using LDA?

I am trying to extract topics from 7 million tweets. I treat each tweet as a document, so I stored all tweets in a file where each line (one tweet) is a document. I used this file as the input file for the Mallet API.
public static void LDAModel(int numofK,int numbofIteration,int numberofThread,String outputDir,InstanceList instances) throws Exception
{
// Create a model with 100 topics, alpha_t = 0.01, beta_w = 0.01
// Note that the first parameter is passed as the sum over topics, while
// the second is the parameter for a single dimension of the Dirichlet prior.
int numTopics = numofK;
ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);
model.addInstances(instances);
// Use two parallel samplers, which each look at one half the corpus and combine
// statistics after every iteration.
model.setNumThreads(numberofThread);
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(numbofIteration);
model.estimate();
// Show the words and topics in the first instance
// The data alphabet maps word IDs to strings
Alphabet dataAlphabet = instances.getDataAlphabet();
FeatureSequence tokens = (FeatureSequence) model.getData().get(0).instance.getData();
LabelSequence topics = model.getData().get(0).topicSequence;
Formatter out = new Formatter(new StringBuilder(), Locale.US);
for (int position = 0; position < tokens.getLength(); position++) {
// out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)), topics.getIndexAtPosition(position));
out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)), topics.getIndexAtPosition(position));
}
System.out.println(out);
// Estimate the topic distribution of the first instance,
// given the current Gibbs state.
double[] topicDistribution = model.getTopicProbabilities(0);
// Get an array of sorted sets of word ID/count pairs
ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();
// Show top 10 words in topics with proportions for the first document
String topicsoutput="";
for (int topic = 0; topic < numTopics; topic++) {
Iterator<IDSorter> iterator = topicSortedWords.get(topic).iterator();
out = new Formatter(new StringBuilder(), Locale.US);
out.format("%d\t%.3f\t", topic, topicDistribution[topic]);
int rank = 0;
while (iterator.hasNext() && rank < 10) {
IDSorter idCountPair = iterator.next();
out.format("%s (%.0f) ", dataAlphabet.lookupObject(idCountPair.getID()), idCountPair.getWeight());
//out.format("%s ", dataAlphabet.lookupObject(idCountPair.getID()));
rank++;
}
System.out.println(out);
}
// Create a new instance with high probability of topic 0
StringBuilder topicZeroText = new StringBuilder();
Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();
int rank = 0;
while (iterator.hasNext() && rank < 10) {
IDSorter idCountPair = iterator.next();
topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID()) + " ");
rank++;
}
// Create a new instance named "test instance" with empty target and source fields.
InstanceList testing = new InstanceList(instances.getPipe());
testing.addThruPipe(new Instance(topicZeroText.toString(), null, "test instance", null));
TopicInferencer inferencer = model.getInferencer();
double[] testProbabilities = inferencer.getSampledDistribution(testing.get(0), 10, 1, 5);
System.out.println("0\t" + testProbabilities[0]);
File pathDir = new File(outputDir + File.separator+ "NumofTopics"+numTopics); //FIXME replace all strings with constants
pathDir.mkdir();
String DirPath = pathDir.getPath();
String stateFile = DirPath+File.separator+"output_state.gz";
String outputDocTopicsFile = DirPath+File.separator+"output_doc_topics.txt";
String topicKeysFile = DirPath+File.separator+"output_topic_keys";
PrintWriter writer=null;
String topicKeysFile_fromProgram = DirPath+File.separator+"output_topic";
try {
writer = new PrintWriter(topicKeysFile_fromProgram, "UTF-8");
writer.print(topicsoutput);
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
model.printTopWords(new File(topicKeysFile), 11, false);
model.printDocumentTopics(new File (outputDocTopicsFile));
model.printState(new File (stateFile));
}
public static void main(String[] args) throws Exception{
// Begin by importing documents from text to feature sequences
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
// Pipes: lowercase, tokenize, remove stopwords, map to features
pipeList.add( new CharSequenceLowercase() );
pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) );
pipeList.add( new TokenSequenceRemoveStopwords(new File("H:\\Data\\stoplists\\en.txt"), "UTF-8", false, false, false) );
pipeList.add( new TokenSequence2FeatureSequence() );
InstanceList instances = new InstanceList (new SerialPipes(pipeList));
Reader fileReader = new InputStreamReader(new FileInputStream(new File("E:\\Thesis Data\\DataForLDA\\freshnewData\\cleanTweets.txt")), "UTF-8");
instances.addThruPipe(new CsvIterator (fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
3, 2, 1)); // data, label, name fields
int numberofTopic=5;
int numberofIteration=50;
int numberofThread=6;
String outputDir="J:\\Topics\\";
//int numberofTopic=5;
LDAModel(numberofTopic,numberofIteration,numberofThread,outputDir,instances);
TimeUnit.SECONDS.sleep(30);
numberofTopic=10;
}
I have got three files from the above program.
1. state file
2. topic proportion file
3. key topic list
I would like to find out the number of documents allocated per topic.
For example I got the following output from key topic list file
0.004 obama (5471) canada (5283) woman (5152) vote (4879) police(3965)
where the first column is the topic serial number, the second column is the topic weight, and the remaining columns are the words under this topic, each with its word count in parentheses.
Here I get the number of words under this topic, but I would also like to show the number of documents in which this topic occurs. It would be helpful to write this output to a separate file, for example:
Topic 1: doc1(80%) doc2(70%) .......
Could anyone please give some idea or any source code for this?
Thanks.
The information you are looking for is contained in the file "2. topic proportion" you mentioned. Note that every document contains each topic with some percentage (although the percentage may be large for one topic and extremely small for the others). You will have to decide what you want to extract from the file: the dominant topic (it is in column 3); the dominant topic, but only when its percentage is at least 50% (sometimes two topics have almost the same percentage); and so on.
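A rough Python sketch of that extraction. It assumes each line of the topic proportion file has the layout `<doc index> <doc name> <topic> <proportion> <topic> <proportion> ...` with the pairs sorted by decreasing proportion (so the dominant topic is in column 3, as described above), that document names contain no whitespace, and that header lines start with `#`. Check these assumptions against your own file first, because the layout differs between Mallet versions. The function names are made up.

```python
from collections import defaultdict


def docs_per_topic(lines, min_share=0.0):
    """Group documents by their dominant topic.

    Assumed line layout: <doc index> <doc name> <topic> <proportion> ...
    with topic/proportion pairs sorted by decreasing proportion.
    """
    by_topic = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue  # skip the header line and rows that are too short
        name, topic, share = fields[1], int(fields[2]), float(fields[3])
        if share >= min_share:
            by_topic[topic].append((name, share))
    return dict(by_topic)


def format_report(by_topic):
    """One row per topic: document count, then documents by decreasing share."""
    rows = []
    for topic in sorted(by_topic):
        docs = sorted(by_topic[topic], key=lambda d: d[1], reverse=True)
        listing = " ".join(f"{name}({share:.0%})" for name, share in docs)
        rows.append(f"Topic {topic} ({len(docs)} docs): {listing}")
    return rows
```

Passing an open file object as `lines` reads the file one line at a time, and `min_share=0.5` keeps only documents whose dominant topic reaches 50%. Dividing each topic's document count by the total number of documents gives the fraction per topic.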

SQL CLR User Defined Function (C#) adds null character (\0) in between every existing character in String being returned

This one has kept me stumped for a couple of days now.
It's my first dabble with CLR & UDF ...
I have created a user defined function that takes a multiline String as input, scans it and replaces a certain line in the string with an alternative if found. If it is not found, it simply appends the desired line at the end. (See code)
The problem, it seems, comes when the final String (or Stringbuilder) is converted to an SqlString or SqlChars. The converted, returned String always contains the Nul character as every second character (viewing via console output, they are displayed as spaces).
I'm probably missing something fundamental on UDF and/or CLR.
Please Help!!
Code (I leave in the commented Stringbuilder which was my initial attempt... changed to normal String in a desperate attempt to find the issue):
[Microsoft.SqlServer.Server.SqlFunction]
[return: SqlFacet(MaxSize = -1, IsFixedLength = false)]
//public static SqlString udf_OmaChangeJob(String omaIn, SqlInt32 jobNumber) {
public static SqlChars udf_OmaChangeJob(String omaIn, SqlInt32 jobNumber) {
if (omaIn == null || omaIn.ToString().Length <= 0) return new SqlChars("");
String[] lines = Regex.Split(omaIn.ToString(), "\r\n");
Regex JobTag = new Regex(@"^JOB=.+$");
//StringBuilder buffer = new StringBuilder();
String buffer = String.Empty;
bool matched = false;
foreach (var line in lines) {
if (!JobTag.IsMatch(line))
//buffer.AppendLine(line);
buffer += line + "\r\n";
else {
//buffer.AppendLine("JOB=" + jobNumber);
buffer += ("JOB=" + jobNumber + "\r\n");
matched = true;
}
}
if (!matched) //buffer.AppendLine("JOB=" + jobNumber);
buffer += ("JOB=" + jobNumber) + "\r\n";
//return new SqlString(buffer.ToString().Replace("\0",String.Empty)) + "blablabla";
// buffer = buffer.Replace("\0", "|");
return new SqlChars(buffer + "\r\nTheEnd");
}
In my experience, the omaIn parameter should be of type SqlString, and when you go to read its value for processing, copy it into a local variable:
string omaString = omaIn.IsNull ? string.Empty : omaIn.Value;
Then, on every code path that returns, rewrap the C# string:
return omaString == string.Empty ? SqlString.Null : new SqlString(omaString);
I have had some fun wrestling matches learning the intricate hand-off between local and outbound types, especially with CLR TVFs.
Hope that can help!

What grammar is this?

I have to parse a document containing groups of variable-value-pairs which is serialized to a string e.g. like this:
4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^
Here are the different elements:
Group IDs: 4 and 1
Length of the string representation of each group: 26 and 14
One of the groups: 4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^
Variables: VAR1 and VAR2 (in group 4), VAR1 (in group 1)
Length of the string representation of the values: 6, 4 and 6
The values themselves: VALUE1, VAL2 and VALUE1
Variables consist only of alphanumeric characters.
No assumption is made about the values, i.e. they may contain any character, including ^.
Is there a name for this kind of grammar? Is there a parsing library that can handle this mess?
So far I am using my own parser, but due to the fact that I need to detect and handle corrupt serializations the code looks rather messy, thus my question for a parser library that could lift the burden.
The simplest way to approach it is to note that there are two nested levels that work the same way. The pattern is extremely simple:
id^length^content^
At the outer level, this produces a set of groups. Within each group, the content follows exactly the same pattern, only here the id is the variable name, and the content is the variable value.
So you only need to write that logic once and you can use it to parse both levels. Just write a function that breaks a string up into a list of id/content pairs. Call it once to get the groups, and then loop through them calling it again for each content to get the variables in that group.
Breaking it down into these steps, first we need a way to get "tokens" from the string. This function returns an object with three methods, to find out if we're at "end of file", and to grab the next delimited or counted substring:
var tokens = function(str) {
var pos = 0;
return {
eof: function() {
return pos == str.length;
},
delimited: function(d) {
var end = str.indexOf(d, pos);
if (end == -1) {
throw new Error('Expected delimiter');
}
var result = str.substr(pos, end - pos);
pos = end + d.length;
return result;
},
counted: function(c) {
var result = str.substr(pos, c);
pos += c;
return result;
}
};
};
Now we can conveniently write the reusable parse function:
var parse = function(str) {
var parts = {};
var t = tokens(str);
while (!t.eof()) {
var id = t.delimited('^');
var len = t.delimited('^');
var content = t.counted(parseInt(len, 10));
var end = t.counted(1);
if (end !== '^') {
throw new Error('Expected ^ after counted string, instead found: ' + end);
}
parts[id] = content;
}
return parts;
};
It builds an object where the keys are the IDs (or variable names). I'm assuming that, as they have names, the order isn't significant.
Then we can use that at both levels to create the function to do the whole job:
var parseGroups = function(str) {
var groups = parse(str);
Object.keys(groups).forEach(function(id) {
groups[id] = parse(groups[id]);
});
return groups;
}
For your example, it produces this object:
{
'1': {
VAR1: 'VALUE1'
},
'4': {
VAR1: 'VALUE1',
VAR2: 'VAL2'
}
}
I don't think it's a trivial task to create a grammar for this. On the other hand, a simple, straightforward approach is not that hard: you know the length of every critical string, so you just chop your string apart according to those lengths.
Where do you see problems?
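Since the question is mostly about detecting corrupt input, here is a hedged Python sketch of the chop-by-length idea with explicit checks. It assumes the `id^length^content^` layout at both levels; `CorruptData`, `parse_records` and `parse_groups` are invented names. It returns lists of pairs, so order and repeated ids are preserved.

```python
class CorruptData(ValueError):
    """Raised when the text does not match id^length^content^."""


def parse_records(text: str, sep: str = "^"):
    """Split text into (id, content) pairs, keeping order and duplicates."""
    records = []
    pos = 0
    while pos < len(text):
        id_end = text.find(sep, pos)
        len_end = text.find(sep, id_end + 1) if id_end != -1 else -1
        if len_end == -1:
            raise CorruptData(f"missing delimiter after offset {pos}")
        ident, length = text[pos:id_end], text[id_end + 1:len_end]
        if not (length.isascii() and length.isdigit()):
            raise CorruptData(f"bad length {length!r} at offset {id_end + 1}")
        start = len_end + 1
        stop = start + int(length)
        # The counted content must be followed by exactly one separator.
        if text[stop:stop + 1] != sep:
            raise CorruptData(f"record {ident!r} is not terminated at offset {stop}")
        records.append((ident, text[start:stop]))
        pos = stop + 1
    return records


def parse_groups(text: str):
    """Parse the outer level, then each group's content with the same rule."""
    return [(gid, parse_records(body)) for gid, body in parse_records(text)]
```

Because the content is taken by count rather than by searching for the next `^`, values that themselves contain `^` come through unchanged, and a length that runs past the end of the string is reported as an unterminated record.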

How do I decode, change, then re-encode a CORBA IOR file (Visibroker) in my Java client code?

I am writing code to ingest the IOR file generated by the team responsible for the server and use it to bind my client to their object. Sounds easy, right?
For some reason a bit beyond my grasp (having to do with firewalls, DMZs, etc.), the value for the server inside the IOR file is not something we can use. We have to modify it. However, the IOR string is encoded.
What does Visibroker provide that will let me decode the IOR string, change one or more values, then re-encode it and continue on as normal?
I've already looked into IORInterceptors and URL Naming but I don't think either will do the trick.
Thanks in advance!
When you feel like you need to hack an IOR, resist the urge to do so by writing code and whatnot to mangle it to your liking. IORs are meant to be created and dictated by the server that contains the referenced objects, so the moment you start mucking around in there, you're kinda "voiding your warranty".
Instead, spend your time finding the right way to make the IOR usable in your environment by having the server use an alternative hostname when it generates them. Most ORBs offer such a feature. I don't know Visibroker's particular configuration options at all, but a quick Google search revealed this page that shows a promising value:
vbroker.se.iiop_ts.host
Specifies the host name used by this server engine.
The default value, null, means use the host name from the system.
Hope that helps.
A long time ago I wrote an IorParser for GNU Classpath, and the code is available. It is a normal parser written with awareness of the format, so I think it should not "void a warranty". An IOR contains multiple tagged profiles that are encapsulated very much like XML, so we can parse and modify the profiles that we need and understand, and leave the rest untouched.
The profile we need to parse is TAG_INTERNET_IOP. It contains version number, host, port and object key. Code that reads and writes this profile can be found in gnu.IOR class. I am sorry this is part of the system library and not a nice piece of code to copy paste here but it should not be very difficult to rip it out with a couple of dependent classes.
This question has been repeatedly asked as CORBA :: Get the client ORB address and port with use of IIOP
Use the FixIOR tool (binary) from jacORB to patch the address and port of an IOR. Download the binary (unzip it) and run:
fixior <new-address> <new-port> <ior-file>
The tool will overwrite the content of the IOR file with the 'patched' IOR.
You can use IOR Parser to check the resulting IOR and compare it to your original IOR.
Use this function to change the IOR. Pass the stringified IOR as the first argument.
#include <cstring>
#include <iostream>
#include <vector>

// Returns the value of a hex digit, or -1 if c is not a hex digit.
static int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// str:    stringified IOR ("IOR:" followed by hex digits).
// newIOR: output buffer, at least strlen(str) + 1 bytes.
void hackIOR(const char* str, char* newIOR)
{
    newIOR[0] = 0;
    if (!str || strncmp(str, "IOR:", 4) != 0) {
        std::cout << "not a stringified IOR" << std::endl;
        return;
    }
    const char* p = str + 4;
    size_t s = strlen(p) / 2; // how many octets are there in the string
    std::vector<unsigned char> temp(s);
    for (size_t i = 0; i < s; i++) {
        int hi = hexValue(p[2 * i]);
        int lo = hexValue(p[2 * i + 1]);
        if (hi < 0 || lo < 0) {
            std::cout << "invalid octet" << std::endl;
            return;
        }
        temp[i] = static_cast<unsigned char>((hi << 4) | lo);
    }
    // Now temp has the decoded IOR octets. Print it.
    // Replace the object ID in temp.
    // Encode it back with the following code.
    static const char digits[] = "0123456789ABCDEF";
    strcpy(newIOR, "IOR:");
    size_t l = 4;
    for (size_t k = 0; k < s; k++) {
        newIOR[l++] = digits[temp[k] >> 4];
        newIOR[l++] = digits[temp[k] & 0x0F];
    }
    newIOR[l] = 0;
    // The new IOR is now in newIOR.
}
Hope this works for you.
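For reference, the hex layer on its own is short in Python. This sketch only converts between the "IOR:" string and raw octets; `decode_ior` and `encode_ior` are made-up names. The octets are a CDR encapsulation whose first octet is the byte-order flag, so changing a host name inside them also means updating the length prefixes around it, which is not handled here.

```python
def decode_ior(ior: str) -> bytes:
    """Turn a stringified IOR ("IOR:" plus hex digits) into raw octets."""
    if not ior.startswith("IOR:"):
        raise ValueError("not a stringified IOR")
    # bytes.fromhex raises ValueError on invalid or odd-length hex input.
    return bytes.fromhex(ior[4:])


def encode_ior(octets: bytes) -> str:
    """Turn raw octets back into a stringified IOR with uppercase hex."""
    return "IOR:" + octets.hex().upper()
```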
