Convert unicode to characters in a file using Ruby

Convert unicode to characters in a file using Ruby - ruby-on-rails

I have this string in a code.txt file.
"class Solution {\u000Apublic:\u000A vector\u003Cvector\u003Cint\u003E\u003E insert(vector\u003Cvector\u003Cint\u003E\u003E\u0026 intervals, vector\u003Cint\u003E\u0026 newInterval) {\u000A int len \u003D intervals.size()\u003B\u000A int index \u003D 0\u003B\u000A vector\u003Cvector\u003Cint\u003E \u003E ans\u003B\u000A \u000A\u000A while(index \u003C len \u0026\u0026 intervals[index][1] \u003C newInterval[0]) ans.push_back(intervals[index++])\u003B\u000A \u000A while(index \u003C len \u0026\u0026 intervals[index][0] \u003C\u003D newInterval[1]) {\u000A newInterval[0] \u003D min(intervals[index][0], newInterval[0])\u003B\u000A newInterval[1] \u003D max(intervals[index][1], newInterval[1])\u003B\u000A index++\u003B\u000A }\u000A \u000A ans.push_back(newInterval)\u003B\u000A \u000A while(index \u003C len) ans.push_back(intervals[index++])\u003B\u000A\u000A return ans\u003B \u000A }\u000A}\u003B "
I would like to convert this string to C++ syntex and write to solution.cpp file.
The content in solution.cpp will look like..
class Solution {
public:
vector<vector<int>> insert(vector<vector<int>>& intervals, vector<int>& newInterval) {
int len = intervals.size();
int index = 0;
vector<vector<int> > ans;
while(index < len && intervals[index][1] < newInterval[0]) ans.push_back(intervals[index++]);
while(index < len && intervals[index][0] <= newInterval[1]) {
newInterval[0] = min(intervals[index][0], newInterval[0]);
newInterval[1] = max(intervals[index][1], newInterval[1]);
index++;
}
ans.push_back(newInterval);
while(index < len) ans.push_back(intervals[index++]);
return ans;
}
};
I have tried enforcing/converting encoding to UTF-8 but the string stays the same.
code = File.read('code.txt')
code = code.encode('UTF-8')
file = File.open('solution.cpp', "w:UTF-8")
file.write(code)
How can I do this? Thank you.

So, I have tried to reproduce your problem and got the same result as described by using your solution.
I have noticed that \u003B (for example) is a unicode code for semicolon character. So, I analyzed the string for each "U+" notation using regex /\\u(.{4})/, as it marks "hexadecimal digits" as being Unicode code points. Then used gsub! and Array#pack to convert and substitute each of the Unicode chars.
[$1.to_i(16)].pack('U') # => "\n", "\n", "<", "&", "\n", "=" ...etc.
And finally wrote the result to a file. So, my final approach looks like this:
code = File.read('code.txt')
code.gsub!(/\\u(.{4})/) do |match|
[$1.to_i(16)].pack('U')
end
File.open('solution.cpp', 'w') { |f| f.puts code.gsub!(/\A"|"\Z/, '') }
Also note, I have used gsub again at the end, to search for the leading or trailing quote and replace it with an empty string when writing to a file.

Related

Nom Parser To Unescape String

I'm writing a Nom parser for RCS. RCS Files tend to be ISO-8859-1 encoded. One of the grammar productions is for a String. This is #-delimited and literal # symbols are escaped as ##.
#A String# -> A String
#A ## String# -> A # String
I have a working function (see end). IResult is from Nom, you either return the parsed thing, plus the rest of the unparsed input, or an Error/Incomplete. Cow is used to return a reference built on the original input slice if no unescaping was required, or an owned string if it was.
Are there any built in Nom macros that could have helped with this parse?
#[macro_use]
extern crate nom;
use std::str;
use std::borrow::Cow;
use nom::*;
/// Parse an RCS String
fn string<'a>(input: &'a[u8]) -> IResult<&'a[u8], Cow<'a, str>> {
let len = input.len();
if len < 1 {
return IResult::Incomplete(Needed::Unknown);
}
if input[0] != b'#' {
return IResult::Error(Err::Code(ErrorKind::Custom(0)));
}
// start of current chunk. Chunk is a piece of unescaped input
let mut start = 1;
// current char index in input
let mut i = start;
// FIXME only need to allocate if input turned out to need unescaping
let mut s: String = String::new();
// Was the input escaped?
let mut escaped = false;
while i < len {
// Check for end delimiter
if input[i] == b'#' {
// if there's another # then it is an escape sequence
if i + 1 < len && input[i + 1] == b'#' {
// escaped #
i += 1; // want to include the first # in the output
s.push_str(str::from_utf8(&input[start .. i]).unwrap());
start = i + 1;
escaped = true;
} else {
// end of string
let result = if escaped {
s.push_str(str::from_utf8(&input[start .. i]).unwrap());
Cow::Owned(s)
} else {
Cow::Borrowed(str::from_utf8(&input[1 .. i]).unwrap())
};
return IResult::Done(&input[i + 1 ..], result);
}
}
i += 1;
}
IResult::Incomplete(Needed::Unknown)
}

It looks like the way to use the nom library is using the macro combinators. A quick browse of the source code gives some nice examples of parsers, including parsing of strings with escape characters. This is what I came up with:
#[macro_use]
extern crate nom;
use nom::*;
named!(string< Vec<u8> >, delimited!(
tag!("#"),
fold_many0!(
alt!(
is_not!(b"#") |
map!(
complete!(tag!("##")),
|_| &b"#"[..]
)
),
Vec::new(),
|mut acc: Vec<u8>, bytes: &[u8]| {
acc.extend(bytes);
acc
}
),
tag!("#")
));
#[test]
fn it_works() {
assert_eq!(string(b"#string#"), IResult::Done(&b""[..], b"string".to_vec()));
assert_eq!(string(b"#string with ## escapes#"), IResult::Done(&b""[..], b"string with # escapes".to_vec()));
assert_eq!(string(b"#invalid string"), IResult::Incomplete(Needed::Size(16)));
}
As you can see, I simply copy the bytes into a vector using Vec::extend - you could be more sophisticated here and return a Cow byte slice if you wanted.
The escaped! macro does not appear to be of use in this case unfortunately, as it can't seem to work when the terminator is the same as the escape character (which is actually a pretty common case).

Ask user for path for fopen in C?

This is my function. It's working absolutely fine; I just can't get one more thing working.
Instead of the static fopen paths, I need the user to write the path for the files. I tried several things but I can't get it working. Please help
int FileToFile() {
FILE *fp;
FILE *fp_write;
char line[128];
int max=0;
int countFor=0;
int countWhile=0;
int countDo = 0;
fp = fopen("d:\\text.txt", "r+");
fp_write = fopen("d:\\results.txt", "w+");
if (!fp) {
perror("Greshka");
}
else {
while (fgets(line, sizeof line, fp) != NULL) {
countFor = 0;
countWhile = 0;
countDo = 0;
fputs(line, stdout);
if (line[strlen(line)-1] = "\n") if (max < (strlen(line) -1)) max = strlen(line) -1;
else if (max < strlen(line)) max = strlen(line);
char *tmp = line;
while (tmp = strstr(tmp, "for")){
countFor++;
tmp++;
}
tmp = line;
while (tmp = strstr(tmp, "while")){
countWhile++;
tmp++;
}
tmp = line;
while (tmp = strstr(tmp, "do")){
countDo++;
tmp++;
}
fprintf(fp_write, "Na tozi red operatora for go ima: %d pyti\n", countFor);
fprintf(fp_write, "Na tozi red operatora for/while go ima: %d pyti\n", countWhile - countDo);
fprintf(fp_write, "Na tozi red operatora do go ima: %d pyti\n", countDo);
}
fprintf(fp_write, "Maximalen broi simvoli e:%d\n", max);
fclose(fp_write);
fclose(fp);
}
}

Have a look at argc and argv. They are used for command-line arguments passed to a program. This requires that your main function be revised as follows:
int main(int argc, char *argv[])
The argc is an integer that represents the number of command-like arguments, and argv is an array of char* that contain the arguments themselves. Note that for both, the program name itself counts as an argument.
So if you invoke your program like this:
myprog c:\temp
Then argc will be 2, argv[0] will be myprog, and argv[1] will be c:\temp. Now you can just pass the strings to your function. If you pass more arguments, they will be argv[2], etc.
Keep in mind if your path contains spaces, you must enclose it in double quotes for it to be considered one argument, because space is used as a delimiter:
myprog "c:\path with spaces"

What standard produced hex-encoded characters with an extra "25" at the front?

I'm trying to integrate with ybp.com, a vendor of proprietary software for managing book ordering workflows in large libraries. It keeps feeding me URLs that contain characters encoded with an extra "25" in them. Like this book title:
VOLATILE KNOWING%253a PARENTS%252c TEACHERS%252c AND THE CENSORED STORY OF ACCOUNTABILITY IN AMERICA%2527S PUBLIC SCHOOLS.
The encoded characters in this sample are as follows:
%253a = %3A = a colon
%252c = %2C = a comma
%2527 = %27 = an apostrophe (non-curly)
I need to convert these encodings to a format my internal apps can recognize, and the extra 25 is throwing things off kilter. The final two digits of the hex encoded characters appear to be identical to standard URL encodings, so a brute force method would be to replace "%25" with "%". But I'm leary of doing that because it would be sure to haunt me later when an actual %25 shows up for some reason.
So, what standard is this? Is there an official algorithm for converting values like this to other encodings?

%25 is actually a % character. My guess is that the external website is URLEncoding their output twice accidentally.
If that's the case, it is safe to replace %25 with % (or just URLDecode twice)

The ASCII code 37 (25 in hexadecimal) is %, so the URL encoding of % is %25.
It looks like your data got URL encoded twice: , -> %2C -> %252C
Substituting every %25 for % should not generate any problems, as an actual %25 would get encoded to %25252525.

Create a counter that increments one by one for next two characters, and if you found modulus, you go back, assign the previous counter the '%' char and proceed again. Something like this.
char *str, *newstr; // Fill up with some memory before proceeding below..
....
int k = 0, j = 0;
short modulus = 0;
char first = 0, second = 0;
short proceed = 0;
for(k=0,j=0; k<some_size; j++,k++) {
if(str[k] == '%') {
++k; first = str[k];
++k; second = str[k];
proceed = 1;
} else if(modulus == 1) {
modulus = 0;
--j; first = str[k];
++k; second = str[k];
newstr[j] = '%';
proceed = 1;
} else proceed = 0; // Do not do decoding..
if(proceed == 1) {
if(first == '2' && second == '5') {
newstr[j] = '%';
modulus = 1;
......

how do i decode, change, then re-encode a CORBA IOR file (Visibroker) in my Java client code?

I am writing code to ingest the IOR file generated by the team responsible for the server and use it to bind my client to their object. Sounds easy, right?
For some reason a bit beyond my grasp (having to do with firewalls, DMZs, etc.), the value for the server inside the IOR file is not something we can use. We have to modify it. However, the IOR string is encoded.
What does Visibroker provide that will let me decode the IOR string, change one or more values, then re-encode it and continue on as normal?
I've already looked into IORInterceptors and URL Naming but I don't think either will do the trick.
Thanks in advance!

When you feel like you need to hack an IOR, resist the urge to do so by writing code and whatnot to mangle it to your liking. IORs are meant to be created and dictated by the server that contains the referenced objects, so the moment you start mucking around in there, you're kinda "voiding your warranty".
Instead, spend your time finding the right way to make the IOR usable in your environment by having the server use an alternative hostname when it generates them. Most ORBs offer such a feature. I don't know Visibroker's particular configuration options at all, but a quick Google search revealed this page that shows a promising value:
vbroker.se.iiop_ts.host
Specifies the host name used by this server engine.
The default value, null, means use the host name from the system.
Hope that helps.

Long time ago I wrote IorParser for GNU Classpath, the code is available. It is a normal parser written being aware about the format, should not "void a warranty" I think. IOR contains multiple tagged profiles that are encapsulated very much like XML so we could parse/modify profiles that we need and understand and leave the rest untouched.
The profile we need to parse is TAG_INTERNET_IOP. It contains version number, host, port and object key. Code that reads and writes this profile can be found in gnu.IOR class. I am sorry this is part of the system library and not a nice piece of code to copy paste here but it should not be very difficult to rip it out with a couple of dependent classes.
This question has been repeatedly asked as CORBA :: Get the client ORB address and port with use of IIOP

Use the FixIOR tool (binary) from jacORB to patch the address and port of an IOR. Download the binary (unzip it) and run:
fixior <new-address> <new-port> <ior-file>
The tool will override the content of the IOR file with the 'patched' IOR
You can use IOR Parser to check the resulting IOR and compare it to your original IOR

Use this function to change the IOR. pass stringified IOR as first argument.
void hackIOR(const char* str, char* newIOR )
{
size_t s = (str ? strlen(str) : 0);
char temp[1000];
strcpy(newIOR,"IOR:");
const char *p = str;
s = (s-4)/2; // how many octets are there in the string
p += 4;
int i;
for (i=0; i<(int)s; i++) {
int j = i*2;
char v=0;
if (p[j] >= '0' && p[j] <= '9') {
v = ((p[j] - '0') << 4);
}
else if (p[j] >= 'a' && p[j] <= 'f') {
v = ((p[j] - 'a' + 10) << 4);
}
else if (p[j] >= 'A' && p[j] <= 'F') {
v = ((p[j] - 'A' + 10) << 4);
}
else
cout <<"invalid octet"<<endl;
if (p[j+1] >= '0' && p[j+1] <= '9') {
v += (p[j+1] - '0');
}
else if (p[j+1] >= 'a' && p[j+1] <= 'f') {
v += (p[j+1] - 'a' + 10);
}
else if (p[j+1] >= 'A' && p[j+1] <= 'F') {
v += (p[j+1] - 'A' + 10);
}
else
cout <<"invalid octet"<<endl;
temp[i]=v;
}
temp[i] = 0;
// Now temp has decoded IOR string. print it.
// Replace the object ID in temp.
// Encoded it back, with following code.
int temp1,temp2;
int l,k;
for(k = 0, l = 4 ; k < s ; k++)
{
temp1=temp2=temp[k];
temp1 &= 0x0F;
temp2 = temp2 & 0xF0;
temp2 = temp2 >> 4;
if(temp2 >=0 && temp2 <=9)
{
newIOR[l++] = temp2+'0';
}
else if(temp2 >=10 && temp2 <=15)
{
newIOR[l++] = temp2+'A'-10;
}
if(temp1 >=0 && temp1 <=9)
{
newIOR[l++] = temp1+'0';
}
else if(temp1 >=10 && temp1 <=15)
{
newIOR[l++] = temp1+'A'-10;
}
}
newIOR[l] = 0;
//new IOR is present in new variable newIOR.
}
Hope this works for you.

String Split in DXL

I have a string
Ex: "We prefer questions that can be answered; not just discussed "
now i want to split this string from ";"
like
We prefer questions that can be answered
and
not just discussed
is this possible in DXL.
i am learning DXL, so i don't have any idea whether we can split or not.
Note : This is not a home work.

I'm sorry for necroing this post. Being new to DXL I spent some time with the same challenge. I noticed that the implementations available on the have different specifications of "splitting" a string. Loving the Ruby language, I missed an implementation which comes at least close to the Ruby version of String#split.
Maybe my findings will be helpful to anybody.
Here's a functional comparison of
Variant A: niol's implementation (which at a first glance, appears to be the same implementation which is usually found at Capri Soft,
Variant B: PJT's implementation,
Variant C: Brett's implementation and
Variant D: my implementation (which provides the correct functionality imo).
To eliminate structural difference, all implementations were implemented in functions, returning a Skip list or an Array.
Splitting results
Note that all implementations return different results, depending on their definition of "splitting":
string mellow yellow; delimiter ello
splitVariantA returns 1 elements: ["mellow yellow" ]
splitVariantB returns 2 elements: ["m" "llow yellow" ]
splitVariantC returns 3 elements: ["w" "w y" "" ]
splitVariantD returns 3 elements: ["m" "w y" "w" ]
string now's the time; delimiter
splitVariantA returns 3 elements: ["now's" "the" "time" ]
splitVariantB returns 2 elements: ["" "now's the time" ]
splitVariantC returns 5 elements: ["time" "the" "" "now's" "" ]
splitVariantD returns 3 elements: ["now's" "the" "time" ]
string 1,2,,3,4,,; delimiter ,
splitVariantA returns 4 elements: ["1" "2" "3" "4" ]
splitVariantB returns 2 elements: ["1" "2,,3,4,," ]
splitVariantC returns 7 elements: ["" "" "4" "3" "" "2" "" ]
splitVariantD returns 7 elements: ["1" "2" "" "3" "4" "" "" ]
Timing
Splitting the string 1,2,,3,4,, with the pattern , for 10000 times on my machine gives these timings:
splitVariantA() : 406 ms
splitVariantB() : 46 ms
splitVariantC() : 749 ms
splitVariantD() : 1077 ms
Unfortunately, my implementation D is the slowest. Surprisingly, the regular expressions implementation C is pretty fast.
Source code
// niol, modified
Array splitVariantA(string splitter, string str){
Array tokens = create(1, 1);
Buffer buf = create;
int str_index;
buf = "";
for(str_index = 0; str_index < length(str); str_index++){
if( str[str_index:str_index] == splitter ){
array_push_str(tokens, stringOf(buf));
buf = "";
}
else
buf += str[str_index:str_index];
}
array_push_str(tokens, stringOf(buf));
delete buf;
return tokens;
}
// PJT, modified
Skip splitVariantB(string s, string delimiter) {
int offset
int len
Skip skp = create
if ( findPlainText(s, delimiter, offset, len, false)) {
put(skp, 0, s[0 : offset -1])
put(skp, 1, s[offset +1 :])
}
return skp
}
// Brett, modified
Skip splitVariantC (string s, string delim) {
Skip skp = create
int i = 0
Regexp split = regexp "^(.*)" delim "(.*)$"
while (split s) {
string temp_s = s[match 1]
put(skp, i++, s[match 2])
s = temp_s
}
put(skp, i++, s[match 2])
return skp
}
Skip splitVariantD(string str, string pattern) {
if (null(pattern) || 0 == length(pattern))
pattern = " ";
if (pattern == " ")
str = stringStrip(stringSqueeze(str, ' '));
Skip result = create;
int i = 0; // index for searching in str
int j = 0; // index counter for result array
bool found = true;
while (found) {
// find pattern
int pos = 0;
int len = 0;
found = findPlainText(str[i:], pattern, pos, len, true);
if (found) {
// insert into result
put(result, j++, str[i:i+pos-1]);
i += pos + len;
}
}
// append the rest after last found pattern
put(result, j, str[i:]);
return result;
}

Quick join&split I could come up with. Seams to work okay.
int array_size(Array a){
int size = 0;
while( !null(get(a, size, 0) ) )
size++;
return size;
}
void array_push_str(Array a, string str){
int array_index = array_size(a);
put(a, str, array_index, 0);
}
string array_get_str(Array a, int index){
return (string get(a, index, 0));
}
string str_join(string joiner, Array str_array){
Buffer joined = create;
int array_index = 0;
joined += "";
for(array_index = 0; array_index < array_size(str_array); array_index++){
joined += array_get_str(str_array, array_index);
if( array_index + 1 < array_size(str_array) )
joined += joiner;
}
return stringOf(joined)
}
Array str_split(string splitter, string str){
Array tokens = create(1, 1);
Buffer buf = create;
int str_index;
buf = "";
for(str_index = 0; str_index < length(str); str_index++){
if( str[str_index:str_index] == splitter ){
array_push_str(tokens, stringOf(buf));
buf = "";
}else{
buf += str[str_index:str_index];
}
}
array_push_str(tokens, stringOf(buf));
delete buf;
return tokens;
}

If you only split the string once this is how I would do it:
string s = "We prefer questions that can be answered; not just discussed"
string sub = ";"
int offset
int len
if ( findPlainText(s, sub, offset, len, false)) {
/* the reason why I subtract one and add one is to remove the delimiter from the out put.
First print is to print the prefix and then second is the suffix.*/
print s[0 : offset -1]
print s[offset +1 :]
} else {
// no delimiter found
print "Failed to match"
}
You could also use regular expressions refer to the DXL reference manual. It would be better to use regular expressions if you want to split up the string by multiple delimiters such as str = "this ; is an;example"

ACTUALLY WORKS:
This solution will split as many times as needed, or none, if the delimiter doesn't exist in the string.
This is what I have used instead of a traditional "split" command.
It actually skips the creation of an array, and just loops through each string that would be in the array and calls "someFunction" on each of those strings.
string s = "We prefer questions that can be answered; not just discussed"
// for this example, ";" is used as the delimiter
Regexp split = regexp "^(.*);(.*)$"
// while a ";" exists in s
while (split s) {
// save the text before the last ";"
string temp_s = s[match 1]
// call someFunction on the text after the last ";"
someFunction(s[match 2])
// remove the text after the last ";" (including ";")
s = temp_s
}
// call someFunction again for the last (or only) string
someFunction(s)
Sorry for necroing an old post; I just didn't find the other answers useful.

Perhaps someone would find handy this fused solution as well. It splits string in Skip, based on delimiter, which can actually have length more then one.
Skip splitString(string s1, string delimit)
{
int offset, len
Skip splited = create
while(findPlainText(s1, delimit, offset, len, false))
{
put(splited, s1[0:offset-1], s1[0:offset-1])
s1 = s1[offset+length(delimit):length(s1)-1]
}
if(length(s1)>0)
{
put (splited, s1, s1)
}
return splited
}

I tried this out and worked out for me...
string s = "We prefer questions that can be answered,not just discussed,hiyas"
string sub = ","
int offset
int len
string s1=s
while(length(s1)>0){
if ( findPlainText(s1, sub, offset, len, false)) {
print s1[0 : offset -1]"\n"
s1= s1[offset+1:length(s1)]
}
else
{
print s1
s1=""
}
}

Here is a better implementation. This is a recursive split of the string by searching a keyword.
pragma runLim, 10000
string s = "We prefer questions that can be answered,not just discussed,hiyas;
Next Line,Var1,Nemesis;
Next Line,Var2,Nemesis1;
Next Line,Var3,Nemesis2;
New,Var4,Nemesis3;
Next Line,Var5,Nemesis4;
New,Var5,Nemesis5;"
string sub = ","
int offset
int len
string searchkey=null
string curr=s
string nxt=s
string searchline=null
string Modulename=""
string Attributename=""
string Attributevalue=""
while(findPlainText(curr,"Next Line", offset,len,false))
{
int intlen=offset
searchkey=curr[offset:length(curr)]
if(findPlainText(searchkey,"Next Line",offset,len,false))
{
curr=searchkey[offset+1:length(searchkey)]
}
if(findPlainText(searchkey,";",offset,len,false))
{
searchline=searchkey[0:offset]
}
int counter=0
while(length(searchline)>0)
{
if (findPlainText(searchline, sub, offset, len, false))
{
if(counter==0)
{
Modulename=searchline[0 : offset -1]
counter++
}
else if(counter==1)
{
Attributename=searchline[0 : offset -1]
counter++
}
searchline= searchline[offset+1:length(searchline)]
}
else
{
if(counter==2)
{
Attributevalue=searchline[0:length(searchline)-2]
counter++
}
searchline=""
}
}
print "Modulename="Modulename " Attributename=" Attributename " Attributevalue= "Attributevalue "\n"
}

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Convert unicode to characters in a file using Ruby - ruby-on-rails

Related

Nom Parser To Unescape String

Ask user for path for fopen in C?

What standard produced hex-encoded characters with an extra "25" at the front?

how do i decode, change, then re-encode a CORBA IOR file (Visibroker) in my Java client code?

String Split in DXL

Categories

Resources