I am parsing an RSS feed that returns the title in this format:
<title>Some Data with spaces / Bahnofstrasse 22</title>
so the first thing I did is to replace the spaces with a +
String result = _items[index].title.replaceAll(new RegExp(r"\s+\b|\b\s"), "+");
However, what I need is to select only the text after the / (in this case Bahnofstrasse 22), replace the spaces with commas, and add a city (e.g. Zürich) before the address, so the desired result would be
Zürich,Bahnofstrasse,22
so essentially select all text after /, add Zurich and replace spaces with ,
If you know that the data will always be separated by a /, you could look at the second element of the split. For example,
final result = "Zürich," +
    _items[index].title
        .split("/")[1]
        .replaceFirst("<", "")
        .trim()
        .replaceAll(" ", ",");
Regular expressions would only really be useful if, for example, the format of the data were variable.
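For comparison, here is the same split-based idea as a minimal Python sketch (Python rather than Dart, purely for illustration). The function name format_address and the default city are assumptions, not part of the original code.

```python
def format_address(title: str, city: str = "Zürich") -> str:
    # Keep only the text after the first "/" and drop surrounding whitespace.
    address = title.split("/", 1)[1].strip()
    # split() with no argument also collapses runs of whitespace.
    return ",".join([city] + address.split())


print(format_address("Some Data with spaces / Bahnofstrasse 22"))
# prints: Zürich,Bahnofstrasse,22
```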
I am trying to parse the path part of a url.
The input is a string such as site/whatever% ^&*/page/to-days_date//, which I would like to convert into site/whatever/page/to-days_date
Things to remove would be anything that is not one of the following:
lower or upper case letter
digit / number
dash
underscore
Just add /+$ to your existing regex as an alternative, using a pipe (|). It matches one or more / characters at the end of the input, so it works for /, // or ///// at the end of the input.
import re

myString = '''blog/whatever% ^&*/page/to-days_date//'''
# The first alternative, /+$, is the new part: it strips the trailing slashes.
print(re.sub(r'/+$|[^a-zA-Z0-9_\-\/]+', '', myString))
Below is the (crude) method I'm using to export the contents of a table into a CSV. I came up with this on the fly, however the data in the table has been loaded from an Excel spreadsheet created by a Sharepoint site. I do not know if that conversion process or my method is the cause, but a large number of these characters: Â are being imported into the cells.
Also, a large number of records are having their fields split up into two rows as opposed to just one. This is my first attempt at exporting to a CSV programmatically (as opposed to using Excel), so any help would be greatly appreciated.
Controller Method
public ActionResult ExportToCsv()
{
    using (StringWriter writer = new StringWriter())
    {
        var banks = db.BankListMaster.Include(b => b.BankListAgentId).ToList();
        writer.WriteLine("Bank Name, EPURL, AssociatedTPMBD, Tier, FixedLifeMasterSAF, VariableLifeMasterSAF, FixedLifeSNY, VariableLifeMasterSNY, SpecialNotes, WelcomeLetterReq, " +
            "BackOfficeNotification, LinkRepsToDynamics, RelationshipCode, INDSGC, PENSGC, LicensingContract, MiscellaneousNotes, ContentTypeID1, CreatedBy, MANonresBizNY, Attachment");
        foreach (var item in banks)
        {
            writer.Write(item.BankName + ",");
            if (String.IsNullOrWhiteSpace(item.EPURL))
            {
                writer.Write(item.EPURL + ",");
            }
            else
            {
                writer.Write(item.EPURL.Trim() + ",");
            }
            writer.Write(item.AssociatedTPMBD + ",");
            writer.Write(item.Tier + ",");
            writer.Write(item.LicensingContract + ",");
            writer.Write(item.MiscellaneousNotes + ",");
            writer.Write(item.ContentTypeID1 + ",");
            writer.Write(item.CreatedBy + ",");
            writer.Write(item.MANonresBizNY + ",");
            writer.Write(item.Attachment);
            writer.Write(writer.NewLine);
        }
        return File(new System.Text.UTF8Encoding().GetBytes(writer.ToString().Replace("Â", "")), "text/csv", "BankList.csv");
    }
}
CSV is a file format that's poorly specified. Several important things aren't specified:
The field separator. Even though it's called "comma separated", Excel will use the semicolon sometimes (depending on your locale!).
The encoding (UTF-8, ISO-8859-1, ANSI/Windows-1252 etc.)
The kind of newlines (CR, LF or CR LF).
Whether all fields have to be quoted with double quotes or just the ones containing the field separator, the line separator, blanks etc.
Whether white space is trimmed from unquoted fields.
Whether newlines are allowed within quoted fields (Excel allows them).
How double quotes are escaped if they are part of the field content (normally they are doubled).
Excel is usually the reference for a valid CSV format. But even Excel chooses the field separator and the encoding depending on your locale.
In your case, the encoding is most likely the main problem. You write UTF-8, but the consumer treats it as ISO-8859-1 or ANSI. That is why the character Â appears so often: its byte value (0xC2) is used in UTF-8 to introduce a two-byte sequence. Change the encoding to fix the Â instead of stripping the character afterwards.
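The effect is easy to reproduce. The following is a small Python sketch (not the asker's C# code) that encodes a string containing a non-breaking space, U+00A0, as UTF-8 and then decodes it as ISO-8859-1, the way a mismatched consumer would. The sample text is made up. The last lines show one possible remedy, assuming the consumer is Excel: prefixing the output with a UTF-8 byte order mark so the encoding can be detected.

```python
# U+00A0 (non-breaking space) is common in text that came from web pages.
text = "Tier\u00a01"

utf8_bytes = text.encode("utf-8")        # the producer writes UTF-8
misread = utf8_bytes.decode("latin-1")   # the consumer assumes ISO-8859-1

print(utf8_bytes)        # the two-byte sequence starts with 0xC2
print("Â" in misread)    # True: 0xC2 read as Latin-1 is the character Â

# Assumption: the consumer honors a byte order mark (Excel generally does).
with_bom = text.encode("utf-8-sig")
print(with_bom[:3])      # the three BOM bytes EF BB BF
```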
As the next step, properly quote the text fields, i.e. add double quotes at the start and at the end and double all double quotes within the field.
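A minimal Python sketch of that quoting rule follows. The helper name quote_field and the sample row are assumptions for illustration; the round trip through the standard csv module is only there to check that a field containing a newline survives as a single record.

```python
import csv
import io


def quote_field(value: str) -> str:
    """Wrap a field in double quotes and double any embedded double quotes."""
    return '"' + value.replace('"', '""') + '"'


row = ['First "National" Bank', "line one\nline two", "plain"]
line = ",".join(quote_field(field) for field in row)

# Parse the line back with the standard library to confirm nothing was split.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed == row)
```

Quoting every text field this way also addresses the records that were split across two rows, provided the cause is a line break inside a field.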
I have read a multiline file and converted it to a list with the following code:
Lines = string:tokens(erlang:binary_to_list(Binary), "\n"),
I converted it to a string to do some work on it:
Flat = string:join(Lines, "\r\n"),
I finished working on the string and now need to convert it back to a multiline list. I tried repeating the first snippet shown above, but that did not work, and neither did string:join. How do I convert it back to a list just like it used to be (although now modified)?
Well that depends on the modifications you made on the flattened string.
string:tokens/2 will always explode a string using the separator you provide. So as long as your transformation preserves a specific string as separator between the individual substrings there should be no problem.
However, if you do something more elaborate and destructive in your transformation then the only way is to iterate on the string manually and construct the individual substrings.
Your first snippet above contains a call to erlang:binary_to_list/1, which first converts a binary to a string (list). You then split that with string:tokens/2 and join the pieces back together with string:join/2. The net effect of the tokens-then-join sequence, as you have written it, is to convert a string containing lines separated by \n into one containing lines separated by \r\n. N.B. this is a flat list of characters.
Is this what you intended?
What you should do now depends on what you mean by "I need to convert it back to a multiline list". Do you mean everything in a single list of characters (string), or a nested list of lines where each line is a list of characters (string)? That is, if you ended up with
"here is line 1\r\nhere is line 2\r\nhere is line 3\r\n"
then this already is a multiline string. Or do you mean
["here is line 1","here is line 2","here is line 3"]
Note that each "string" is itself a list of characters. What do you intend to do with it afterwards?
You have your terms confused. A string in any language is a sequence of integer values corresponding to human-readable characters. Whether the value is represented as a binary or a list does not matter; both are technically strings because of the data they contain.
That being said, you converted a binary string to a list string in your first set of instructions. To convert a list into a binary, you can call erlang:list_to_binary/1, or erlang:iolist_to_binary/1 if your list is not flat. For instance:
BinString = <<"this\nis\na\nstring">>.
ListString = "this\nis\na\nstring" = binary_to_list(BinString).
Words = ["this", "is", "a", "string"] = string:tokens(ListString, "\n").
<<"thisisastring">> = iolist_to_binary(Words).
Rejoined = "this\r\nis\r\na\r\nstring" = string:join(Words, "\r\n").
BinAgain = <<"this\r\nis\r\na\r\nstring">> = list_to_binary(Rejoined).
For your reference, the string module always expects a flat list (e.g., "this is a string", but not ["this", "is", "a", "string"]), except for string:join, which takes a list of flat strings.
I have placed the following in cell A1:
"a lot of text marker: xxx some more text"
I would like to copy the xxx value into cell A2.
Any suggestions on how this could be done?
Thanks
=MID(A1, FIND("marker:",A1) + LEN("marker:"), 4)
I am assuming that the xxx (per your example) is 3 characters long and a space is present between "marker:" and "xxx".
Just my two cents. Find() is case sensitive so if the text in A1 is
"a lot of text Marker: xxx some more text"
Then Find will give you an error.
You can use Search() in lieu of FIND()
=MID(A1, SEARCH("marker: ",A1) + LEN("marker: "), 3)
Also depending upon your regional settings you might have to use ";" instead of ","
If you wanted a VBA solution, this worked for me using your sample input:
Function GetValue(rng As Excel.Range) As String
    Dim tempValue As String
    Dim arrValues() As String
    ' get value from source range
    tempValue = rng.Value
    ' split by ":" character
    arrValues = Split(tempValue, ":")
    ' split by spaces and take the second array element
    ' because there is a space between ":" and "xxx"
    GetValue = Trim$(Split(arrValues(1), " ")(1))
End Function
To use, put this code into a standard module, since a function defined in a sheet module cannot be called from a cell by its bare name (see Where do I paste the code that I want to use in my workbook for placement assistance), and then put the following into cell A2:
=GetValue(A1)
I've been given a large file with a funny CSV format to parse into a database.
The separator character is a semicolon (;). If one of the fields contains a semicolon, it is "escaped" by wrapping it in double quotes, like this: ";".
I have been assured that there will never be two adjacent fields with trailing/leading double quotes, so this format should technically be ok.
Now, for parsing it in VBScript I was thinking of
Replacing each instance of ";" with a GUID,
Splitting the line into an array by semicolon,
Running back through the array, replacing the GUIDs with ";"
It seems to be the quickest way. Is there a better way? I guess I could use substrings but this method seems to be acceptable...
Your method sounds fine with the caveat that there's absolutely no possibility that your GUID will occur in the text itself.
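As a rough illustration of the placeholder idea, here is a Python sketch (the question is about VBScript, so treat this as a description of the steps rather than drop-in code). The function name split_with_placeholder is an assumption. It also shows one way to address the caveat: generate a new placeholder until it does not occur in the line.

```python
import uuid


def split_with_placeholder(line: str) -> list[str]:
    placeholder = uuid.uuid4().hex
    # Guard against the unlikely case that the placeholder occurs in the data.
    while placeholder in line:
        placeholder = uuid.uuid4().hex
    # 1. Hide every escaped semicolon behind the placeholder.
    protected = line.replace('";"', placeholder)
    # 2. Split on the real separators, 3. restore the semicolons.
    return [field.replace(placeholder, ";") for field in protected.split(";")]


print(split_with_placeholder('Pax;is;a;good;guy";" so;says;his;wife.'))
# prints: ['Pax', 'is', 'a', 'good', 'guy; so', 'says', 'his', 'wife.']
```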
One approach I've used for this type of data before is to split on the semicolons regardless and then, if two adjacent fields end and start with a quote, combine them.
For example:
Pax;is;a;good;guy";" so;says;his;wife.
becomes:
0 Pax
1 is
2 a
3 good
4 guy"
5 " so
6 says
7 his
8 wife.
Then, when you discover that fields 4 and 5 end and start (respectively) with a quote, you combine them by replacing the field 4 closing quote with a semicolon and removing the field 5 opening quote (and joining them of course).
0 Pax
1 is
2 a
3 good
4 guy; so
5 says
6 his
7 wife.
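The split-then-combine steps above can be sketched in Python as follows. The function name split_and_merge is hypothetical, and the sketch relies on the guarantee from the question that two genuinely separate fields never end and start with a quote.

```python
def split_and_merge(line: str) -> list[str]:
    fields: list[str] = []
    for piece in line.split(";"):
        # A previous field ending in a quote followed by a piece starting
        # with one means the semicolon between them was escaped as ";".
        if fields and fields[-1].endswith('"') and piece.startswith('"'):
            fields[-1] = fields[-1][:-1] + ";" + piece[1:]
        else:
            fields.append(piece)
    return fields


print(split_and_merge('Pax;is;a;good;guy";" so;says;his;wife.'))
# prints: ['Pax', 'is', 'a', 'good', 'guy; so', 'says', 'his', 'wife.']
```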
In pseudo-code, given:
input: A string; the first character is input[0] and the last
       character is input[length-1]. Further, assume one dummy
       character, input[length]. It can be anything except
       ; and ". This string is one line of the "CSV" file.
length: positive integer, number of characters in input
Do this:
set start = 0
if input[0] = ';':
    you have a blank field in the beginning; do whatever with it
    set start = 1
endif
for each c between 1 and length-1:
    next iteration unless input[c] = ';'
    if input[c-1] ≠ '"' or input[c+1] ≠ '"': // i.e. not the escape sequence ";"
        found field consisting of half-open range [start,c); do whatever
        with it. Note that in the case of empty fields, start = c, leaving
        an empty range
        set start = c+1
    endif
end foreach
the last field is the half-open range [start,length); do whatever with it
Untested, of course. Debugging code like this is always fun….
The special case of input[0] is to make sure we don't ever look at input[-1]. If you can make input[-1] safe, then you can get rid of that special case. You can also put a dummy character in input[0] and then start your data—and your parsing—from input[1].
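A possible Python rendering of this scan is sketched below, using the dummy-character idea on both ends so that neither input[-1] nor the position past the last character needs a special case. The function name split_fields and the choice of a space as the dummy character are assumptions. Like the pseudo-code, it leaves the escape sequence ";" inside the fields untouched.

```python
def split_fields(line: str) -> list[str]:
    # Pad with one dummy character on each side (anything except ; and ").
    padded = " " + line + " "
    fields = []
    start = 1  # index of the first real character in the padded string
    for c in range(1, len(padded) - 1):
        if padded[c] != ";":
            continue
        # A semicolon is a separator unless both neighbors are quotes.
        if padded[c - 1] != '"' or padded[c + 1] != '"':
            fields.append(padded[start:c])
            start = c + 1
    # Whatever follows the last separator is the final field.
    fields.append(padded[start:len(padded) - 1])
    return fields


print(split_fields('Pax;is;a;good;guy";" so;says;his;wife.'))
```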
One option would be to find instances of the regex:
[^"];[^"]
and then break the string apart with substring:
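// requires: using System.Collections.Generic; using System.Text.RegularExpressions;
List<string> ret = new List<string>();
Regex r = new Regex(@"[^""];[^""]");
Match m;
while ((m = r.Match(line)).Success)
{
    // The match starts one character before the separator.
    ret.Add(line.Substring(0, m.Index + 1));
    line = line.Substring(m.Index + 2);
}
ret.Add(line); // whatever is left is the last field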
(Sorry about the C#, I don't know VBScript)
Using quotes is normal for .csv files. If a field itself contains quotes, you may see the opening or closing quote and the embedded quote strung together, two or three in a row.
If you're using SQL Server you could try using T-SQL to handle everything for you.
SELECT * INTO MyTable FROM OPENDATASOURCE('Microsoft.JET.OLEDB.4.0',
'Data Source=F:\MyDirectory;Extended Properties="text;HDR=No"')...
[MyCsvFile#csv]
That will create and populate "MyTable". Read more on this subject here on SO.
I would recommend using RegEx to break up the strings.
Find every ';' that is not a part of ";" and change it to something else that does not appear in your fields.
Then go through and replace ";" with ;
Now you have your fields with the correct data.
Most importers can swap out separator characters pretty easily.
This is basically your GUID idea. Just make sure the GUID is unique to your file before you start and you will be fine. I tend to start using 'Z'. After enough 'Z's, you will be unique (sometimes as few as 1-3 will do).
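One way to express "every ; that is not part of ";"" as a regular expression is with lookarounds. The following Python sketch uses the standard re module; the pattern and the function name split_regex are assumptions about what such a regex could look like, not something taken from the answer. A semicolon counts as a separator if it is not preceded by a quote or not followed by one.

```python
import re

# ';' not preceded by a quote, or ';' not followed by a quote.
SEPARATOR = re.compile(r'(?<!");|;(?!")')


def split_regex(line: str) -> list[str]:
    # Split on real separators, then turn each escaped ";" back into ';'.
    return [field.replace('";"', ";") for field in SEPARATOR.split(line)]


print(split_regex('Pax;is;a;good;guy";" so;says;his;wife.'))
# prints: ['Pax', 'is', 'a', 'good', 'guy; so', 'says', 'his', 'wife.']
```

Splitting directly on the pattern avoids the intermediate replacement character altogether.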
Jacob