I am reading from a txt file which has two lines, the first one containing numbers followed by a space. The second line is just empty.
1 2 3
Using the code below to read the file results in an error (FormatException). The fourth element of tmp is an empty string according to the debugger (see below), but has the length 1, according to the print statement, and is not caught by removeWhere().
import 'dart:io';
void main() async {
String filename = "text.txt";
String content = await File("$filename").readAsString();
List<String> lines = content.split("\n");
List<String> tmp = lines[0].split(" ");
tmp.removeWhere((element) => element.isEmpty);
print(tmp[3].length);
print(tmp.map((e) {
return int.parse(e);
}).toList());
}
Removing the empty line of the text file surprisingly solves the problem.
I do not understand the reason for the problem. Is there any solution?
Probably, the file is stored with CR LF line endings (Windows format). CR LF is '\r\n'. The lines are split by '\n', so lines[0] is '1 2 3 \r'.
Store the text file with LF line endings (Unix format), replace all '\r\n' in content with '\n' before splitting, or use LineSplitter, which accepts all line endings (CR, LF, and CR LF), or the readAsLines method directly. LineSplitter and readAsLines also strip the last line if it is empty.
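The effect can be reproduced outside Dart. The following is a minimal Python sketch, assuming the file content is exactly "1 2 3 \r\n\r\n" as guessed above; the variable names are illustrative and not taken from the question.

```python
# Assumed file content: numbers, a trailing space, CR LF, then an empty line.
content = "1 2 3 \r\n\r\n"

# Splitting on "\n" alone leaves the carriage return attached to the first line.
naive_first = content.split("\n")[0]            # "1 2 3 \r"
naive_tokens = [t for t in naive_first.split(" ") if t != ""]

# splitlines() recognises CR, LF and CR LF, so no stray "\r" survives.
clean_first = content.splitlines()[0]           # "1 2 3 "
clean_tokens = [t for t in clean_first.split(" ") if t != ""]

print(naive_tokens)                    # the fourth token is "\r": length 1, not empty
print([int(t) for t in clean_tokens])
```

The stray token is not empty, which is why a filter on emptiness does not remove it.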
I am using Smarter CSV and have encountered a CSV file that has blank lines. Is there any way to ignore these? Smarter CSV is taking the blank line as a header and not processing the file correctly. Is there any way I can bastardize the comment_regexp?
mail.attachments.each do | attachment |
filename = attachment.filename
#filedata = attachment.decoded
puts filename
begin
tmp = Tempfile.new(filename)
tmp.write attachment.decoded
tmp.close
puts tmp.path
f = File.open(tmp.path, "r:bom|utf-8")
options = {
:comment_regexp => /^#/
}
data = SmarterCSV.process(f, options)
f.close
puts data
Sample file: [image: output]
Let's first construct your file.
str = <<~_
#
# Report
#---------------
Date header1 header2 header3 header4
20200 jdk;df 4543 $8333 4387
20200 jdk 5004 $945876 67
_
fin_name = 'in'
File.write(fin_name, str)
#=> 223
Two problems must be addressed to read this file using the method SmarterCSV::process. The first is that comments--lines beginning with an octothorpe ('#')--and blank lines must be skipped. The second is that the field separator is not a fixed-length string.
The first of these problems can be dealt with by setting the value of process' :comment_regexp option key to a regular expression:
:comment_regexp => /\A#|\A\s*\z/
which reads, "match an octothorpe at the beginning of the string (\A being the beginning-of-string anchor) or (|) match a string containing zero or more whitespace characters (\s being a whitespace character and \z being the end-of-string anchor)".
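To see which lines such a pattern skips, here is a small Python sketch. It is a translation, not the Ruby code itself: Ruby's \z is spelled \Z in Python's re module, and the sample strings and the name SKIP_LINE are made up for illustration.

```python
import re

# Python spelling of the Ruby pattern /\A#|\A\s*\z/ (Ruby's \z is \Z in Python).
SKIP_LINE = re.compile(r"\A#|\A\s*\Z")

# Hypothetical sample lines: comments, blank lines and ordinary data.
samples = ["# Report", "#------", "", "   \t", "\n", "Date header1", " # indented"]

skipped = [s for s in samples if SKIP_LINE.search(s)]
kept = [s for s in samples if not SKIP_LINE.search(s)]

print(skipped)
print(kept)
```

Note that a line with spaces before the octothorpe is kept, because the first alternative is anchored to the very start of the string.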
Unfortunately, SmarterCSV is not capable of dealing with variable-length field separators. It does have an option :col_sep, but its value must be a string, not a regular expression.
We must therefore pre-process the file before using SmarterCSV, though that is not difficult. While we are at it, we may as well remove the dollar signs and use commas for field separators.[1]
fout_name = 'out.csv'
fout = File.new(fout_name, 'w')
File.foreach(fin_name) do |line|
fout.puts(line.strip.gsub(/\s+\$?/, ',')) unless
line.match?(/\A#|\A\s*\z/)
end
fout.close
Let's look at the file produced.
puts File.read(fout_name)
displays
Date,header1,header2,header3,header4
20200,jdk;df,4543,8333,4387
20200,jdk,5004,945876,67
Now that's what a CSV file should look like! We may now use SmarterCSV on this file with no options specified:
SmarterCSV.process(fout_name)
#=> [{:date=>20200, :header1=>"jdk;df", :header2=>4543,
# :header3=>8333, :header4=>4387},
# {:date=>20200, :header1=>"jdk", :header2=>5004,
# :header3=>945876, :header4=>67}]
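For readers who do not use Ruby, the same pre-processing step can be sketched in Python. This is an illustration under assumptions: the function name to_csv_lines is invented here, and the column spacing in the sample rows is a guess, since the original spacing of the file is not preserved.

```python
import re

SKIP_LINE = re.compile(r"\A#|\A\s*\Z")   # comment lines and blank lines

def to_csv_lines(lines):
    """Drop comment and blank lines; turn each whitespace run (plus an optional '$') into a comma."""
    out = []
    for line in lines:
        if SKIP_LINE.search(line):
            continue
        out.append(re.sub(r"\s+\$?", ",", line.strip()))
    return out

# Assumed input: the spacing between columns is illustrative.
raw = [
    "#\n",
    "# Report\n",
    "#---------------\n",
    "\n",
    "Date     header1   header2  header3 header4\n",
    "20200    jdk;df    4543     $8333   4387\n",
    "\n",
    "20200    jdk       5004     $945876 67\n",
]

csv_lines = to_csv_lines(raw)
print("\n".join(csv_lines))
```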
1. I used IO::foreach to read the file line-by-line and then write each manipulated line that is neither a comment nor a blank line to the output file. If the file is not huge we could instead gulp it into a string, modify the string and then write the resulting string to the output file: File.write(fout_name, File.read(fin_name).gsub(/^#.*?\n|^[ \t]*\n|^[ \t]+|[ \t]+$|\$/, '').gsub(/[ \t]+/, ',')). The first regular expression reads, "match lines beginning with an octothorpe or lines containing only spaces and tabs or spaces and tabs at the beginning of a line or spaces and tabs at the end of a line or a dollar sign". The second gsub merely converts multiple tabs and spaces to a comma.
I have a file with an old format from the 70s used in Companies House (UK company registry).
I inherited a parser written 6 years ago which goes line by line and according to a set of conditions extracts the information from the line and inserts them into a dictionary.
There is a weird character that is breaking a line.
I copied this line to a new file with awk '{if(NR==33411) print $0}' PROD216_1950_ew_1.dat > broken and opened broken in vim.
It turns out that vim displays the weird character as <85>.
The result is that everything after MAYFIELD is read as a new line.
Below the line in question:
000376702103032986930001 1993010119941024 193709 0105<BARRY ALEXANDER<GROSVENOR<<<<MAYFIELD 3<41 PLANTATION ROAD<THE PEAK<<HONG KONG<BANK EXECUTIVE<BRITISH<<
in vim becomes
000376702103032986930001 1993010119941024 193709 0105<BARRY ALEXANDER<GROSVENOR<<<<MAYFIELD <85>3<41 PLANTATION ROAD<THE PEAK<<HONG KONG<BANK EXECUTIVE<BRITISH<<
I am using codecs to read this file with a context manager, which I thought was the way of going about it -
Is there anything I am missing? What is that <85>?
with codecs.open(filepath, 'r', 'utf-8') as fh:
for line in fh:
linetype = determine_line_type(line)
if linetype == 'header':
continue
elif linetype == 'company':
do stuff...
elif linetype == 'officer':
do stuff...
vim shows <85> to indicate a hex 85 byte that is invalid in the current encoding (i.e., the encoding it's using to decode the file).
My guess is that the file's encoding is Windows-1252, in which hex 85 denotes the ellipsis character.
So the solution for your parser might be as simple as changing 'utf-8' to 'cp1252' in the codecs.open call.
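A quick way to check how a 0x85 byte behaves under each encoding is to decode a short fragment by hand. The sketch below uses a made-up fragment of the record; it only shows what the two codecs do, not which encoding the file really uses.

```python
raw = b"MAYFIELD \x85" + b"3"   # hypothetical fragment containing a lone 0x85 byte

# In Windows-1252, byte 0x85 maps to U+2026 HORIZONTAL ELLIPSIS.
as_cp1252 = raw.decode("cp1252")

# A lone 0x85 is a continuation byte, so strict UTF-8 decoding rejects it.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The two-byte sequence C2 85, by contrast, is valid UTF-8 for U+0085 (NEXT LINE).
nel = b"\xc2\x85".decode("utf-8")

print(repr(as_cp1252), utf8_ok, repr(nel))
```

If the file decodes as UTF-8 without errors, the 0x85 byte is probably the second half of a C2 85 pair rather than a Windows-1252 ellipsis.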
After searching around for some time, I came up with this solution, which works.
with open(filepath, encoding='utf-8') as fh:
for line in fh:
byteline = bytearray(line, encoding='utf-8').replace(b'\xc2\x85', b'')
line_clean = byteline.decode(encoding='utf-8')
# do stuff with clean line.
The byte sequence that breaks the line is b'\xc2\x85'. In UTF-8 this is the encoding of U+0085 (NEXT LINE, a control character), not an ellipsis; the ellipsis reading applies only to a lone 0x85 byte in Windows-1252.
The fix works in three steps: encode the string to an array of bytes with bytearray, remove the sequence with the replace method of the bytearray class, and finally decode the clean line with the decode method, which returns the string without the weird character.
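Why switching from codecs.open to the built-in open helps can be illustrated as follows. This is a sketch with an invented record, assuming the file really contains the bytes C2 85: codecs stream readers split lines with str.splitlines(), which treats U+0085 as a line boundary, whereas the built-in text layer only splits on \n, \r and \r\n.

```python
import codecs
import io

# Hypothetical record, UTF-8 encoded, with C2 85 in the middle.
data = b"MAYFIELD \xc2\x85" + b"3<41 PLANTATION ROAD\n"

# codecs stream reader: the record is cut in two at U+0085.
codecs_lines = list(codecs.getreader("utf-8")(io.BytesIO(data)))

# Built-in text layer (what open() uses): the record stays in one piece.
text_lines = list(io.TextIOWrapper(io.BytesIO(data), encoding="utf-8"))

# With the line intact, the character can also be removed directly on the str.
cleaned = [line.replace("\x85", "") for line in text_lines]

print(len(codecs_lines), len(text_lines), cleaned)
```

Removing the character on the str with replace("\x85", "") gives the same result as the bytearray round trip, with one step fewer.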
I am currently trying to read a file as this:
53**7****\n6**195***\n*98****6*\n8***6***3\n4**8*3**1\n7***2***6\n*6****28*\n***419**5\n****8**79\n
And write it to the screen, but with new lines instead of the \n.
On the msdn description of the method StreamReader.ReadLine () it says that:
A line is defined as a sequence of characters followed by a line feed ("\n"), a carriage return ("\r"), or a carriage return immediately followed by a line feed ("\r\n"). The string that is returned does not contain the terminating carriage return or line feed. The returned value is null if the end of the input stream is reached.
Why does my program not interpret \n as a new line?
Well, the problem is that the documentation for ReadLine is talking about the '\n' (single) character, while you actually have a "\\n" two-character string.
In C#, \ is used as an escape character - for example, \n represents the character with ASCII value 10. However, files are not parsed according to C# rules (that's a good thing!). Since your file doesn't contain any character with value 10, nothing in it is interpreted as a line ending, and rightly so - what the file actually contains, in ASCII, is the pair (92, 110).
Just use Split (or Replace), and you'll be fine. Basically, you want to replace "\\n" with "\n" (or better, Environment.NewLine).
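The same idea in a short Python sketch (shown in Python rather than C# or F#; the sample text is shortened from the question). A raw string keeps the backslash and the n as two separate characters, which mimics what the file contains.

```python
# Raw string: each "\n" below is a backslash (92) followed by "n" (110), not a line feed.
file_text = r"53**7****\n6**195***\n*98****6*\n"

# Split on the two-character sequence and drop the empty piece after the last separator.
rows = [row for row in file_text.split("\\n") if row]

print("\n".join(rows))
```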
I used @Mark Seemann's method by letting let s = inputStream.ReadToEnd () and thereby importing the string you are typing in directly. I am able to print out the same output as you with your do-while loop, but I have to use this recursive printFile method:
let rec printFile (reader : System.IO.StreamReader) =
if not(reader.EndOfStream) then
let line = reader.ReadLine ()
printfn "%s" line
printFile reader
This, however, does not recognize the \n as new lines. Do you know why, given that the two methods look very similar to me? Thanks!
I can't reproduce the issue. This seems to work fine:
let s = "53**7****\n6**195***\n*98****6*\n8***6***3\n4**8*3**1\n7***2***6\n*6****28*\n***419**5\n****8**79\n"
open System.IO
let strm = new MemoryStream()
let sw = new StreamWriter(strm)
sw.Write s
sw.Flush ()
strm.Position <- 0L
let sr = new StreamReader(strm)
while (not sr.EndOfStream) do
let line = sr.ReadLine ()
printfn "%s" line
This prints
53**7****
6**195***
*98****6*
8***6***3
4**8*3**1
7***2***6
*6****28*
***419**5
****8**79
Below is the (crude) method I'm using to export the contents of a table into a CSV. I came up with this on the fly, however the data in the table has been loaded from an Excel spreadsheet created by a Sharepoint site. I do not know if that conversion process or my method is the cause, but a large number of these characters: Â are being imported into the cells.
Also, a large number of records are having their fields split up into two rows as opposed to just one. This is my first attempt at exporting to a CSV programmatically (as opposed to using Excel), so any help would be greatly appreciated.
Controller Method
public ActionResult ExportToCsv()
{
using (StringWriter writer = new StringWriter())
{
var banks = db.BankListMaster.Include(b => b.BankListAgentId).ToList();
writer.WriteLine("Bank Name, EPURL, AssociatedTPMBD, Tier, FixedLifeMasterSAF, VariableLifeMasterSAF, FixedLifeSNY, VariableLifeMasterSNY, SpecialNotes, WelcomeLetterReq, " +
"BackOfficeNotification, LinkRepsToDynamics, RelationshipCode, INDSGC, PENSGC, LicensingContract, MiscellaneousNotes, ContentTypeID1, CreatedBy, MANonresBizNY, Attachment");
foreach (var item in banks)
{
writer.Write(item.BankName + ",");
if(String.IsNullOrWhiteSpace(item.EPURL))
{
writer.Write(item.EPURL + ",");
}
else
{
writer.Write(item.EPURL.Trim() + ",");
}
writer.Write(item.AssociatedTPMBD + ",");
writer.Write(item.Tier + ",");
writer.Write(item.LicensingContract + ",");
writer.Write(item.MiscellaneousNotes + ",");
writer.Write(item.ContentTypeID1 + ",");
writer.Write(item.CreatedBy + ",");
writer.Write(item.MANonresBizNY + ",");
writer.Write(item.Attachment);
writer.Write(writer.NewLine);
}
return File(new System.Text.UTF8Encoding().GetBytes(writer.ToString().Replace("Â", "")), "text/csv", "BankList.csv");
}
}
CSV is a file format that's poorly specified. Several important things aren't specified:
The field separator. Even though it's called "comma separated", Excel will use the semicolon sometimes (depending on your locale!).
The encoding (UTF-8, ISO-8859-1, ANSI/Windows-1252 etc.)
The kind of newlines (CR, NL or CR NL).
Whether all fields have to be quoted with double quotes or just the ones containing the field separator, the line separator, blanks etc.
Whether white space is trimmed from unquoted fields.
Whether newlines are allowed within quoted fields (Excel allows them).
How double quotes are escaped if they are part of the field content (normally they are doubled)
Excel is usually the reference for a valid CSV format. But even Excel chooses the field separator and the encoding depending on your locale.
In your case, the encoding is most likely the main problem. You write UTF-8 but the consumer reads it as ISO-8859-1 or ANSI. That is why the character Â appears so often: its byte value (0xC2) is the lead byte of many two-byte UTF-8 sequences. Change the encoding to fix the Â.
As the next step, properly quote the text fields, i.e. add double quotes at the start and at the end and double all double quotes within the field.
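The quoting rules described above are what most CSV libraries apply by default. The following Python sketch uses the standard csv module to show them; the two-column rows are invented for illustration, and the byte order mark is an assumption about what the consuming program (commonly Excel) needs in order to recognise UTF-8.

```python
import csv
import io

rows = [
    ["Bank Name", "SpecialNotes"],   # hypothetical two-column subset of the real header
    ['First "National" Bank', "Line one\nline two, with a comma"],
]

# newline="" stops the text layer from translating the writer's \r\n terminators.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
csv_text = buffer.getvalue()

# "utf-8-sig" prepends a byte order mark to the UTF-8 bytes.
payload = csv_text.encode("utf-8-sig")

# Reading the text back yields the original fields, embedded newline included.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))

print(csv_text)
```

Fields containing a comma, a double quote or a newline are wrapped in double quotes, and embedded double quotes are doubled, so a record is no longer split across two rows by the reader.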
I am modifying a Delphi app. In it I'm getting a text from a combo box. The problem is that when I save the text in the table, it contains a carriage return. In debug mode it shows like this:
newStr := 'Projector Ex320u-st Short Throw '#$A'1024 X 768 2700lm'
Then I have put
newStr := StringReplace(newStr,'#$A','',[rfReplaceAll]);
to remove the '#$A' thing. But this doesn't remove it.
Is there any other way to do this?
Thanks
Remove the quotes around the #$A:
newStr := StringReplace(newStr,#$A,'',[rfReplaceAll]);
The # tells Delphi that you are specifying a character by its numerical code.
The $ says you are specifying in Hexadecimal.
The A is the value.
With the quotes you are searching for the presence of the #$A characters in the string, which aren't found, so nothing is replaced.
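The difference between the quoted text and the character code can be shown in any language. Here is a Python analogue, not Delphi code: "#$A" stands for the literal three characters, and chr(0x0A) for the line feed that Delphi writes as #$A.

```python
# The combo box text, with a real line feed (character code 10) in the middle.
text = "Projector Ex320u-st Short Throw \n1024 X 768 2700lm"

# Searching for the three visible characters '#', '$', 'A' finds nothing to replace.
unchanged = text.replace("#$A", "")

# Searching for the character whose code is hexadecimal A (decimal 10) removes the line feed.
fixed = text.replace(chr(0x0A), "")

print(repr(unchanged))
print(repr(fixed))
```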
Adapted from http://www.delphipages.com/forum/showthread.php?t=195756
The '#' denotes an ASCII character followed by a byte value (0..255).
The $A is hexadecimal which equals 10 and $D is hexadecimal which equals 13.
#$A and #$D (or #10 and #13) are ASCII line feed and carriage return characters respectively.
Line feed = ASCII character $A (hex) or 10 (dec): #$A or #10
Carriage return = ASCII character $D (hex) or 13 (dec): #$D or #13
So if you wanted to add 'Ok' and another line:
Memo.Lines.Add('Ok' + #13#10)
or
Memo.Lines.Add('Ok' + #$D#$A)
To remove the control characters (and white spaces) from the beginning and end of a string:
MyString := Trim(MyString)
Why doesn't Pos() find them?
That is only how Delphi displays control characters to you. If you were to do Pos(#13, MyString) or Pos(#10, MyString), it would return the position.