I have a large file containing two different formats separated by a dashed line. How can I split the file into two tempfiles for processing?
Example:
yaml:format
yaml:format
yaml:format
---------
csv,format
csv,format
etc.
Split at exactly nine dashes (matching the example above):
yaml, csv = input.split('---------', 2)
Or at a variable number of dashes:
yaml, csv = input.split(/^-+$/, 2)
This will leave empty lines around the delimiter (at the end of the YAML part and the start of the CSV part). If you want to get rid of them, you can do:
yaml, csv = input.split(/[\r\n]+^-+$[\r\n]+/, 2)
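To get the two parts into tempfiles for processing, here is a minimal sketch (the input file name input.txt is a placeholder; adjust to wherever your data comes from):

require 'tempfile'

input = File.read('input.txt')
yaml_part, csv_part = input.split(/[\r\n]+^-+$[\r\n]+/, 2)

yaml_file = Tempfile.new(['part', '.yaml'])
csv_file  = Tempfile.new(['part', '.csv'])

yaml_file.write(yaml_part)
csv_file.write(csv_part)
[yaml_file, csv_file].each(&:flush) # flush so the files can be read elsewhere

# ... process yaml_file.path and csv_file.path here ...

[yaml_file, csv_file].each { |f| f.close; f.unlink } # clean up when done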
words_to_guess = (which code should I enter here in order to import words from a text file?)
My text file is named words.txt, so I tried:
words_to_guess =import words.txt
You can assign the content of a .txt file to a variable using this code:
words_to_guess = open("words.txt").read()
(Note the .read() call: open() by itself returns a file object, not the file's text, so the split() step below would fail without it.)
Just make sure that the file "words.txt" is in the same directory as the .py file (or if you're compiling it to a .exe, the same directory as the .exe file)
I would also like to point out that, based on the screenshot you provided, it looks like you're trying to get a random word from the .txt file. Once you've run the above code, I would recommend adding this below it as well:
words_to_guess = words_to_guess.split()
This will split the content of words_to_guess on whitespace into a list of words that can be accessed further. You can then call:
word = random.choice(words_to_guess)
This selects a random element from the list and assigns it to the word variable, giving you a random word from the .txt file.
Just note that split() treats the spaces between words as boundaries, so a multi-word entry like "Halloween Pumpkin" or "American Flag" would have each individual word become its own element: ["Halloween", "Pumpkin"] or ["American", "Flag"].
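Putting it all together, here is a minimal sketch (assuming words.txt sits next to the script, as above):

import random

# Read the whole file and split it on whitespace into a list of words.
# If entries can contain spaces (one entry per line), use
# f.read().splitlines() instead of f.read().split().
with open("words.txt") as f:
    words_to_guess = f.read().split()

word = random.choice(words_to_guess)  # pick one random word
print(word)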
That's all!
I'm receiving a CSV file that always includes extra lines at the end, which I'd like to remove before copying the data into the PostgreSQL database of my Rails app.
I can't use head with a negative argument because I'm on Mac OS X.
What's a clean and efficient way to pre-process this file?
Right now I'm doing this, but am wondering if there is a less mish-mash way:
# Removes the last n rows from the file located at PATH
total = `wc -c < #{PATH}`.strip.to_i                    # total size of the file in bytes
chop_index = `tail -n #{n} #{PATH} | wc -c`.strip.to_i  # bytes occupied by the last n lines
`dd if=/dev/null of=#{PATH} seek=1 bs=#{total - chop_index}` # truncate the file in place at that byte count
This is about the simplest way I can think of to do this in pure Ruby that also works for large files, since it processes one line at a time instead of reading the whole file into memory:
INFILE = "input.txt"
OUTFILE = "output.txt"
total_lines = File.foreach(INFILE).inject(0) { |c, _| c+1 }
desired_lines = total_lines - 4
# open output file for writing
File.open(OUTFILE, 'w') do |outfile|
# open input file for reading
File.foreach(INFILE).with_index do |line, index|
# stop after reaching the desired line number
break if index == desired_lines
# copy lines from infile to outfile
outfile << line
end
end
However, this is about twice as slow as what you posted on a 160 MB file I created. You can shave off about a third by using wc to get the total line count and pure Ruby for the rest:
total_lines = `wc -l < #{INFILE}`.strip.to_i
# rest of the Ruby File code
Another caveat is that your CSV must not have its own line breaks within any cell content. If it does, you would need a real CSV parser, and CSV.foreach(INFILE) do |row| could be used instead, though it was quite a bit slower in my limited testing; you mentioned above that your cells should be safe to process line by line. A sketch of that variant follows.
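For completeness, a minimal sketch of the CSV-parser variant (assuming the same INFILE/OUTFILE constants as above and the standard csv library; treat it as a starting point, not a tuned solution):

require 'csv'

# Counting rows with the CSV parser keeps multi-line cells intact
total_rows = CSV.foreach(INFILE).count
desired_rows = total_rows - 4

CSV.open(OUTFILE, 'w') do |out|
  CSV.foreach(INFILE).with_index do |row, index|
    break if index == desired_rows  # stop before the trailing rows
    out << row
  end
end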
That said, what you posted using wc and dd is much faster, so maybe you should keep using that.
I am using SmarterCSV and have encountered a CSV that has blank lines. Is there any way to ignore these? SmarterCSV is taking a blank line as the header and not processing the file correctly. Is there any way I can bastardize the comment_regexp?
mail.attachments.each do |attachment|
  filename = attachment.filename
  # filedata = attachment.decoded
  puts filename
  begin
    tmp = Tempfile.new(filename)
    tmp.write attachment.decoded
    tmp.close
    puts tmp.path

    f = File.open(tmp.path, "r:bom|utf-8")
    options = {
      :comment_regexp => /^#/
    }
    data = SmarterCSV.process(f, options)
    f.close
    puts data
  end
end
Let's first construct your file.
str = <<~_
#
# Report
#---------------
Date header1 header2 header3 header4
20200 jdk;df 4543 $8333 4387
20200 jdk 5004 $945876 67
_
fin_name = 'in'
File.write(fin_name, str)
#=> 223
Two problems must be addressed to read this file using the method SmarterCSV::process. The first is that comments (lines beginning with an octothorpe, '#') and blank lines must be skipped. The second is that the field separator is not a fixed-length string.
The first of these problems can be dealt with by setting the value of process' :comment_regexp option key to a regular expression:
:comment_regexp => /\A#|\A\s*\z/
which reads, "match an octothorpe at the beginning of the string (\A being the beginning-of-string anchor) or (|) match a string containing zero or more whitespace characters (\s being a whitespace character and \z being the end-of-string anchor)".
Unfortunately, SmarterCSV is not capable of dealing with variable-length field separators. It does have an option :col_sep, but its value must be a string, not a regular expression.
We must therefore pre-process the file before using SmarterCSV, though that is not difficult. While we are at it, we may as well remove the dollar signs and use commas for field separators.[1]
fout_name = 'out.csv'
fout = File.new(fout_name, 'w')
File.foreach(fin_name) do |line|
fout.puts(line.strip.gsub(/\s+\$?/, ',')) unless
line.match?(/\A#|\A\s*\z/)
end
fout.close
Let's look at the file produced.
puts File.read(fout_name)
displays
Date,header1,header2,header3,header4
20200,jdk;df,4543,8333,4387
20200,jdk,5004,945876,67
Now that's what a CSV file should look like! We may now use SmarterCSV on this file with no options specified:
SmarterCSV.process(fout_name)
#=> [{:date=>20200, :header1=>"jdk;df", :header2=>4543,
# :header3=>8333, :header4=>4387},
# {:date=>20200, :header1=>"jdk", :header2=>5004,
# :header3=>945876, :header4=>67}]
[1] I used IO::foreach to read the file line by line and then write each manipulated line that is neither a comment nor a blank line to the output file. If the file is not huge, we could instead slurp it into a string, modify the string and then write the resulting string to the output file: File.write(fout_name, File.read(fin_name).gsub(/^#.*?\n|^[ \t]*\n|^[ \t]+|[ \t]+$|\$/, '').gsub(/[ \t]+/, ',')). The first regular expression reads, "match lines beginning with an octothorpe, or lines containing only spaces and tabs, or spaces and tabs at the beginning of a line, or spaces and tabs at the end of a line, or a dollar sign". The second gsub merely converts each remaining run of tabs and spaces to a comma.
I have a 26-million-row dataset, and when I try parsing it with the uniVocity parser it reads only 18 million rows.
Each row's field count varies from 158 to 162, with ASCII '\u0001' as the delimiter.
wc -l output from Linux:
$ wc -l withHeader.dat
26351323 withHeader.dat
But the parser reads a total of 18554088 rows (the size of the list returned by parser.parseAll()).
Can someone explain what could be the issue?
These are my parser settings:
settings.getFormat().setLineSeparator("\n");
settings.selectFields("acctId","tcat", "transCode");
settings.getFormat().setDelimiter('\u0001');
//settings.setAutoConfigurationEnabled(true);
//settings.setMaxColumns(86);
settings.setHeaderExtractionEnabled(false);
// creates a CSV parser
CsvParser parser = new CsvParser(settings);
// parses all rows in one go.
List<String[]> allRows = parser.parseAll(newReader(filePath));
System.out.println("Total # of rows in file = " + allRows.size());
If your values can contain line separators, then the number of parsed records won't be equal to the number of lines.
If that's not the case, then it's likely you are not configuring the format correctly. You might need to configure quotes, quote escapes, etc.
My first suggestion is to try to detect the format automatically with:
settings.detectFormatAutomatically();
After parsing, check if you got the row count you expect to find. You can get what has been detected by calling:
CsvFormat detectedFormat = parser.getDetectedFormat();
Keep in mind this process is not guaranteed to work but in the majority of cases it does the trick. These features are available as of version 2.0.0.
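For reference, here is a minimal sketch of the auto-detection route (the file name and column count come from the question; treat this as a starting point rather than a definitive configuration):

import com.univocity.parsers.csv.*;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class RowCountCheck {
    public static void main(String[] args) throws IOException {
        CsvParserSettings settings = new CsvParserSettings();
        settings.detectFormatAutomatically(); // let the parser detect delimiter, quote and escape
        settings.setMaxColumns(200);          // rows in the question have up to 162 fields

        CsvParser parser = new CsvParser(settings);
        try (Reader reader = new InputStreamReader(
                new FileInputStream("withHeader.dat"), StandardCharsets.UTF_8)) {
            List<String[]> rows = parser.parseAll(reader);
            System.out.println("Total # of rows in file = " + rows.size());
            System.out.println("Detected format: " + parser.getDetectedFormat());
        }
    }
}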
If nothing helps, please attach (part of) your input file so I can take a look and update my answer.
I am writing a script that reads from a binary file, converts it to ASCII, extracts/delimits 2 columns, and writes the result out to a .txt file.
I looked at this post to implement the binary-to-ASCII step, but, the way it is implemented in my script, it seems to only perform the above process on the first row in the file.
How would I rewrite this to loop through all rows in the file?
My code is below.
# run the command script to extract the file
script.cmd
# Read the entire file to an array of bytes.
$bytes = [System.IO.File]::ReadAllBytes("filePath")
# Decode first 'n' number of bytes to a text assuming ASCII encoding.
$text = [System.Text.Encoding]::ASCII.GetString($bytes, 0, 999999)|
# only keep columns 0-22; 148-149; separate with comma delimiter
%{ "$($_[$0..22] -join ''),$($_[147..147] -join '')"} |
# convert the file to .txt
set-content path\file.txt
Also, what is a more elegant way of writing this part so it just reads the length of the string instead of pulling in up to 999999 bytes?
$text = [System.Text.Encoding]::ASCII.GetString($bytes, 0, 999999)|
You don't need to specify index and count. Simply use
[System.Text.Encoding]::ASCII.GetString($bytes).Split("`r`n",[System.StringSplitOptions]::RemoveEmptyEntries)
or
[System.Text.Encoding]::ASCII.GetString([System.IO.File]::ReadAllBytes("filePath")).Split("`r`n",[System.StringSplitOptions]::RemoveEmptyEntries)
I'm not sure why you would want to read it as bytes, when you could simply use Get-Content.
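Putting that together with the column extraction from your script, here is a minimal sketch (the path and column ranges are taken from your post; adjust as needed):

# Decode the whole file, split it into lines, then extract the two column ranges
$lines = [System.Text.Encoding]::ASCII.GetString(
             [System.IO.File]::ReadAllBytes("filePath")
         ).Split("`r`n", [System.StringSplitOptions]::RemoveEmptyEntries)

$lines |
    ForEach-Object { "$($_[0..22] -join ''),$($_[147..147] -join '')" } |
    Set-Content path\file.txt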