Reading and parsing a large .dat file - parsing

I am trying to parse a huge .dat file (4gb). I have tried with R but it just takes too long. Is there a way to parse a .dat file by segments, for example every 30000 lines? Any other solutions would also be welcomed.
This is what it looks like:
These are the first two lines with header:
ST|ZIPCODE|GEO_ID|GEO_TTL|FOOTID_GEO|NAICS2012|NAICS2012_TTL|FOOTID_NAICS|YEAR|EMPSZES|EMPSZES_TTL|ESTAB|ESTAB_F <br/>
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|001|All establishments|167| <br/>
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|212|Establishments with 1 to 4 employees|91|

This is an option to read data faster in R by using the fread function in the data.table package.
EDIT
I removed all <br/> new-line tags. This is the edited dataset
ST|ZIPCODE|GEO_ID|GEO_TTL|FOOTID_GEO|NAICS2012|NAICS2012_TTL|FOOTID_NAICS|YEAR|EMPSZES|EMPSZES_TTL|ESTAB|ESTAB_F
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|001|All establishments|167|
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|212|Establishments with 1 to 4 employees|91|
Then I matched variables with classes. You should use nrows ~ 100.
colclasses = sapply(read.table(edited_data, nrows=1, sep="|", header=T),class)
Then I read the edited data.
your_data <- fread(edited_data, sep="|", sep2=NULL, nrows=-1L, header=T, na.strings="NA",
stringsAsFactors=FALSE, verbose=FALSE, autostart=30L, skip=-1L, select=NULL,
colClasses=colclasses)
Everything worked like a charm. In case you have problems removing the tags, use this simple Python script (it will take some time for sure):
original_file = file_path_to_original_file # e.g. "/Users/User/file.dat"
edited_file = file_path_to_new_file # e.g. "/Users/User/file_edited.dat"
with open(original_file) as inp:
with open(edited_file, "w") as op:
for line in inp:
op.write(line.replace("<br/>", "")
P.S.
You can use read.table with similar optimizations, but it won't give you nearly as much speed.

Related

Filtering input file with chunksize and skiprows using line number as index in dask dataframe

I have ~70gb output of MD simulations. A pattern of a fixed-number-of-lines explanation and a fixed-number-of-lines data regularly repeat in the file. How can I read the file in Dask Dataframe chunk by chunk in which the explanation lines are ignored?
I successfully wrote a lambda function in the skiprows argument of the pandas.read_csv to ignore the explanation lines and only read the data lines. I converted the pandas-entered code to dask one but it does not work. Here you can see the dask code written by replacing pandas.read_csv with dd.read_csv:
# First extracting number of atoms and hence, number of data lines:
with open(filename[0],mode='r') as file: # The same as Chanil's code
line = file.readline()
line = file.readline()
line = file.readline()
line = file.readline() # natoms
natoms = int(line)
skiplines = 9 # Number of explanation lines repeating after nnatoms lines of data
def logic_for_chunk(index):
"""This function read a chunk """
if index % (natoms+skiplines) > 8:
return False
return True
df_chunk = dd.read_csv('trajectory.txt',sep=' ',header=None,index_col=False,skiprows=lambda x: logic_for_chunk(x),chunksize=natoms)
Here the indexes of the dataframe is line numbers of the file. Using above code, at the first chunk, lines 0 to 8 in file are ignored, then the lines 9 to 58 are read. At the next chunk, the line 59 to 67 are ignored and then a natoms-size chunk from line 68 to 117 are read. This happens until all the data snapshots are read.
Unfortunately, while the above code works well in pandas, it does not works in dask. How can I implement a similar procedure in dask dataframe?
The dask dataframe read_csv function cuts the file up at byte locations. It is unable to determine exactly how many lines are in each partition, so it is unwise to depend on the row index within each partition.
If there is some other way to detect a bad line then I would try that. Ideally you will be able to determine a bad line based on the content of the line, not on its location within the file (like every eighth line).

spaCy: optimizing tokenization

I'm currently trying to tokenize a text file where each line is the body text of a tweet:
"According to data reported to FINRA, short volume percent for $SALT clocked in at 39.19% on 12-29-17 http://www.volumebot.com/?s=SALT"
"#Good2go #krueb The chart I posted definitely supports ng going lower. Gobstopper' 2.12, might even be conservative."
"#Crypt0Fortune Its not dumping as bad as it used to...."
"$XVG.X LOL. Someone just triggered a cascade of stop-loss orders and scooped up morons' coins. Oldest trick in the stock trader's book."
The file is 59,397 lines long (a day's worth of data) and I'm using spaCy for pre-processing/tokenization. It's currently taking me around 8.5 minutes and I was wondering if there were any way of optimising the following code to be quicker as 8.5 minutes seems awfully long for this process:
def token_loop(path):
store = []
files = [f for f in listdir(path) if isfile(join(path, f))]
start_time = time.monotonic()
for filename in files:
with open("./data/"+filename) as f:
for line in f:
tokens = nlp(line.lower())
tokens = [token.lemma_ for token in tokens if not token.orth_.isspace() and token.is_alpha and not token.is_stop and len(token.orth_) != 1]
store.append(tokens)
end_time = time.monotonic()
print("Time taken to tokenize:",timedelta(seconds=end_time - start_time))
return store
Although it says files, it's currently only looping over 1 file.
Just to note, I only need this to tokenize the content; I don't need any extra tagging etc.
It sounds like you haven't optimised the pipeline yet. You'll get a significant speed up from disabling the pipeline components you don't need, like so:
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])
This should get you down to about the two-minute mark, or better, on its own.
If you need a further speed up, you can look at multi-threading using nlp.pipe. Docs for multi-threading are here:
https://spacy.io/usage/processing-pipelines#section-multithreading
You can use nlp.pipe(all_lines) instead of nlp(line) for a faster processing
see Spacy's documentation - https://spacy.io/usage/processing-pipelines

How can I generate a single .avro file for large flat file with 30MB+ data

currently two avro files are getting generated for 10 kb file, If I follow the same thing with my actual file (30MB+) I will n number of files.
so need a solution to generate only one or two .avro files even if the source file of large.
Also is there any way to avoid manual declaration of column names.
current approach...
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Manual schema declaration of the 'co' and 'id' column names and types
val customSchema = StructType(Array(
StructField("ind", StringType, true),
StructField("co", StringType, true)))
val df = sqlContext.read.format("com.databricks.spark.csv").option("comment", "\"").option("quote", "|").schema(customSchema).load("/tmp/file.txt")
df.write.format("com.databricks.spark.avro").save("/tmp/avroout")
// Note: /tmp/file.txt is input file/dir, and /tmp/avroout is the output dir
Try specifying number of partitions of your dataframe while writing the data as avro or any format. To fix this use repartition or coalesce df function.
df.coalesce(1).write.format("com.databricks.spark.avro").save("/tmp/avroout")
So that it writes only one file in "/tmp/avroout"
Hope this helps!

Lua true Binary I/O

I read this question, and I checked it myself.
I use the following snippet:
f = io.open("file.file", "wb")
f:write(1.34)
f:close()
This creates the file, into which 1.34 is written. This is same as : 00110001 00101110 00110011 00110100 , that is binary codes for the digit 1, the decinal point, then 3 and finally 4.
However, I would like to have printed 00111111 10101100 11001100 11001101, which is a true float representation. How do I do so?
You may need to convert it to binary representation, using something similar to this answer. This discussion on serialization of lua numbers may also be useful.

Parsing a file with BodyParser in Scala Play20 with new lines

Excuse the n00bness of this question, but I have a web application where I want to send a potentially large file to the server and have it parse the format. I'm using the Play20 framework and I'm new to Scala.
For example, if I have a csv, I'd like to split each row by "," and ultimately create a List[List[String]] with each field.
Currently, I'm thinking the best way to do this is with a BodyParser (but I could be wrong). My code looks something like:
Iteratee.fold[String, List[List[String]]]() {
(result, chunk) =>
result = chunk.splitByNewLine.splitByDelimiter // Psuedocode
}
My first question is, how do I deal with a situation like the one below where a chunk has been split in the middle of a line:
Chunk 1:
1,2,3,4\n
5,6
Chunk 2:
7,8\n
9,10,11,12\n
My second question is, is writing my own BodyParser the right way to go about this? Are there better ways of parsing this file? My main concern is that I want to allow the files to be very large so I can flush a buffer at some point and not keep the entire file in memory.
If your csv doesn't contain escaped newlines then it is pretty easy to do a progressive parsing without putting the whole file into memory. The iteratee library comes with a method search inside play.api.libs.iteratee.Parsing :
def search (needle: Array[Byte]): Enumeratee[Array[Byte], MatchInfo[Array[Byte]]]
which will partition your stream into Matched[Array[Byte]] and Unmatched[Array[Byte]]
Then you can combine a first iteratee that takes a header and another that will fold into the umatched results. This should look like the following code:
// break at each match and concat unmatches and drop the last received element (the match)
val concatLine: Iteratee[Parsing.MatchInfo[Array[Byte]],String] =
( Enumeratee.breakE[Parsing.MatchInfo[Array[Byte]]](_.isMatch) ><>
Enumeratee.collect{ case Parsing.Unmatched(bytes) => new String(bytes)} &>>
Iteratee.consume() ).flatMap(r => Iteratee.head.map(_ => r))
// group chunks using the above iteratee and do simple csv parsing
val csvParser: Iteratee[Array[Byte], List[List[String]]] =
Parsing.search("\n".getBytes) ><>
Enumeratee.grouped( concatLine ) ><>
Enumeratee.map(_.split(',').toList) &>>
Iteratee.head.flatMap( header => Iteratee.getChunks.map(header.toList ++ _) )
// an example of a chunked simple csv file
val chunkedCsv: Enumerator[Array[Byte]] = Enumerator("""a,b,c
""","1,2,3","""
4,5,6
7,8,""","""9
""") &> Enumeratee.map(_.getBytes)
// get the result
val csvPromise: Promise[List[List[String]]] = chunkedCsv |>>> csvParser
// eventually returns List(List(a, b, c),List(1, 2, 3), List(4, 5, 6), List(7, 8, 9))
Of course you can improve the parsing. If you do, I would appreciate if you share it with the community.
So your Play2 controller would be something like:
val requestCsvBodyParser = BodyParser(rh => csvParser.map(Right(_)))
// progressively parse the big uploaded csv like file
def postCsv = Action(requestCsvBodyParser){ rq: Request[List[List[String]]] =>
//do something with data
}
If you don't mind holding twice the size of List[List[String]] in memory then you could use a body parser like play.api.mvc.BodyParsers.parse.tolerantText:
def toCsv = Action(parse.tolerantText) { request =>
val data = request.body
val reader = new java.io.StringReader(data)
// use a Java CSV parsing library like http://opencsv.sourceforge.net/
// to transform the text into CSV data
Ok("Done")
}
Note that if you want to reduce memory consumption, I recommend using Array[Array[String]] or Vector[Vector[String]] depending on if you want to deal with mutable or immutable data.
If you are dealing with truly large amount of data (or lost of requests of medium size data) and your processing can be done incrementally, then you can look at rolling your own body parser. That body parser would not generate a List[List[String]] but instead parse the lines as they come and fold each line into the incremental result. But this is quite a bit more complex to do, in particular if your CSV is using double quote to support fields with commas, newlines or double quotes.

Resources