Is there any way I can limit record while performing TextIO? - google-cloud-dataflow

I have a use case where I'm reading billions of records, but I only need a limited number of them to inspect the data's behaviour. I have a ParDo in which I analyse the limited data and act on it. At the moment I read all billion records and then apply the limit inside the ParDo to get 10000 records. Because the pipeline reads a billion records, its performance suffers. Is there any way to limit the records while reading the text file with TextIO?

Where are you reading the records from? I think the answer depends on that.
If they all come from, e.g., the same file, then I don't think Beam supports sampling only part of them. If they come from, e.g., different files, maybe you can design the file matching pattern so that you only read some of them?

You could try using a Sample transform, like Sample.any(10000). Perhaps it will work faster.

Related

Why does dask's read_sql_table require an index_col parameter?

I'm trying to use read_sql_table from dask, but I'm facing some issues related to the index_col parameter. My SQL table doesn't have any numeric column, and I don't know what to pass as index_col.
I read in the documentation that if index_col is of type "object" I have to provide the "divisions" parameter, but I don't know what values my index_col contains before reading the table.
I'm really confused. I don't know why I have to give an index_col when using read_sql_table but not when using read_csv.
I've found in certain situations it's easiest to handle this by scattering DataFrame objects out to the cluster by way of pd.read_sql and its chunksize argument:
import pandas as pd
from dask import bag as db

# `client` is assumed to be an existing dask.distributed.Client.
sql_text = "SELECT ..."
sql_meta = {"column0": "object", "column1": "uint8"}
sql_conn = connect(...)
dfs_futs = map(client.scatter,  # Scatter each object to the cluster
               pd.read_sql(sql_text,
                           sql_conn,
                           chunksize=10_000,  # Iterate in chunks of 10,000
                           columns=list(sql_meta.keys())))
# Now join our chunks (remotely) into a single frame.
df = db.from_sequence(list(dfs_futs)).to_dataframe(meta=sql_meta)
This is nice since you don't need to handle any potential drivers/packages that would be cumbersome to manage on distributed nodes and/or situations where it's difficult to easily partition your data.
A note on performance: for my use case we use our database's external table operations to spool data out to a CSV and then read that with pd.read_csv (it's pretty much the same deal as above). A plain SELECT ... FROM ... WHERE, compared to the way Dask parallelizes and chunks up queries, can also be acceptable performance-wise, since there is a cost to performing the chunking inside the database.
Dask needs a way to read the partitions of your data independently of one another. This means being able to phrase the query for each part with a clause like "WHERE index_col >= val0 AND index_col < val1". If you have nothing numerical, dask can't guess reasonable values for you, but you can still do this if you can find a way to provide reasonable delimiters, like list(string.ascii_letters). You can also provide your own complete WHERE clauses if you must.
Note that OFFSET/LIMIT does not work for this task, because
the result is not in general guaranteed for any given inputs (this behaviour is database implementation specific)
getting to some particular offset is done by paging through the results of a whole query, so the server has to do many times the amount of work necessary
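To make the partitioning idea above concrete, here is a minimal sketch of how a list of boundary values can be turned into one WHERE clause per partition. It is not dask's actual implementation; the function name partition_clauses and the column name "name" are made up for illustration, and real code should use bound parameters rather than string formatting.

```python
import string


def partition_clauses(index_col, divisions):
    """Build one WHERE clause per partition from sorted boundary values.

    Hypothetical helper for illustration only, not part of dask.
    """
    pairs = list(zip(divisions[:-1], divisions[1:]))
    clauses = []
    for i, (lower, upper) in enumerate(pairs):
        # Close the last interval on the right so the final boundary
        # value itself is not dropped.
        op = "<=" if i == len(pairs) - 1 else "<"
        clauses.append(
            f"WHERE {index_col} >= {lower!r} AND {index_col} {op} {upper!r}"
        )
    return clauses


# Boundaries as suggested above; sorted() puts 'A'-'Z' before 'a'-'z'.
divisions = sorted(string.ascii_letters)
clauses = partition_clauses("name", divisions)
```

Each clause can be run independently of the others, which is what lets the partitions be read in parallel. Note that rows sorting outside the first and last boundary would be missed, and that the database's collation may order strings differently from Python.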

How to ensure insert rate 1 insert per second when using ClickhouseIO

I'm using Apache Beam Java SDK to process events and write them to the Clickhouse Database.
Luckily there is a ready-to-use ClickhouseIO.
ClickhouseIO accumulates elements and inserts them in batch, but because of the parallel nature of the pipeline it still results in a lot of inserts per second in my case. I'm frequently receiving "DB::Exception: Too many parts" or "DB::Exception: Too much simultaneous queries" in Clickhouse.
Clickhouse documentation recommends doing 1 insert per second.
Is there a way I can ensure this with ClickhouseIO?
Maybe some KV grouping before ClickhouseIO.Write or something?
It looks like you are not interpreting these errors quite correctly:
DB::Exception: Too many parts
It means that the insert affects more partitions than allowed (the default limit is 100; it is controlled by the max_partitions_per_insert_block setting).
So either the number of affected partitions is really large, or the PARTITION BY key was defined too granularly.
How to fix it:
try to group the INSERT batch in such a way that it contains data for fewer than 100 partitions
try to reduce the size of the insert block (if it is quite large) with withMaxInsertBlockSize
increase the max_partitions_per_insert_block limit in the SQL query (like this: INSERT .. SETTINGS max_partitions_per_insert_block=300; I think ClickhouseIO should have the ability to set custom options at query level) or on the server side by modifying the user profile settings
DB::Exception: Too much simultaneous queries
This one is controlled by the max_concurrent_queries parameter.
How to fix it:
reduce the number of concurrent queries on the Beam side
increase the limit on the server-side in userprofile- or server-settings (see https://github.com/ClickHouse/ClickHouse/issues/7765)
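As a rough illustration of the first suggestion (grouping an INSERT batch so it touches only a limited number of partitions), here is a minimal sketch in plain Python. It does not use ClickhouseIO or Beam at all; the helper batches_by_partition and the "day" partition key are assumptions made for the example.

```python
from collections import defaultdict


def batches_by_partition(rows, partition_key, max_partitions):
    """Split rows into batches that each touch at most max_partitions partitions.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[partition_key(row)].append(row)

    keys = sorted(groups)
    batches = []
    for start in range(0, len(keys), max_partitions):
        batch = []
        for key in keys[start:start + max_partitions]:
            batch.extend(groups[key])
        batches.append(batch)
    return batches


# Example: 15 rows spread over 5 daily partitions, at most 2 partitions per batch.
rows = [{"day": f"2020-01-{d:02d}", "value": v} for d in range(1, 6) for v in range(3)]
batches = batches_by_partition(rows, lambda r: r["day"], max_partitions=2)
```

In a Beam pipeline the same idea would probably be expressed by keying elements on the partition value before the write, but how that interacts with ClickhouseIO's own batching would need to be checked.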

Apache-camel Xpathbuilder performance

I have the following question. I set up a Camel project to parse certain XML files. I have to selectively extract certain nodes from a file.
I have two files, 246kb and 347kb in size. I am extracting a parent-child pair of 250 nodes in the example given above.
With the default factory the times are 77 secs and 106 secs respectively. I wanted to improve the performance, so I switched to Saxon, and the times became 47 secs and 54 secs. I was able to cut the time by up to about half.
Is it possible to cut the time further? Any other factory or optimizations I can use will be appreciated.
I am using XPathBuilder to evaluate the XPaths; here is an example. Is it possible to avoid creating an XPathBuilder repeatedly? It seems it has to be constructed for every XPath. I would prefer to have one instance and keep feeding XPaths into it; maybe that would improve performance further.
return XPathBuilder.xpath(nodeXpath)
        .saxon()
        .namespace(Consts.XPATH_PREFIX, nameSpace)
        .evaluate(exchange.getContext(), exchange.getIn().getBody(String.class), String.class);
Adding more details based on Michael's comments. I am, in effect, joining them, which will become clear with the example below. I am combining them into a JSON document.
Let's say we have the following mappings for the first and second path.
pData.tinf.rexd: bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:ReqdExctnDt/text()
pData.tinf.pIdentifi.instId://bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:CdtTrfTxInf[{1}]/bm:PmtId/bm:InstrId/text()
This would result in JSON like the following:
pData:{
  tinf: {
    rexd: <value_from_xml>
  }
  pIdentifi:{
    instId: <value_from_xml>
  }
}
Hard to say without seeing your actual XPath expression, but given the file sizes and execution time my guess would be that you're doing a join which is being executed naively as a cartesian product, i.e. with O(n*m) performance. There is probably some way of reorganizing it to have logarithmic performance, but the devil is in the detail. Saxon-EE is quite good at optimizing join queries automatically; if not, there are often ways of doing it manually -- though XSLT gives you more options (e.g. using xsl:key or xsl:merge) than XPath does.
Actually, I was able to bring the time down to 10 secs. I am using Apache Camel, so I added threads there so that multiple files can be read in separate threads. Once a file had been read, the nodes to be traversed were processed serially, one after another. I realized that this did not need to be serial, so I introduced parallelStream, and that gave it enough power. One thing to guard against is a proliferation of threads, since that can degrade performance. So I try to restrict the number of threads to two or three times the number of cores on the machine.
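The thread cap described above can be sketched as follows. This is a Python analogue, not the Java parallelStream code the answer refers to; evaluate_all and threads_per_core are hypothetical names, and str.upper merely stands in for evaluating one XPath expression.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def evaluate_all(expressions, evaluate, threads_per_core=2):
    """Run evaluate() over all expressions with a bounded number of threads.

    Hypothetical helper: the pool size is capped at a small multiple of the
    core count so that threads do not proliferate.
    """
    max_workers = max(1, (os.cpu_count() or 1) * threads_per_core)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the results.
        return list(pool.map(evaluate, expressions))


# Stand-in workload: upper-casing strings instead of evaluating XPaths.
results = evaluate_all(["a", "b", "c"], str.upper)
```

The point of the sketch is only the bound on the pool size; whether threads actually speed up the work depends on the runtime and on how much of the evaluation is CPU-bound.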

importing and processing data from a CSV File in Delphi

I had a pre-interview task, which I completed, and the solution works. However, I was marked down and did not get an interview because I used a TADODataset. I imported a CSV file to populate the dataset. The data had to be processed in a specific way, so I used filtering and sorting on the dataset to order the data the way I wanted, and then did the logic processing in a while loop. The feedback said that this was bad because it would be very slow for large files.
My main question is: if using an in-memory dataset is slow for processing large files, what would have been a better way to access the information from the CSV file? Should I have used string lists or something like that?
It really depends on how "big" the file is and on the resources available (in this case RAM) for the task.
"The feedback that was received said that this was bad as it would be very slow for large files."
CSV files are usually used for moving data around without worrying too much (if at all) about import/export, since the format is extremely simple. In most cases I've encountered the files are from ~1MB up to ~10MB, though that's not to say others would not dump more data in CSV format.
Suppose you have an 80MB CSV file. That's a file you want to process in chunks; otherwise (depending on your processing) you can eat hundreds of MB of RAM. In this case what I would do is:
while dataToProcess do begin
  // step 1
  read <X> lines from file, where <X> is the max number of lines
  you read in one go; if there are fewer lines left to process
  (i.e. you're down to 50 lines and X is 100), then you read those
  // step 2
  process information
  // step 3
  generate output, database inserts, etc.
end;
In the above case, you're not loading 80MB of data into RAM but only a few hundred KB, and the rest you use for processing, i.e. linked lists, dynamic insert queries (batch insert), etc.
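The loop above can be sketched as runnable code. This is a minimal Python illustration rather than Delphi; read_in_chunks is a hypothetical helper, and an in-memory io.StringIO stands in for the real CSV file.

```python
import io
from itertools import islice


def read_in_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size lines until the input is exhausted."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        # The final chunk may be shorter than chunk_size.
        yield chunk


# Stand-in for a CSV file: one header line plus 250 data lines.
source = io.StringIO("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(250)))
header = next(source)

# Steps 2 and 3 (processing and output) would go inside this loop;
# here we only record how many lines each chunk held.
chunk_sizes = [len(chunk) for chunk in read_in_chunks(source, 100)]
```

Only one chunk is held in memory at a time, so peak memory depends on chunk_size rather than on the size of the file.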
"...however I was marked down and did not get an interview due to having used a TADODataset."
I'm not surprised. They were probably looking to see whether you're capable of creating algorithms and providing simple solutions on the spot, without using "ready-made" solutions.
They were probably expecting to see you use dynamic arrays and write one (or more) sorting algorithms.
"Should I have used String Lists or something like that?"
The response might have been the same, again, I think they wanted to see how you "work".
The interviewer was quite right.
The correct, scalable and fastest solution for any file of medium size upwards is to use an 'external sort'.
An 'external sort' is a two-stage process. The first stage splits the file into smaller, manageable, sorted files. The second stage merges these files back into a single sorted file, which can then be processed line by line.
It is extremely efficient on any CSV file with over, say, 200,000 lines. The amount of memory the process uses can be controlled, which eliminates the danger of running out of memory.
I have implemented many such sort processes and in Delphi would recommend a combination of TStringList, TList and TQueue classes.
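The two stages can be sketched as follows. This is a minimal Python illustration, not the Delphi implementation the answer has in mind; external_sort is a hypothetical helper, and it assumes every input line ends with a newline.

```python
import heapq
import os
import tempfile


def external_sort(lines, chunk_size, key=None):
    """Yield lines in sorted order, sorting at most chunk_size lines in memory at once."""
    run_paths = []

    def flush(chunk):
        # Stage 1: sort one chunk in memory and write it out as a sorted run.
        chunk.sort(key=key)
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as run_file:
            run_file.writelines(chunk)
        run_paths.append(path)

    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            flush(chunk)
            chunk = []
    if chunk:
        flush(chunk)

    # Stage 2: merge the sorted runs, reading each run lazily line by line.
    files = [open(path) for path in run_paths]
    try:
        yield from heapq.merge(*files, key=key)
    finally:
        for run_file in files:
            run_file.close()
        for path in run_paths:
            os.remove(path)


unsorted = [f"{n}\n" for n in (5, 3, 9, 1, 7, 2, 8)]
sorted_lines = list(external_sort(unsorted, chunk_size=3, key=int))
```

Memory use is governed by chunk_size during the first stage and by the number of runs during the merge, which is what makes the approach predictable for large files.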
Good Luck

Optimizing Lucid/Solr to index large text documents

I am trying to index about 3 million text documents in Solr. About 1/3 of these files are emails that have about 1-5 paragraphs of text in them. The remaining 2/3 of the files only have a few words to a few sentences each.
It takes Lucid/Solr nearly 1 hour to fully index the entire dataset I'm working with. I'm trying to find ways to optimize this. I have set up Lucid/Solr to commit only every 100,000 files, and it indexes the files in batches of 50,000 at once. Memory isn't an issue anymore, as usage consistently stays around 1GB because of the batching.
The entire dataset has to be indexed initially. It's like a legacy system that has to be loaded to a new system, so the data has to be indexed and it needs to be as fast as possible, but I'm not sure what areas to look into to optimize this time.
I'm thinking that maybe there's a lot of little words like "the, a, because, should, if, ..." that are causing a lot of overhead and are just "noise" words. I am curious if I cut them out if it would drastically speed up the indexing time. I have been looking at the Lucid docs for a while, but I can't seem to find a way to specify what words not to index. I came across the term "stop list" but didn't see much more than a reference to it in passing.
Are there other ways to make this indexing go faster or am I just stuck with a 1 hour indexing time?
We ran into a similar problem recently. We can't use SolrJ because the request and response have to go through some other applications, so we took the following steps:
Creating a custom Solr type to stream a large text field
1. Use GZIPOutputStream/GZIPInputStream and Base64OutputStream/Base64InputStream to compress the large text. This can reduce the size of the text by about 85%, which reduces the time needed to transfer the request/response.
2. To reduce memory usage on the client side:
2.1 We use a streaming API (Gson streaming or XML StAX) to read the docs one by one.
2.2 Define a custom Solr field type, FileTextField, which accepts a FileHolder as its value. FileTextField will eventually pass a reader to Lucene. Lucene will use the reader to read the content and add it to the index.
2.3 When the text field is too big, first uncompress it to a temp file, create a FileHolder instance, then set the FileHolder instance as the field value.
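The compression step (gzip followed by Base64, so the compressed bytes can travel inside a text request) can be sketched as follows. This is a Python illustration using the standard library rather than the Java stream classes named above; compress_text and decompress_text are hypothetical helper names.

```python
import base64
import gzip


def compress_text(text):
    """Gzip the text, then Base64-encode it so it is safe to embed in a text payload."""
    return base64.b64encode(gzip.compress(text.encode("utf-8"))).decode("ascii")


def decompress_text(encoded):
    """Reverse compress_text: Base64-decode, then gunzip."""
    return gzip.decompress(base64.b64decode(encoded)).decode("utf-8")


# Stand-in for a large, repetitive email body.
body = "Dear all, please find the quarterly report attached. " * 200
encoded = compress_text(body)
```

The ratio achieved depends heavily on the text; Base64 itself adds about a third to the compressed size, so the gain comes entirely from the gzip step.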
It seems from your question that indexing time is really important for your application. Solr is a great search engine; however, if you need super fast indexing and that is a very important criterion for you, then you should go with the Sphinx search engine. It won't take much time to set up Sphinx and benchmark your results.
There may be ways to optimize (like the one you mentioned, stopwords, etc.); however, whatever you do with respect to indexing time, Solr won't be able to beat Sphinx. I have done the benchmarking myself.
I too love Solr a lot because of its ease of use, its great out-of-the-box features like N-gram indexing, faceting, multi-core, spelling correctors, and its integration with other Apache products, but when it comes to optimized algorithms (be it index size, index time, etc.), Sphinx rocks!
Sphinx is open source too. Try it out.

Resources