Allow User to Extract Data Dumps From DW - data-warehouse

We use Synapse in Azure as our warehouse and build Power BI reports for our users on top of it. We currently have a request to move all of the data dumps from our production system onto our warehouse DB, as some of them cause performance issues in production when run. We've been looking to redo these as reports in Power BI; however, in some instances we still need to provide the "raw" data in CSV/Excel format. This has raised an issue: some of these extracts are above 150k rows, so we can't use Power BI to provide the extract because it limits the number of rows it can export.
Our solution would be to build a process that runs against the DB and writes a file to SharePoint for the user to consume. We can do that; however, we're unsure how to let the user trigger the extract. One option I was considering is Power Apps, but I'm wondering if there is an easier way someone here might be able to suggest. I just need to provide pages with various buttons that, when clicked, trigger extracts from Azure to SharePoint, and which can be controlled by security in some way. Any advice would be appreciated.

Paginated Report Export doesn't have that row limit.
See, e.g.:
https://learn.microsoft.com/en-us/power-bi/collaborate-share/service-automate-paginated-integration
Or you can use an ADF (Azure Data Factory) Copy Activity to create .csv extracts.
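If you do end up building the extract process yourself, as the question proposes, here is a minimal sketch in Python of streaming a large result set to CSV in batches, so the full extract never has to sit in memory. It uses the standard-library sqlite3 module purely as a stand-in for the warehouse connection; the function name export_query_to_csv, the sales table and the batch size are all hypothetical, and this is not how Paginated Reports or ADF work internally.

```python
import csv
import io
import sqlite3


def export_query_to_csv(conn, query, out, batch_size=50_000):
    """Stream the rows of `query` to the file-like `out` as CSV; return the row count."""
    cur = conn.cursor()
    cur.execute(query)
    writer = csv.writer(out)
    # Header row taken from the DB-API cursor metadata.
    writer.writerow([col[0] for col in cur.description])
    total = 0
    while True:
        # Fetch a bounded batch so memory use stays flat for large extracts.
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
    return total


# Demo: an in-memory SQLite database stands in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(i, i * 1.5) for i in range(200_000)],
)

buffer = io.StringIO()
row_count = export_query_to_csv(
    conn, "SELECT id, amount FROM sales ORDER BY id", buffer
)
```

In a real process, `out` would be a file opened for writing (or a stream uploaded to SharePoint) rather than an in-memory buffer.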

Chatbot creation on GCP with data on Google Cloud Storage

I have a requirement to build an interactive chatbot to answer queries from users.
We receive source files from several source systems, and we keep a log of when each file arrived, when it was processed, etc. in a CSV file on Google Cloud Storage. Every 30 minutes a CSV is generated that logs any new files that have arrived and been stored on GCP.
Users keep asking by email whether files have arrived, which files are still to come, etc.
If we could build a chatbot that reads the CSV data on GCS and answers user queries, it would greatly improve response times.
Can this be achieved with a chatbot?
If so, please suggest the most suitable tools/programming language to achieve this.
You can achieve what you want in several ways. It all depends on your requirements for response time and on the CSV size.
Use BigQuery and an external table (also called a federated table). When you define it, you can point it at a file (or a file pattern) in GCS, such as a CSV. Then you can query your data with a simple SQL query. This solution is cheap and easy to deploy, but BigQuery has latency (it depends on your file size, but a query can take several seconds).
Use a Cloud Function and Cloud SQL. Trigger a function on the event raised when the new CSV file is generated. The function parses the file and inserts the data into Cloud SQL. Be careful: the function can run for at most 9 minutes and can be assigned at most 2 GB of memory. If your file is too large, you can exceed these limits (time and/or memory). The main advantage is latency (set the correct index and your query is answered in milliseconds).
Use nothing! In the fulfillment endpoint, fetch your CSV file, parse it, and find what you want. Then release it. Here you deploy nothing extra, but the latency is terrible, the processing is heavy, and you have to repeat the file download and parse on every request. It's an ugly solution, but it can work if your file is small enough to fit in memory.
We can also imagine a more complex solution with Dataflow, but I feel that isn't your target.
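To make the third ("use nothing") option concrete, here is a minimal sketch in Python of parsing the log CSV and answering a "has this file arrived?" query. The column names (file_name, arrived_at, processed_at), the sample rows and the function file_status are assumptions for illustration; in a real fulfillment endpoint the CSV text would be downloaded from GCS on each request instead of coming from a constant.

```python
import csv
import io

# Hypothetical log layout; the real CSV columns are not given in the question.
SAMPLE_LOG = """file_name,arrived_at,processed_at
orders_20240101.csv,2024-01-01T06:00:00,2024-01-01T06:05:00
customers_20240101.csv,2024-01-01T06:30:00,
"""


def file_status(csv_text, file_name):
    """Return a human-readable status for `file_name` based on the log CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row["file_name"] == file_name:
            if row["processed_at"]:
                return (
                    f"{file_name} arrived at {row['arrived_at']} "
                    f"and was processed at {row['processed_at']}."
                )
            return (
                f"{file_name} arrived at {row['arrived_at']} "
                "but has not been processed yet."
            )
    # No matching row means the file is not in the log.
    return f"{file_name} has not arrived yet."


print(file_status(SAMPLE_LOG, "orders_20240101.csv"))
print(file_status(SAMPLE_LOG, "invoices_20240101.csv"))
```

The same lookup becomes a single SQL query if the data is exposed through a BigQuery external table or loaded into Cloud SQL, which is where the latency trade-offs above come in.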

Adobe Analytics | Merge data from multiple report suites

We are capturing information for consumer sites in multiple different report suites.
Is it possible to merge all these data to a parent report suite without adding that parent report suite's account id in s_account variable?
For example
Site 1 uses report-suite1
s_account = "report-suite1";
Site 2 uses report-suite2
s_account = "report-suite2"
Instead of using
s_account = "report-suite1,report-suite2"
is it possible to merge the data to a 3rd virtual account from the Reports console itself?
The only way you can route data to a separate, fully fledged report suite is either via JavaScript (e.g. setting s_account as you have shown in your post) or by asking Adobe to create a VISTA rule.
You didn't state your reasons for not wanting to add a "global" rsid to your JS code. Is it because you don't have the technical resources/ability to do it? If so, and if you want a full third rsid for all the data to go to, then you can ask Adobe to create a VISTA rule. It should be fairly easy for them to set up, but they will charge you for it. And I think they will create one for each report suite. I don't generally recommend going this route unless you really have to, though, mostly because of the cost, but also because you don't have personal visibility into it.
Alternatively, if you do have the tech resources to update the js code, but the cost of throwing another rsid into the mix is an issue (from extra server hits), then you may want to consider replacing all of your report suites with a single global report suite, e.g.
s_account='report-global';
Then, create a Virtual Report Suite for each site. You can go to Components > Virtual Report Suites to set them up. The TL;DR is you create them by pointing at your report-global rsid as the source and then creating a segment based off something unique to the site (e.g. the domain, or maybe some eVar with a site-specific value).
The major downside to going the virtual report suite route is that historical data from your previous report suites will not be available in the same place as this new global report suite and its virtual report suites. But it's a "one-time migration" thing, and the historical data won't be lost; you'll just have some extra work on your end referencing it in the old rsids, especially if you want to compare historical to current data in the new (virtual) rsids.
The second major thing to consider is unique limits. I'm not sure how much traffic your sites get or how many unique values your variables collect, but there is a monthly unique value limit you may have to consider with all of the sites going to the same report suite. Beyond looking at tricks to make values less unique on a case-by-case basis (e.g. removing the query string from URLs), there isn't a good way to solve this except to stick with separate rsids. Well, Adobe will increase the unique limit on certain variables if you ask them, but it will cost you.
Another alternative to consider is a Rollup report suite. Go to Admin > Report Suites, where your current report suites are listed. To the left you should see Rollups and an Add link next to it. This will create a Rollup report suite made up of data from one or more report suites.
Note, though, that a Rollup report suite is not the same as a full-fledged report suite. Please refer to the link above for full details/limitations, but the main benefit is that it won't cost you anything except the couple of minutes it takes to set up in the interface. As for the limitations, the main points of note are that you only get aggregated data, data is not deduplicated between the rsids, and many reports are limited or not available. In practice, I rarely see anybody actually go this route because it's too limited. But hey, maybe it's good enough for you.

Database to store & process client logs efficiently

So the context is that I have a client application that generates logs and I want to occasionally upload this data to a backend. The backend will function as an analytics server, storing, processing and displaying this data - so as you can imagine there will be some querying involved.
In terms of data collection peak load, I expect to have about 5k clients, each generating about 50 - 100 lines per day, and I'd like the solution I'm tackling to be able to process that kind of data. If you do the math, that's upwards of 1 million log lines / month.
In terms of data analytics load, it will be fairly low - I expect a couple of us (admins) to run queries to harvest some info once a week or so from all the logs.
My application is currently running RoR + Postgres, though I'm open to using a different DB if it maps better to my needs. Current contenders in my head are MongoDB & Cassandra, but I don't really want to leave Postgres if it can scale to get the job done.
I'd recommend a purpose-built tool like Logstash for this:
http://logstash.net/
Another alternative would be Apache Flume:
http://flume.apache.org/
In my experience, once you have a lot of logs you will need a search engine, rather than a database, for troubleshooting and analysis. (A search engine will be faster than a database for this kind of searching.)
For now, I am using the Logstash + Elasticsearch + Kibana stack to build my log system.
Logstash is a tool that parses logs and makes them more human-readable.
Elasticsearch is a search engine that indexes and searches your logs.
Kibana is a web UI that you use to interact with Elasticsearch.
This is a Kibana demo website that you can visit: http://demo.kibana.org/
It provides a search interface and analysis tools such as pie charts, tables, etc.
In my project, my application generates over 1.5 million log lines per day, and this log system handles all of them.
Enjoy it.
If you are looking for a database solution that will grow with requests, then I would recommend looking beyond Postgres.
Cassandra is really well-suited for time-series data, though key-value stores are not suited for ad-hoc analytics. One idea could be to store your logs in Cassandra, and then roll them up into a different system later.
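As a rough illustration of the "roll them up later" idea, here is a minimal sketch in Python that aggregates raw log records into per-day, per-level counts, which could then be loaded into whatever system you use for ad-hoc analytics. The record layout (timestamp, client id, level) and the function roll_up_daily are hypothetical; in practice the records would be read from the log store rather than from a list.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw log records: (ISO timestamp, client id, log level).
RAW_LOGS = [
    ("2024-01-01T09:15:00", "client-1", "ERROR"),
    ("2024-01-01T17:40:00", "client-2", "ERROR"),
    ("2024-01-01T18:00:00", "client-1", "INFO"),
    ("2024-01-02T08:05:00", "client-2", "INFO"),
]


def roll_up_daily(logs):
    """Count log records per (day, level) pair."""
    counts = Counter()
    for timestamp, _client_id, level in logs:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        counts[(day, level)] += 1
    return counts


daily = roll_up_daily(RAW_LOGS)
for (day, level), count in sorted(daily.items()):
    print(day, level, count)
```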
For straightforward storing-and-displaying of data, take a look at Graphite, a realtime graphing project.
You can create your own custom graphs with Graphite, and save them as dashboards.

Acceptance Tests for a Windows Service

I'm writing a Windows service that processes a number of different RSS news feeds at regular intervals. These news items will be saved into our database and associated with different objects in the system.
Although there is a set specification on what needs to happen, there is no UI component for the customer to verify.
What's the best way to write acceptance tests for something like this?
Should I create some simple web pages that display a summary of data that needs to be verified?
Since the data is stored in a database, the customer can verify it by reading the database with an IDE or by dumping the data to Excel/CSV.
I would recommend against doing a lot of extra work to make it possible for them to verify the results, because they may end up testing the verification procedure more than the real underlying program.
For internal testing, we often rely on logging. We tell testers which log entries indicate good or bad results.

multiple db connections vs. centralized/redundant db

I have a project to create a dashboard that will connect to existing systems as well as create new features based on combining data from the existing systems. For example, the dashboard will be able to generate "orders" containing data merged from "members" (MS Access DB), "employees" (MySQL DB) and "products" (flat file), and there will also be new attributes particular to "orders."
At first I thought it would be most efficient to have my application connect to each of the systems separately and perform cross-vendor joins between the different databases. But then I thought that creating a centralized/redundant db (built with scripts pushing and pulling data between the systems) might also be useful because it would empower some semi-technical staff to use products like OOBase, which can only make a single connection.
Are there any other advantages to creating a centralized/redundant DB like the one I'm talking about? Or are multiple direct connections the best approach?
Thanks in advance for any tips.
To give you a short answer: yes, you want a central data store.
You don't want to run complex reports on your live database. As your live database grows, you will want to do some housekeeping and clean it up, but keep the data for analysis.
You will also want the data to be aggregated so you can perform historical analysis.
Data that comes from different sources will require some clean-up. You will also need to know how to link your data together, and there are quite a lot of things like that you will have to be aware of to do the job properly.
You might consider reading up on data warehousing and business intelligence (both have Wikipedia articles).
If you want to add 'new features' to this system, you could also look up orchestration (also on Wikipedia). It will allow you to link your heterogeneous business processes together.
All of these are quite specialized and complex disciplines in their own right, so you might want to consult a specialist.
Be very, very careful about copying lots of data around. If you do, here are some important guidelines:
Make sure that one system is defined as the master and no other system may tamper with the data.
Always copy data from the master to the slaves.
When you copy the data, use a checksum of some kind to make sure all data has been copied. Make sure you can handle "yesterday, the copy failed".
If a slave must make a change, push the change to the master and then use the standard "update" path to merge it back to the slave. Avoid "save change on slave and update the master some time in the future".
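The checksum guideline above can be sketched as follows. This is a minimal Python illustration, assuming rows are available as tuples on both sides; the functions table_checksum and copy_is_complete and the sample rows are hypothetical, and a real system might instead compute the checksum inside each database.

```python
import hashlib


def table_checksum(rows):
    """Order-independent SHA-256 checksum over a collection of row tuples."""
    digest = hashlib.sha256()
    # Sort first so the result does not depend on the order rows were read in.
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def copy_is_complete(master_rows, slave_rows):
    """True when the slave holds exactly the same rows as the master."""
    return (
        len(master_rows) == len(slave_rows)
        and table_checksum(master_rows) == table_checksum(slave_rows)
    )


# Hypothetical sample data.
master = [(1, "alice"), (2, "bob"), (3, "carol")]
complete_copy = list(reversed(master))   # same rows, different order
partial_copy = master[:2]                # the copy stopped early
tampered_copy = [(1, "alice"), (2, "bob"), (3, "carl")]  # a value was changed

print(copy_is_complete(master, complete_copy))
print(copy_is_complete(master, partial_copy))
print(copy_is_complete(master, tampered_copy))
```

Running a check like this after every copy, and alerting when it fails, is one way to handle the "yesterday, the copy failed" case: the next run can detect the mismatch and recopy from the master.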
