Compare two data frames using compare(): the result index should not be row numbers but the unique values of a particular column in the data frame

final_output = data_to_compare_1.compare(data_to_compare_2,align_axis=0).rename(index={"self":"old_extract","other":"new_extract"})
I compared two data frames and the results are really good, but the index shows row numbers [12, 38, 39, ...]. Instead, I want a particular column of the data frame (its unique values) as the index.
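One way to get this, as a minimal sketch: set the key column as the index on both frames before calling compare(), so the reported index labels are that column's values rather than positional row numbers. The column name "id" and the toy data below are assumptions for illustration.

import pandas as pd

# Hypothetical frames sharing a key column "id" whose values identify each row.
data_to_compare_1 = pd.DataFrame({"id": ["a", "b", "c"], "value": [1, 2, 3]})
data_to_compare_2 = pd.DataFrame({"id": ["a", "b", "c"], "value": [1, 5, 3]})

# Setting "id" as the index first makes compare() report those values
# instead of the default integer row labels.
final_output = (
    data_to_compare_1.set_index("id")
    .compare(data_to_compare_2.set_index("id"), align_axis=0)
    .rename(index={"self": "old_extract", "other": "new_extract"})
)
print(final_output)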

Related

Read parquet in chunks according to ordered column index with pyarrow

I have a dataset composed of multiple parquet files clip1.parquet, clip2.parquet, .... Each row corresponds to a point in some frame, and there is an ordered column frame specifying the corresponding frame: 1,1,...,1,2,2,...,2,3,3,...,3,.... There are several thousand rows per frame, but the exact number is not necessarily the same. Frame numbers do not reset between clips.
What is the fastest way to iteratively read all rows belonging to one frame?
Loading the whole dataset into memory is not possible. I assume a standard row filter will check against all rows, which is not optimal (I know they are ordered by frame). I was thinking it might be possible to match a row group to each frame, but I wasn't sure if that is good practice or even possible with different-sized groups.
Thanks!
It is reasonable in your case to treat the frame column as your index, and you can specify this when loading. If you scan the metadata of all the files (this is fast for local data, but not enabled by default), Dask will know the min and max frame values for each file. Therefore, selecting on the index will only read the files that contain at least some matching values.
import dask.dataframe as dd

df = dd.read_parquet("clip*.parquet", index="frame", calculate_divisions=True)
df[df.index == 1]  # do something with this
Alternatively, you can specify filters in read_parquet if you want even more control; in that case you would make a new dataframe object for each iteration, as in the sketch below.
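A rough sketch of that filters approach (the range of frame numbers is an assumption; substitute whatever frame ids your clips actually contain):

import dask.dataframe as dd

for frame_id in range(1, 101):  # hypothetical range of frame numbers
    # Parquet statistics let the reader skip files/row groups whose
    # frame values cannot match the filter.
    part = dd.read_parquet("clip*.parquet", filters=[("frame", "==", frame_id)])
    rows = part.compute()
    # ...process the rows belonging to this frame...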
Note, however, that a groupby might do what you want without having to iterate over the frame numbers at all. Dask is pretty smart about loading only part of the data at a time and aggregating partial results from each partition. How well this works depends on how complicated an algorithm you want to run on each row set.
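For example, a minimal sketch of the groupby route (the per-frame aggregation here is just a placeholder; apply whatever reduction you actually need):

import dask.dataframe as dd

df = dd.read_parquet("clip*.parquet")
# Dask aggregates partial results partition by partition, so the whole
# dataset never has to fit in memory at once.
points_per_frame = df.groupby("frame").size().compute()
print(points_per_frame)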
I should mention that both parquet backends support all of these options; you don't specifically need pyarrow.

Get index of second match in Google Sheets

In a sheet made of names and scores, I am trying to build a sheet that displays the names of the people with the best scores.
To do so:
1. I sort the scores.
2. I get the index of the biggest value.
3. I offset the names list with the given index.
When I want to get the second biggest value, I only have to get the index of the second biggest value in step 2.
There is a problem if two values are tied for the biggest, as MATCH() will always give me the index of the first value found.
I thought of determining the index of the biggest value, then excluding this index from the range used to determine the second biggest value, but I could not achieve it as the range lengths may be different.
I also thought of using a function or script that returns the Nth index that meets a criterion from a range, but I did not find anything to do so.
Here is an example spreadsheet
https://docs.google.com/spreadsheets/d/1RrUpAjbMBze9L5OqxdyEWBnYXq98LtohdgROF8s68FI/edit?usp=sharing
One way is to add a column number and sort on that as well as on the score, then take the second element in the list:
=ArrayFormula(index(sort(transpose({B1:F3;column(B1:F3)}),3,false,4,true),2,1))
Note that the headers (players' names) are sorted along with their scores.
EDIT
Actually, SORT in Google Sheets is a stable sort (in other words, according to the documentation, 'range is sorted only by the specified columns, other columns are returned in the order they originally appear'), so this is sufficient:
=ArrayFormula(index(sort(transpose(B1:F3),3,false),2,1))

Get Range of Cells Value as Display using Microsoft.Office.Interop.Excel

I am programming in C# to access data from a range of cells in an Excel spreadsheet.
I can use the following code to access and return the values of a range of cells into an object array.
object[,] values = (object[,])Mysheet.UsedRange.get_Value(XlRangeValueDataType.xlRangeValueDefault);
However, I would like to find a way to return all the data as strings (exactly as shown on the spreadsheet) into a string array, or to put the values as text into the object array. Is there any mechanism to do that?
Did you try using .Text on the Range object? As far as I know, you will have to iterate over each cell and read it for each of them.
Note that .Text is considerably heavier in terms of performance than Value or Value2.
Also note that it is tricky: .Text returns the text as you would see it if Excel were visible, so if you have a huge number in a column with a short width, what .Text gives you is a lot of ####.
Sadly I can't think of another way to get it. Usually I get the raw values and format them properly once I have them all, but that assumes I know which format is used in which cells.
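A minimal sketch of the cell-by-cell approach, assuming Mysheet is the Microsoft.Office.Interop.Excel.Worksheet from the question (the other variable names are made up for illustration):

using Excel = Microsoft.Office.Interop.Excel;

// Read every cell of the used range as the text Excel would display.
Excel.Range used = Mysheet.UsedRange;
int rowCount = used.Rows.Count;
int colCount = used.Columns.Count;
string[,] displayed = new string[rowCount, colCount];

for (int r = 1; r <= rowCount; r++)
{
    for (int c = 1; c <= colCount; c++)
    {
        // .Text returns the formatted value exactly as shown in the sheet,
        // including "####" when the column is too narrow to display it.
        Excel.Range cell = (Excel.Range)used.Cells[r, c];
        displayed[r - 1, c - 1] = (string)cell.Text;
    }
}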

How to check the size of PFObject

Parse has a limit of 128KB per PFObject. I'm creating a PFObject with an array of geo locations (doubles), which I suspect will eventually overflow this 128KB limit. How can I detect the size, so that I can, for example, split the array across multiple PFObjects or store it as a PFFile?
To detect the size, find the size of your object with an empty array (retrieve it and perform a size check) and the size with one entry. The empty version gives you the base size; subtract that from 128KB and divide by the cost per entry to get the maximum number of entries.
Do a test to make sure this stores correctly just below the max limit (and fails above it).
It's dangerous to have no limit, so I would figure out how large an array you can store within the limit and check against that when adding. If you go over the limit, you will need to use another object. If the objects share a common key field, then your query will return both (or more) objects; concatenate their arrays to get all the values, and only write new data to the object whose array length is under the limit.
You could also store each coordinate pair as its own row and ignore the 128KB limit.
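As a back-of-the-envelope illustration of that sizing math (the byte counts below are hypothetical; measure them for your own object):

// Hypothetical measured sizes; replace with your own measurements.
let limitBytes = 128 * 1024        // Parse's per-object cap
let emptyObjectBytes = 2_048       // object with an empty locations array
let oneEntryBytes = 2_064          // object with a single coordinate pair
let bytesPerEntry = oneEntryBytes - emptyObjectBytes
let maxEntries = (limitBytes - emptyObjectBytes) / bytesPerEntry
// maxEntries == 8064 here: check against this before appending,
// and spill to a new object (or a PFFile) once you reach it.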

Save coordinates for a graph in a table and then remove some of them in random order: what is the best approach?

I have a situation and it goes like this:
I need to print out coordinates like the ones used in a maths graph, so (0,0) (0,1) (0,2) and so on. If the length is specified as 10 and the breadth as 20, then the graph region will be all the points from (0,0) to (10,20).
I wish to store these values in a table so that they can be printed out in order.
Later on, some of these values will be removed. Suppose the values removed are (4,5) (4,6) (4,7); then the main table that was created earlier should no longer contain these values, and I need to be able to print out the new table with the remaining values.
Till now I have only done the coding to ask for the length and the breadth values.
How should I go ahead with the rest of this?
In case you need any clarification or the question is too confusing, please leave a comment and I will try to make it better.
Any help will be very highly appreciated.
Thank you
There are a few ways of doing this depending on what you want.
The easy way is to use an array of arrays like this:
a = Array.new(11) {Array.new(21) {0}}
This creates an array like a[0][0] to a[10][20], with every item initialized to 0.
To remove an item, set it to nil:
a[4][5] = nil
When you print the array, skip any nil values:
for x in 0..10
  for y in 0..20
    next if a[x][y].nil?
    puts a[x][y]
  end
end
If your graph is very large, read about a "sparse matrix", which is how tools like Excel store many cells while using less RAM for the blank ones:
http://en.wikipedia.org/wiki/Sparse_matrix
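A minimal sketch of a related idea in Ruby, using a Hash keyed by coordinate pairs so that removing a point is just deleting its key (the stored values here are placeholders):

# Store coordinates as hash keys instead of a full nested array.
points = {}
(0..10).each do |x|
  (0..20).each { |y| points[[x, y]] = 0 }
end

# Removing a point is just deleting its key.
points.delete([4, 5])

# Print the remaining coordinates in order.
points.keys.sort.each { |x, y| puts "(#{x},#{y})" }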
