I am dealing with large (20-100GB) tab-delimited text files, which I am able to import correctly into pandas with the index_col=False option. The Dask dataframe does not support the index_col parameter. I am able to work around this, but I am curious whether there is a smarter way to deal with it. Without the index_col=False option, pandas and dask read the columns shifted one position to the right, i.e. Col1 is aligned to data column 2, etc. I believe this is because the data rows have a trailing tab but the header does not.
Here is a sample of the file - ^I is a tab and $ a line ending.
Col1^ICol2^ICol3^ICol4^ICol5^ICol6^ICol7^ICol8$
0^I0^ICODE^I-0.2628^I3.041e-001^I.^I0^I2.1213^I$
Update:
This shows the behavior I want and do not want.
TESTDATA = """Index1\tIndex2\tCodeId\tCol4\tCol5\tCol6\tCol7\tCol8
0\t0\tCODE\t-0.2628\t3.041e-001\t.\t0\t2.1213\t
"""
TESTFILE = "test.txt"
with open(TESTFILE, "w") as text_file:
    text_file.write(TESTDATA)
import pandas as pd
df = pd.read_csv(TESTFILE,sep='\t')
print("INCORRECT - Col8 is NaN")
print(df.head())
df = pd.read_csv(TESTFILE,sep='\t', index_col=False)
print("CORRECT - index and code correct and Col4-8 correct")
print(df.head())
Results:
INCORRECT - Col8 is NaN
Index1 Index2 CodeId Col4 Col5 Col6 Col7 Col8
0 0 CODE -0.2628 0.3041 . 0 2.1213 NaN
CORRECT - index and code correct and Col4-8 correct
Index1 Index2 CodeId Col4 Col5 Col6 Col7 Col8
0 0 0 CODE -0.2628 0.3041 . 0 2.1213
Dropping the first column just removes a column which I actually need.
import dask.dataframe as dd
ddf = dd.read_csv(TESTFILE,sep='\t')
print(ddf.compute())
ddf = dd.read_csv(TESTFILE,sep='\t')
res = ddf.drop(columns=['Index1'])
print(res.compute())
Results for both are not what I am looking for.
Also, thanks for your answer, my workaround is similar to the one you propose. It's just that as I share dask with colleagues I hope there are minimal behavior differences between pandas and dask dfs - just checking if I miss something.
Finally, here is my own workaround:
# Handle the trailing-delimiter issue, which is more
# complicated with dask since index_col cannot be used.
def ddf_cols_mod(file, sep):
    # Read only the header row with pandas; dd.read_csv does not accept nrows.
    df = pd.read_csv(file, sep=sep, nrows=0)
    cols = list(df.columns)
    # Extra name for the empty field created by the trailing tab.
    cols.append('None')
    return cols

ddf = dd.read_csv(file,
                  sep='\t',
                  header=0,
                  names=ddf_cols_mod(file, '\t'))
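The column shift comes from a field-count mismatch between the header and the data rows, which can be checked without pandas or dask. Below is a minimal sketch using only the standard library, assuming the same two-line sample as TESTDATA; the name `_trailing` is a hypothetical placeholder for the extra column, not something from the original file.

```python
import csv
import io

# Same sample as TESTDATA: the data row ends with a tab, the header does not.
SAMPLE = (
    "Index1\tIndex2\tCodeId\tCol4\tCol5\tCol6\tCol7\tCol8\n"
    "0\t0\tCODE\t-0.2628\t3.041e-001\t.\t0\t2.1213\t\n"
)

header, first_row = list(csv.reader(io.StringIO(SAMPLE), delimiter="\t"))

# The trailing tab produces one extra, empty field in the data row.
extra_fields = len(first_row) - len(header)

# Padding the header with a dummy name (hypothetical) realigns names and values.
padded_header = header + ["_trailing"] * extra_fields
record = dict(zip(padded_header, first_row))

print(len(header), len(first_row), extra_fields)
print(record)
```

Once the header is padded to the same width as the data, every name lines up with its own value, which is the same idea as passing a longer `names` list to the reader.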
I'm trying to query/filter rows from a dataset structured like this:
Creator  | Title    | Barcode | Inv. No.
---------|----------|---------|----------
springer | Cellbio  | 014678  | POL02P14x
springer | Cellbio  | 026938  | POL02P26r
springer | Cellbio  | 038745  |
nature   | Cellular | 026672  | POL02P26h
elsevier | Biomed   | 026678  | POL02P26g
elsevier | Biomed   | 026678  | POL02P26g
spring   | Cellbit  |         | POL02P147
spring   | Cellbit  | 026938  | POL02P26j
spring   | Cellbit  | 038745  |
I need to return all rows where the string in column B (Title) is duplicated and where, among those duplicate rows, at least one value in column C (Barcode) starts with 014 and at least one starts with 026. If this criterion is not met in column C, the next check would be a similar one on column D (Inv. No.): at least one value starts with POL02P14 and at least one starts with POL02P26.
So the basic logic would be something like this:
Select all rows where B is duplicate and
((at least one value in C starts with x and one with y) or ( at least one value in D starts with z and one with W)).
So the desired output should be like this:
Creator  | Title   | Barcode | Inv. No.
---------|---------|---------|----------
springer | Cellbio | 014678  | POL02P14x
springer | Cellbio | 026938  | POL02P26r
springer | Cellbio | 038745  |
spring   | Cellbit |         | POL02P147
spring   | Cellbit | 026938  | POL02P26j
spring   | Cellbit | 038745  |
Here is a sample spreadsheet more similar to the actual dataset which is fairly large:
https://docs.google.com/spreadsheets/d/1xj5LnOxIwEmcjnXD0trmvcCKJIGIcfDkARV80Hx5Fvc/edit?usp=sharing
I tried adapting formulas with similar logic, but I always get errors or unexpected results: either the query logic/syntax is wrong or there is a filter/array dimension mismatch.
Some examples (the column references are mixed up here because I was trying to reduce the number of columns):
=FILTER(query(list!A1:AR, "Select * where C starts with 'POL02P'"), list!B1:B<>"",COUNTIF(list!B1:B,list!B1:B)>1)
={results!A1:AR1;array_constrain(
query(
{Filter({results!A2:AR,results!AR2:AR},REGEXMATCH(results!D2:D, "^POL02P14|POL02P26"));
countif(index(Filter({results!A2:AR,results!AR2:AR},REGEXMATCH(results!D2:D, "^POL02P14|POL02P26")),0,45),
index(Filter({results!A2:AR,results!AR2:AR},REGEXMATCH(results!D2:D, "^POL02P14|POL02P26")),0,45))}
,"Select * where Col46>1")
,9^9,44)}
=query(FILTER({list!A2:A&list!J2:J,list!A2:J,
iferror(
vlookup(list!A2:A&list!J2:J,query(query(filter(list!A2:A&
list!J2:J,REGEXMATCH(list!C2:C, "^POL02P14|POL02P26")),
"select Col4, count(Col4) where Col4 <> '' group by Col4"),
"select Col4 where Col2 >1 "),1,false))},REGEXMATCH(list!C2:C, "^POL02P14|POL02P26")),
"select Col1, Col2, Col3, Col5, Col6, Col7, Col8, Col9, Col10, Col11 where Col12 <> ''
order by Col3 asc, Col11 asc")
Please try this out in your sample sheet:
={results!A1:AR1;FILTER(results!A2:AR,REGEXMATCH(results!B2:B,JOIN("|","^"&LAMBDA(z,LAMBDA(x,y,z,{filter(filter(x,y="014"),xmatch(filter(x,y="014"),filter(x,y="026")));filter(filter(x,z="POL02P14"),xmatch(filter(x,z="POL02P14"),filter(x,z="POL02P26")))})(INDEX(z,,1),INDEX(z,,2),INDEX(z,,3)))((UNIQUE(FILTER({results!B2:B,LEFT(results!C2:C,3),LEFT(results!D2:D,8)},results!B2:B<>"",results!D2:D<>""))))&"$")))}
Formula logic at a glance:
1. Filter Col_B (Title) in 4 ways (matches to 014, 026, POL02P14, POL02P26).
2. Capture the Col_B values that have both 014 and 026.
3. Capture the Col_B values that have both POL02P14 and POL02P26.
4. Shortlist the Col_B values for which either step 2 OR step 3 above is TRUE.
5. Once the list is finalised, join the values for a REGEXMATCH against Col_B to produce the final output.
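For cross-checking the sheet formula, the same selection logic can be sketched outside Sheets. This is only an illustration in plain Python, assuming the sample rows from the question; the function names `has_both` and `select_rows` are hypothetical helpers, not part of any library.

```python
from collections import defaultdict

# Sample rows from the question: (Creator, Title, Barcode, Inv. No.)
ROWS = [
    ("springer", "Cellbio", "014678", "POL02P14x"),
    ("springer", "Cellbio", "026938", "POL02P26r"),
    ("springer", "Cellbio", "038745", ""),
    ("nature", "Cellular", "026672", "POL02P26h"),
    ("elsevier", "Biomed", "026678", "POL02P26g"),
    ("elsevier", "Biomed", "026678", "POL02P26g"),
    ("spring", "Cellbit", "", "POL02P147"),
    ("spring", "Cellbit", "026938", "POL02P26j"),
    ("spring", "Cellbit", "038745", ""),
]


def has_both(values, first, second):
    """True if some value starts with `first` and some value starts with `second`."""
    return (any(v.startswith(first) for v in values)
            and any(v.startswith(second) for v in values))


def select_rows(rows):
    # Group rows by Title (column B).
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)

    # Keep titles that are duplicated and satisfy either prefix pair.
    keep = set()
    for title, members in groups.items():
        if len(members) < 2:
            continue
        barcodes = [m[2] for m in members]
        inv_numbers = [m[3] for m in members]
        if (has_both(barcodes, "014", "026")
                or has_both(inv_numbers, "POL02P14", "POL02P26")):
            keep.add(title)

    # Return the matching rows in their original order.
    return [row for row in rows if row[1] in keep]


result = select_rows(ROWS)
for row in result:
    print(row)
```

On the sample data this keeps the three Cellbio rows (matched on Barcode) and the three Cellbit rows (matched on Inv. No.), which agrees with the desired output in the question.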
I'm pretty new to ArrayFormula. I have been trying, but sometimes the formula works and sometimes it does not. What I'm trying to do is combine ArrayFormula and COUNTIF to search for partial text.
As shown in the worksheet below, there are 10 subjects (column A), each subject has at least one of 4 samples (A,B,C,D) summarized as a string (column B). What I'm trying to do is to find which subject has sample A or B or C or D.
I have tried a single formula for each sample, e.g. in cell D3:
=IF(COUNTIF($B3,"*"&$D$2&"*")>0,$A3,"")
It returns the correct results. However, when I try ARRAYFORMULA in cell I3,
=arrayformula(IF(COUNTIF($B3:B,"*"&$D$2&"*")>0,$A3:A,""))
the answers are wrong. For example, subjects (Gamma, Zeta, Eta, Theta) who don't have sample "A" are shown as having sample "A". The same applies to samples B, C and D.
Not sure what went wrong in here. Here is the link to the worksheet
I wouldn't use COUNTIF or an array formula. Use FILTER instead. Put this formula in cell I3.
=Filter(if(REGEXMATCH(B3:B,$D$2),A3:A,""),B3:B<>"")
try:
=INDEX(QUERY(IFERROR(TRIM(SPLIT(FLATTEN(IF(IFERROR(SPLIT(B3:B, ","))="",,
SPLIT(B3:B, ",")&"×"&A3:A)), "×"))),
"select max(Col2) where Col2 is not null group by Col2 pivot Col1"))
or use in row 2 if you want to sort it as in your example:
=INDEX(IFNA(VLOOKUP(A2:A, QUERY(IFERROR(TRIM(SPLIT(FLATTEN(
IF(IFERROR(SPLIT(B3:B, ","))="",,SPLIT(B3:B, ",")&"×"&A3:A)), "×"))),
"select Col2,max(Col2) where Col2 is not null group by Col2
pivot Col1 label Col2'Subjects'"), {2,3,4,5}, 0)))
You can accomplish all four columns of results with a single formula.
Delete all formulas from I3:L3.
Place the following formula into I3:
=ArrayFormula(IF(REGEXMATCH(B3:B,I2:L2),A3:A,))
In plain speech, this reads: "If anything in B3:B matches a value found in I2:L2, return A3:A in the matching column(s) at the matching row(s); if not, return null."
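The row-by-column matching described above can be sketched in Python. The subject names and sample strings below are hypothetical stand-ins (the linked worksheet is not reproduced here), and `match_grid` is an illustrative helper. The original COUNTIF attempt presumably failed because COUNTIF over the whole range yields one count for the range rather than one result per row, whereas a per-row test like this one does not.

```python
import re

# Hypothetical data: (subject, comma-separated samples held by that subject).
holdings = [
    ("Alpha", "A,B"),
    ("Beta", "C"),
    ("Gamma", "B,D"),
]
samples = ["A", "B", "C", "D"]  # plays the role of the header cells I2:L2


def match_grid(holdings, samples):
    """One output row per subject, one output column per sample."""
    return [
        [subject if re.search(sample, held) else "" for sample in samples]
        for subject, held in holdings
    ]


grid = match_grid(holdings, samples)
for row in grid:
    print(row)
```

Each cell is decided by testing one row's string against one header value, which is what the REGEXMATCH over B3:B and I2:L2 does when it expands into a two-dimensional result.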
I have a table that is a result of a few smaller tables merged together.
It's a result of searching a few sheets for rows that meet filtering criteria.
I wanted to remove empty rows using QUERY formula but it works in a strange way!
Normally
=QUERY(A1:Z,"Select *",0)
should return a full table. But not in this case.
What I am actually trying to do is remove empty rows. I tried:
=QUERY({A1:Z},"Select * where Col5 is not null",0)
as column E is empty only when whole row is empty. But it does not work. It seems to ignore string values and sees only numbers.
Here is a dummy table.
https://docs.google.com/spreadsheets/d/12QmFW9vlx4ToHsQYGmkXK4aLD2jlI30FV1wYgxa0V8c/copy
It looks like this:
When I apply Query that should cut empty rows, my result table looks like this:
It seems to cut all the rows without number value in Column A (strange!)
Note: Table is generated by a very long formula that searches multiple sheets. Whenever result is not found in one of sheets, formula returns empty row. So I need a solution to wrap around existing formula. Normally QUERY is a way to go, but not this time.
I know that I can add an extra step: make one more sheet and use FILTER:
=filter(Sheet1!A1:Z,Sheet1!E:E<>"")
However, this solution adds bulk to my spreadsheet.
If you convert Col E to text (TO_TEXT), you can run the query without worrying about mixed data:
=index(query({Sheet1!A:D,to_text(Sheet1!E:E),Sheet1!F:Z}, "select * where Col5 is not null ",0))
QUERY only returns the predominant data type per column. Your E column has mixed data types (strings and numbers, with numbers being predominant), so anything that is not a number becomes null and is therefore ruled out by the QUERY.
As for how to solve it, that would be difficult to impossible to do given your sample spreadsheet only, since we can't see the actual formula that generates the initial output shown in sample Sheet1.
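The "predominant data type" behaviour described above can be modelled roughly in Python. This is a simplified sketch and not Google's actual implementation: it assumes a column is a list of ints and strings, treats "" as an empty cell, and uses the hypothetical helper name `query_like_column`.

```python
from collections import Counter


def query_like_column(column):
    """Return the column with empty and minority-type values replaced by None."""
    present = [v for v in column if v != ""]
    if not present:
        return [None] * len(column)
    # The most common Python type stands in for the column's predominant type.
    majority_type = Counter(type(v) for v in present).most_common(1)[0][0]
    return [v if v != "" and type(v) is majority_type else None for v in column]


mixed = [10, 20, "n/a", 30, ""]

# Mostly numbers: the string is nulled out, just like the empty cell.
as_numbers = query_like_column(mixed)

# Converting everything to text first (the TO_TEXT idea) keeps every non-empty value.
as_text = query_like_column([str(v) for v in mixed])

print(as_numbers)
print(as_text)
```

In this model, a test like "is not null" on the mixed column would drop the row holding "n/a" even though the cell is not empty, while the text-converted column only loses the row that is really empty.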
there are ways but "short-formula lovers" will hate it... for example:
=ARRAYFORMULA(IF(ISNUMBER(QUERY(TO_TEXT(Sheet1!A:Z), "where Col5 is not null", 0)*1),
IFERROR(1/(1/QUERY(TO_TEXT(Sheet1!A:Z), "where Col5 is not null", 0)*1)),
QUERY(TO_TEXT(Sheet1!A:Z), "where Col5 is not null", 0)))
or:
=ARRAYFORMULA(IF(QUERY(TO_TEXT(Sheet1!A:Z), "where Col5 is not null", 0)<>"",
IF(ISNUMBER(QUERY(TO_TEXT(Sheet1!A:Z), "where Col5 is not null", 0)*1),
QUERY(TO_TEXT(Sheet1!A:Z), "where Col5 is not null", 0)*1,
QUERY(TO_TEXT(Sheet1!A:Z), "where Col5 is not null", 0)), ))
if you need zeros
or try like this:
=FILTER(Sheet1!A:Z, TRIM(FLATTEN(QUERY(TRANSPOSE(Sheet1!A:Z),,9^9)))<>"")
=FILTER(your_formula, TRIM(FLATTEN(QUERY(TRANSPOSE(your_formula),,9^9)))<>"")
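The FILTER variant above works by joining every row into one string and keeping only rows whose joined string is non-blank. A minimal Python sketch of that idea follows, using a hypothetical table and the illustrative helper name `drop_empty_rows`.

```python
def drop_empty_rows(table):
    """Keep rows whose cells, joined together, contain something other than whitespace."""
    return [
        row for row in table
        if "".join(str(cell) for cell in row).strip()
    ]


# Hypothetical table: the second and third rows are effectively empty.
table = [
    ["a", 1, ""],
    ["", "", ""],
    ["", " ", ""],
    ["b", "", 2],
]

cleaned = drop_empty_rows(table)
print(cleaned)
```

Because the test looks at the whole row rather than at a single column, it does not depend on any one column having a consistent data type.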
I have two columns (see screenshot)
How can I create a formula that sums the LATEST values in the second column, subject to a criterion on column A?
For example, I need the sum of the last 6 values (from column B) for the cells (in column A) that start with HH, i.e. counting from the bottom.
I know how to make a sum of all values (from column B) containing HH (from column A)
=SUMIF(A1:A;"HH"&"*";B1:B)
P.S. HH and * are separate because I'll substitute the HH with a cell reference,
but now I need to limit this to the last N values (let's say the last 3 values).
P.P.S.
=SUMPRODUCT((COUNTIFS(A1:A;"exact text";ROW(A1:A)*{1;1};">="&ROW(A1:A)*{1;1})<=3)*(A1:A="exact text");B1:B)
This works so far ONLY if I write the exact text, not with patterns like HH*.
Maybe try
=sum(index(query({row(A1:A), A1:B}, "Select Col3 where Col2 contains 'HH' order by Col1 desc limit 6")))
and see if that works?
Note:
*the string HH can also be in a cell (e.g. D1)
=sum(index(query({row(A1:A), A1:B}, "Select Col3 where Col2 contains '"&D1&"' order by Col1 desc limit 6")))
*6 indicates the number of values you want to sum
EDIT: For your locale you'll need to use in G1
=sum(index(query({row($B$1:$B) \ $B$1:$C}; "Select Col3 where Col2 contains '"&E2&"' order by Col1 desc limit 3")))
and fill down. See if that works?
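The steps of the QUERY approach above (keep rows whose label contains the string, walk from the bottom, take at most N values, sum them) can be sketched in Python for checking results by hand. The labels and values below are hypothetical, as is the helper name `sum_last_matching`.

```python
def sum_last_matching(labels, values, needle, n):
    """Sum the last `n` values whose label contains `needle`."""
    matching = [value for label, value in zip(labels, values) if needle in label]
    # Guard against n == 0 and against n larger than the number of matches.
    start = max(len(matching) - n, 0)
    return sum(matching[start:])


# Hypothetical data standing in for columns A and B.
labels = ["HH1", "AB1", "HH2", "HH3", "AB2", "HH4"]
values = [1, 10, 2, 3, 20, 4]

total = sum_last_matching(labels, values, "HH", 3)
print(total)
```

Here the matching values are 1, 2, 3 and 4, so the last three sum to 9; asking for more values than exist simply sums all the matches.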
This should also work, not sure if it's any easier to understand/ less complicated than any other approach:
=SUM(SORTN(REGEXMATCH(B:B;E2)*C:C;3;0;ROW(B:B)*REGEXMATCH(B:B;E2);0))
Note the number 3 for the number of values you want from the bottom. and the reference to E2, which is "HH" as on your sample sheet.
use:
=QUERY(FILTER({IFNA(REGEXEXTRACT(SORT(B2:B; ROW(B2:B); 0);
"^([A-Za-z]{1,3})\d"))\SORT(C2:C; ROW(B2:B); 0)}; COUNTIFS(
REGEXEXTRACT(SORT(B2:B; ROW(B2:B); 0); "^([A-Za-z]{1,3})\d");
REGEXEXTRACT(SORT(B2:B; ROW(B2:B); 0); "^([A-Za-z]{1,3})\d");
ROW(H2:H43); "<="&ROW(H2:H43))<=3);
"select Col1,sum(Col2) group by Col1 label sum(Col2)''")
full explanation here
Using Google Sheets I want, within the same document, to import data from one sheet to another using IMPORTRANGE with conditions.
I have tried unsuccessfully:
=IF(IMPORTRANGE("https:URL","Inc Database!B2:B300")="permanent",IMPORTRANGE("htps://URL","Inc Database!A2:A300"),"")
and
=QUERY(IMPORTRANGE("https:/URL", "Inc Database!A2:A300"),"SELECT Col1 WHERE Col1 <> 'permanent'")
and
=FILTER(IMPORTRANGE("URL","Inc Database!A1:A250"),IMPORTRANGE("URL","Inc Database!B1:B250"="venture permanent"))
I want the function to say: Import any values from range A that meet criterion "permanent" in range B.
A | B
_________|_________
Name |type
---------|-------
Henry |Permanent
William |Intern
John |Permanent
I have put a few examples in the following spreadsheet:
e.g. =QUERY(IMPORTRANGE("https://docs.google.com/spreadsheets/d/1LX7JfbGvgBTfmXsYZz0u5J7K0O6oHJSnBowKQJGa9lY/edit#gid=0", "Inc Database!A2:B300"),"SELECT Col1 WHERE not(Col2 = 'Permanent') ")
You need single quotes around the name of the sheet/tab in the reference, since there is a space in the name. Using your example:
IMPORTRANGE("https:/URL", "'Inc Database'!A2:A300")
But this will only import column A, so you cannot check against column B
Then use the Query. If you want everything where B is 'Permanent' then you want (untested):
=QUERY(IMPORTRANGE("https:/URL", "'Inc Database'!A2:B"),"SELECT Col1 WHERE Col2 = 'Permanent'")
This will:
Import all of the rows, starting at A2 from the main data sheet to use in the Query().
Via Query, return only those where Col2 (B) contains 'Permanent'
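What the import-then-filter step is meant to return can be sketched in Python, assuming the three sample rows from the question; `select_names` is a hypothetical helper. The comparison is an exact, case-sensitive match, which is also how string equality behaves in the QUERY language, so 'permanent' would not match 'Permanent'.

```python
# Sample rows from the question: (Name, type), i.e. columns A and B.
rows = [
    ("Henry", "Permanent"),
    ("William", "Intern"),
    ("John", "Permanent"),
]


def select_names(rows, wanted_type):
    """Return column A for rows whose column B equals wanted_type exactly."""
    return [name for name, kind in rows if kind == wanted_type]


permanent = select_names(rows, "Permanent")
lowercase = select_names(rows, "permanent")

print(permanent)
print(lowercase)
```

The lowercase lookup returns nothing because the match is exact, which may be one reason the earlier attempts with 'permanent' came back empty.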