I'm trying to join two PySpark dataframes. One contains my measurement data, the other one contains release information for my measurement equipment. I want to add the release information to the measurement data like this:
Input:
measure data:

logger_id  measure_date         data
394        2018-07-09T09:25:40  some data
394        2018-08-23T09:51:18  other data
394        2019-04-23T09:51:18  other data
398        2018-01-10T12:15:53  more data
398        2019-10-24T08:10:25  other data
release data:

logger_id  release_date         release_information
394        2018-07-01T00:00:00  release information
394        2019-04-01T00:00:00  release information
398        2018-01-01T00:00:00  release information
398        2019-07-01T00:00:00  release information
and I want an output like this:
logger_id  measure_date         data        release_date         release_information
394        2018-07-09T09:25:40  some data   2018-07-01T00:00:00  release information
394        2018-08-23T09:51:18  other data  2018-07-01T00:00:00  release information
394        2019-04-23T09:51:18  other data  2019-04-01T00:00:00  release information
398        2018-01-10T12:15:53  more data   2018-01-01T00:00:00  release information
398        2019-10-24T08:10:25  other data  2019-07-01T00:00:00  release information
I've already tried
cond = [release_data.release_date < measure_data.measure_date, release_data.logger_id == measure_data.logger_id]
measure_data.join(release_data, cond, how='fullouter')
But in the resulting dataframe I get the release rows with null values in the columns from the measure dataframe.
I also considered iterating through my measure dataframe and adding the release information for every row, but since it is really large, I don't want to do that.
You can transform release_df to add a column that marks until when each release is valid; lead can be used for this.
Once release_valid_end is included, the join condition becomes a date-range check: measure_date must lie between release_date and release_valid_end.
from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql import Window as W
measure_data = [(394, datetime.strptime("2018-07-09T09:25:40", "%Y-%m-%dT%H:%M:%S"), "some data",),
(394, datetime.strptime("2018-08-23T09:51:18", "%Y-%m-%dT%H:%M:%S"), "other data",),
(394, datetime.strptime("2019-04-23T09:51:18", "%Y-%m-%dT%H:%M:%S"), "other data",),
(398, datetime.strptime("2018-01-10T12:15:53", "%Y-%m-%dT%H:%M:%S"), "more data",),
(398, datetime.strptime("2019-10-24T08:10:25", "%Y-%m-%dT%H:%M:%S"), "other data",), ]
release_data = [(394, datetime.strptime("2018-07-01T00:00:00", "%Y-%m-%dT%H:%M:%S"), "release information",),
(394, datetime.strptime("2019-04-01T00:00:00", "%Y-%m-%dT%H:%M:%S"), "release information",),
(398, datetime.strptime("2018-01-01T00:00:00", "%Y-%m-%dT%H:%M:%S"), "release information",),
(398, datetime.strptime("2019-07-01T00:00:00", "%Y-%m-%dT%H:%M:%S"), "release information",), ]
measure_df = spark.createDataFrame(measure_data, ("logger_id", "measure_date", "data",))
release_df = spark.createDataFrame(release_data, ("logger_id", "release_date", "release_information",))
world_end_date = datetime.strptime("2999-12-31T00:00:00", "%Y-%m-%dT%H:%M:%S")
window_spec = W.partitionBy("logger_id").orderBy(F.asc("release_date"))
release_validity_df = release_df.withColumn("release_valid_end",
F.lead("release_date", offset=1, default=world_end_date).over(window_spec))
(measure_df.join(release_validity_df,
                 (measure_df["logger_id"] == release_validity_df["logger_id"]) &
                 (measure_df["measure_date"] >= release_validity_df["release_date"]) &
                 (measure_df["measure_date"] < release_validity_df["release_valid_end"]))
           .select(measure_df["logger_id"], "measure_date", "data", "release_date", "release_information")
           .show())
Output
+---------+-------------------+----------+-------------------+-------------------+
|logger_id| measure_date| data| release_date|release_information|
+---------+-------------------+----------+-------------------+-------------------+
| 398|2018-01-10 12:15:53| more data|2018-01-01 00:00:00|release information|
| 398|2019-10-24 08:10:25|other data|2019-07-01 00:00:00|release information|
| 394|2018-07-09 09:25:40| some data|2018-07-01 00:00:00|release information|
| 394|2018-08-23 09:51:18|other data|2018-07-01 00:00:00|release information|
| 394|2019-04-23 09:51:18|other data|2019-04-01 00:00:00|release information|
+---------+-------------------+----------+-------------------+-------------------+
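For completeness, a sketch of an alternative that skips building the validity window: left-join every earlier release onto each measurement, then keep only the most recent release per measurement row. This assumes the same measure_df and release_df defined above.

from pyspark.sql import functions as F
from pyspark.sql import Window as W

# Join each measurement with all releases of the same logger that are not newer than it.
joined = (measure_df.alias("m")
          .join(release_df.alias("r"),
                (F.col("m.logger_id") == F.col("r.logger_id")) &
                (F.col("r.release_date") <= F.col("m.measure_date")),
                "left"))

# Keep only the latest matching release per measurement row.
w = W.partitionBy("m.logger_id", "m.measure_date").orderBy(F.desc("r.release_date"))
(joined.withColumn("rn", F.row_number().over(w))
       .filter(F.col("rn") == 1)
       .select("m.logger_id", "m.measure_date", "m.data",
               "r.release_date", "r.release_information")
       .show())

The left join also keeps measurements that have no earlier release (their release columns stay null), which the inner join above would drop.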
Related
I've currently got the below set of smoothed data:
print(df_smooth.dropna())
mean std skew kurtosis peak2peak rms crestFactor \
4 0.247555 2.100961 0.001668 3.024679 20.628402 2.115862 5.066747
5 0.237015 2.062690 -0.000792 3.029156 20.314159 2.076466 5.043114
6 0.230783 2.044657 -0.001680 3.028746 20.219575 2.057846 5.030472
7 0.235838 1.986232 -0.001031 3.025417 19.497090 2.000425 4.960363
8 0.235062 1.984086 -0.001014 3.031342 19.817176 1.998209 4.989612
9 0.238660 1.968814 -0.001608 3.023882 19.340179 1.983427 4.998115
10 0.223305 1.975597 -0.000197 3.045224 19.701747 1.988305 5.135947
11 0.219480 2.007902 -0.002460 3.060428 20.252087 2.020074 5.117502
12 0.214518 2.071287 -0.002944 3.092217 21.489908 2.082439 5.302407
13 0.244281 2.122538 -0.003717 3.094335 21.792449 2.137164 5.271366
14 0.235806 2.161333 -0.003364 3.123866 23.128965 2.174895 5.472129
15 0.233630 2.175946 -0.002682 3.152740 24.045300 2.189226 5.610038
16 0.236764 2.188906 -0.000032 3.203623 24.745386 2.202420 5.772337
17 0.262289 2.205111 0.000350 3.192511 24.708587 2.221785 5.681394
18 0.229795 2.139946 0.001239 3.183109 23.745617 2.152940 5.564731
19 0.243538 2.150018 0.001071 3.170558 23.385026 2.164355 5.427326
20 0.266458 2.097468 -0.000830 3.144338 22.084817 2.115172 5.236667
21 0.280729 2.106302 -0.000618 3.101014 21.434129 2.125517 5.147621
22 0.252042 2.078190 0.000259 3.100911 20.991519 2.093988 5.231684
23 0.252297 2.097652 0.000383 3.126250 21.790854 2.113380 5.378267
24 0.250502 2.078781 0.000042 3.129014 21.559732 2.094428 5.340024
25 0.220506 2.070573 0.001974 3.110477 21.473643 2.082461 5.364519
26 0.204412 2.049979 -0.000306 3.227532 22.975315 2.060236 5.706146
27 0.215429 2.103150 -0.001421 3.275257 23.719901 2.114265 5.660891
28 0.216689 2.137870 -0.001783 3.298750 24.040561 2.148948 5.614089
29 0.208962 2.160487 0.000547 3.349068 24.546959 2.170628 5.732873
30 0.227231 2.267705 0.000101 3.413948 25.958169 2.279131 5.745555
31 0.221097 2.258519 0.001567 3.379193 25.424651 2.269446 5.662354
32 0.204962 2.224569 0.000951 3.458483 25.984242 2.234101 5.862379
33 0.224707 2.283631 0.000046 3.516125 27.410217 2.294934 6.024091
34 0.248792 2.354713 -0.001143 3.630634 29.159253 2.368248 6.197140
35 0.229501 2.339020 -0.000673 3.743356 30.695670 2.350898 6.613011
36 0.255474 2.454993 -0.001164 3.780962 32.480614 2.468843 6.627903
37 0.257979 2.530495 0.000630 3.962767 33.656646 2.544310 6.661273
38 0.232977 2.498537 0.001111 3.931879 32.754947 2.510044 6.557506
39 0.237025 2.392735 -0.000920 3.919665 31.277647 2.405969 6.494115
40 0.243630 2.368295 -0.001569 3.812383 29.306347 2.382131 6.077379
41 0.221252 2.305374 -0.000861 4.032235 29.548822 2.317355 6.292428
42 0.215262 2.254417 -0.002057 3.977328 28.970507 2.266098 6.353168
43 0.208581 2.240020 -0.001403 4.154288 30.121039 2.251270 6.630079
44 0.170230 2.302794 -0.001867 4.307822 31.556097 2.309174 6.838202
45 0.168889 2.353960 -0.001309 4.433633 32.825109 2.360053 6.977719
46 0.163156 2.337222 -0.001097 4.238485 31.344888 2.342934 6.658564
47 0.165685 2.369817 -0.002246 4.151915 31.154929 2.375626 6.438286
48 0.190677 2.552397 -0.003645 4.311166 33.473407 2.559565 6.428513
49 0.210200 2.667889 0.004168 4.495159 35.625185 2.676223 6.500683
I want to use scikit-learn's mutual information classification (mutual_info_classif) to test for monotonicity in this dataset, but I'm having trouble with the syntax (more specifically around the X value) and with splitting the full dataset into test and train sets.
I only want 40% of the dataset to be used as the "test data".
Currently this is the command I have:
X_train,X_test,y_train,y_test=train_test_split(df_smooth.dropna(),
test_size=0.4,
random_state=0)
print(X_train)
This is the error I get:
ValueError: not enough values to unpack (expected 4, got 2)
The output I want is something like this:
[Monotonicity bar chart, descending]
where the mutual information values are ranked from highest to lowest.
Using the following command:
from sklearn.feature_selection import mutual_info_classif
mutual_info = mutual_info_classif(X_train, y_train)
mutual_info
I tried extracting the ordered numbers 1-49 from the dataframe (which I believe is what would be used as the "X" input to the mutual information function), but they don't seem to be part of the dataframe when accessed with iloc[:,0] (which returns the values in the "mean" column instead, since the index is not a column). I also don't know how this would account for the rows dropped as "n/a".
If you're testing for something like "the degree of monotonicity between two variables," you're probably looking for Spearman's rank correlation coefficient, which is implemented in scipy.stats.spearmanr:
MRE:
from io import StringIO
import pandas as pd
from scipy import stats
data = StringIO("""mean,std,skew,kurtosis,peak2peak,rms,crestFactor
0.247555,2.100961,0.001668,3.024679,20.628402,2.115862,5.066747
0.237015,2.062690,-0.000792,3.029156,20.314159,2.076466,5.043114
0.230783,2.044657,-0.001680,3.028746,20.219575,2.057846,5.030472
0.235838,1.986232,-0.001031,3.025417,19.497090,2.000425,4.960363
0.235062,1.984086,-0.001014,3.031342,19.817176,1.998209,4.989612
""")
df = pd.read_csv(data)
for var in df.columns:
    print(f"{var} {stats.spearmanr(df[var], range(len(df))).correlation:.2f}")
Comparing the first five values of each column to the strictly monotonic sequence range() yields the following table, suggesting the first few samples are antimonotone:
mean -0.70
std -1.00
skew -0.60
kurtosis 0.60
peak2peak -0.90
rms -1.00
crestFactor -0.90
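As a side note on the ValueError from the question: train_test_split only returns four arrays when it is given both a feature matrix and a target; called with a single DataFrame it returns just a train/test pair, hence "expected 4, got 2". A minimal sketch of the intended mutual_info_classif workflow, assuming a discrete label column (called "target" here purely for illustration, the smoothed frame shown above does not contain one):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif

df = df_smooth.dropna()
X = df.drop(columns=["target"])   # "target" is a hypothetical label column
y = df["target"]

# Passing both X and y is what makes train_test_split return four pieces.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

mutual_info = mutual_info_classif(X_train, y_train)

# Rank the scores from highest to lowest, e.g. for a descending bar chart.
mi_series = pd.Series(mutual_info, index=X_train.columns).sort_values(ascending=False)
mi_series.plot.bar()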
I'm trying to clean my data in JupyterLab by following several tutorials, but I keep getting one error or another every time. So I thought I'd come on Stack Overflow and ask if someone can help me.
This is the csv file I want to clean: https://1drv.ms/u/s!AvOXB8kb-IHBgjaveis044GVoPpk
I'm building a machine learning model, so I want to convert all the object (string) values to numeric ones, but I don't know how to.
EDIT: I tried cleaning the data from scratch.
My code input:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
criminal_data = pd.read_csv('database2.csv')
X = criminal_data.drop(columns=['Agency Type', 'City', 'State',
'Crime Solved'])
y = criminal_data['City']
model = DecisionTreeClassifier()
model.fit(X, y)
criminal_data
The error message:
ValueError                                Traceback (most recent call last)
<ipython-input-117-4b6968f9994f> in <module>
6 y = criminal_data['City']
7 model = DecisionTreeClassifier()
----> 8 model.fit(X, y)
9 criminal_data
~\anaconda3\lib\site-packages\sklearn\tree\_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
896 """
897
--> 898 super().fit(
899 X, y,
900 sample_weight=sample_weight,
~\anaconda3\lib\site-packages\sklearn\tree\_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
154 check_X_params = dict(dtype=DTYPE, accept_sparse="csc")
155 check_y_params = dict(ensure_2d=False, dtype=None)
--> 156 X, y = self._validate_data(X, y,
157 validate_separately=(check_X_params,
158 check_y_params))
~\anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
428 # :(
429                 check_X_params, check_y_params = validate_separately
--> 430 X = check_array(X, **check_X_params)
431 y = check_array(y, **check_y_params)
432 else:
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
614                 array = array.astype(dtype, casting="unsafe", copy=False)
615 else:
--> 616 array = np.asarray(array, order=order, dtype=dtype)
617 except ComplexWarning as complex_warning:
618 raise ValueError("Complex data not supported\n"
~\anaconda3\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order, like)
100     return _asarray_with_like(a, dtype=dtype, order=order, like=like)
101
--> 102 return array(a, dtype, copy=False, order=order)
103
104
~\anaconda3\lib\site-packages\pandas\core\generic.py in __array__(self, dtype)
1897
1898 def __array__(self, dtype=None) -> np.ndarray:
-> 1899 return np.asarray(self._values, dtype=dtype)
1900
1901 def __array_wrap__(
~\anaconda3\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order, like)
    100     return _asarray_with_like(a, dtype=dtype, order=order, like=like)
101
--> 102 return array(a, dtype, copy=False, order=order)
103
104
ValueError: could not convert string to float: 'Anchorage'
You are trying to train your model on data that is not numerical. Before fitting the model, you need to encode it; you can try LabelEncoder for that.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for column_name in X.columns:
    if X[column_name].dtype == object:
        X[column_name] = le.fit_transform(X[column_name])
    else:
        pass
If you have a combination of different data types in a column, try the following:
import numpy as np
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
for column_name in X.columns:
    X[column_name] = X[column_name].replace(np.nan, 'none', regex=True)
    X[column_name] = le.fit_transform(X[column_name].astype(str))
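Once the object columns are encoded, the fit call from the question should go through. A quick sketch using the X and y from the question (note that y = criminal_data['City'] can stay as strings, since scikit-learn encodes class labels internally):

from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier

le = preprocessing.LabelEncoder()
for column_name in X.columns:
    if X[column_name].dtype == object:
        X[column_name] = le.fit_transform(X[column_name].astype(str))

model = DecisionTreeClassifier()
model.fit(X, y)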
My table:
Report_Period  Entity    Tag  Users Count  Report_Period_M-1  Report_Period_Q-1  ...
2017-06-30     entity 1  X    471          2017-05-31         2017-03-31         ...
2020-12-31     entity 2  A    135          2020-11-30         2020-09-30         ...
2020-11-30     entity 3  X    402          2020-10-31         2020-08-31         ...
What I want:
Report_Period  Entity    Tag  Users Count  Users_Count_M-1  Users_Count_Q-1  ...
2017-06-30     entity 1  X    471          450              438              ...
2020-12-31     entity 2  A    135          122              118              ...
2020-11-30     entity 3  X    402          380              380              ...
I have tried this code, but it duplicates records! How can I avoid that?
SELECT M."Entity",M."Tag",M."Report_Period",M."Users Count",
M."Report_Period_M-1",M1."Users Count" AS "Users Count M1",
FROM "DB"."SCHEMA"."PERIOD" M, "DB"."SCHEMA"."PERIOD" M1
WHERE M."Report_Period_M-1"= M1."Report_Period"
Your join condition should also include the Entity column (and, I suspect, Tag):
SELECT M."Entity",
M."Tag",
M."Report_Period",
M."Users Count",
M."Report_Period_M-1",
M1."Users Count" AS "Users Count M1",
FROM "DB"."SCHEMA"."PERIOD" M,
"DB"."SCHEMA"."PERIOD" M1
WHERE M."Report_Period_M-1"= M1."Report_Period"
AND M."Entity" = M1."Entity"
AND M."Tag" = M1."Tag"
I'm getting started with Biopython and I have a question about parsing results. I followed a tutorial to get going, and here is the code that I used:
from Bio.Blast import NCBIXML
for record in NCBIXML.parse(open("/Users/jcastrof/blast/pruebarpsb.xml")):
if record.alignments:
print "Query: %s..." % record.query[:60]
for align in record.alignments:
for hsp in align.hsps:
print " %s HSP,e=%f, from position %i to %i" \
% (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end)
Part of the result obtained is:
gnl|CDD|225858 HSP,e=0.000000, from position 32 to 1118
gnl|CDD|225858 HSP,e=0.000000, from position 1775 to 2671
gnl|CDD|214836 HSP,e=0.000000, from position 37 to 458
gnl|CDD|214836 HSP,e=0.000000, from position 1775 to 2192
gnl|CDD|214838 HSP,e=0.000000, from position 567 to 850
And what I want to do is to sort that result by position of the hit (Hsp_hit-from), like this:
gnl|CDD|225858 HSP,e=0.000000, from position 32 to 1118
gnl|CDD|214836 HSP,e=0.000000, from position 37 to 458
gnl|CDD|214838 HSP,e=0.000000, from position 567 to 850
gnl|CDD|225858 HSP,e=0.000000, from position 1775 to 2671
gnl|CDD|214836 HSP,e=0.000000, from position 1775 to 2192
My input file for rps-blast is a *.xml file.
Any suggestion to proceed?
Thanks!
The HSPs list is just a Python list, and can be sorted as usual. Try:
align.hsps.sort(key = lambda hsp: hsp.query_start)
However, you are dealing with a nested list (each match has a list of HSPs), and you want to sort over all of them. Here making your own list might be best - something like this:
for record in ...:
    print "Query: %s..." % record.query[:60]
    hits = sorted((hsp.query_start, hsp.query_end, hsp.expect, align.hit_id)
                  for align in record.alignments for hsp in align.hsps)
    for q_start, q_end, expect, hit_id in hits:
        print " %s HSP,e=%f, from position %i to %i" \
            % (hit_id, expect, q_start, q_end)
Peter
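For reference, a Python 3 sketch of the same flatten-and-sort idea (using the pruebarpsb.xml path from the question; NCBIXML.parse is still available in current Biopython):

from Bio.Blast import NCBIXML

with open("/Users/jcastrof/blast/pruebarpsb.xml") as handle:
    for record in NCBIXML.parse(handle):
        if not record.alignments:
            continue
        print("Query: %s..." % record.query[:60])
        # Flatten all (alignment, HSP) pairs, then sort by query start position.
        hits = sorted((hsp.query_start, hsp.query_end, hsp.expect, align.hit_id)
                      for align in record.alignments for hsp in align.hsps)
        for q_start, q_end, expect, hit_id in hits:
            print(" %s HSP,e=%f, from position %i to %i"
                  % (hit_id, expect, q_start, q_end))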
I'm using Entity Framework 4.0 and I'm having some issues with the syntax of my query. I'm trying to join two tables and pass a parameter to filter on at the same time. I would like to find all of the products in table 2 by finding the correlating value in table 1.
Can someone help me out with the syntax, please?
Thanks in advance.
sample data
table 1
ID productID categoryID
361 571 16
362 572 17
363 573 16
364 574 19
365 575 26
table 2
productID productCode
571 sku
572 sku
573 sku
574 sku
575 sku
var q = from i in context.table1
from it in context.table2
join <not sure>
where i.categoryID == it.categoryID and < parameter >
select e).Skip(value).Take(value));
foreach (var g in q)
{
Response.Write(g.productID);
}
// assuming categoryId and xyz are the parameter values you pass in
var q = (from i in context.table1
         join it in context.table2 on i.productID equals it.productID
         where i.categoryID == categoryId && it.productCode == xyz
         select i).Skip(value).Take(value);