Census data extraction for time series

I am trying to download the average population for Arizona counties using tidycensus, with the code below. How can I download population data as a time series covering 2000-2019, interpolating for the years that have neither decennial census nor ACS data?
library(tidycensus)
library(tidyverse)
# 2010 decennial count (note: get_decennial() takes `sumfile`, not `survey`)
soc.2010 <- get_decennial(geography = "county", state = "AZ", year = 2010, variables = c(pop = "P001001"), sumfile = "sf1")
# 2012-2016 five-year ACS estimate
soc.16 <- get_acs(geography = "county", state = "AZ", year = 2016, variables = c(pop = "B01003_001"), survey = "acs5") %>% mutate(Year = "2016")

You can use the tidycensus function get_estimates() to get population estimates by county for each year beginning in 2010.
library(tidycensus)
library(dplyr)

get_estimates(
  geography = "county",
  state = "AZ",
  product = "population",
  time_series = TRUE
) %>%
  filter(DATE >= 3) %>%        # DATE codes 1 and 2 are the April 2010 census count and estimates base
  mutate(year = DATE + 2007)   # DATE 3 is the July 2010 estimate, so DATE + 2007 gives the year
#> # A tibble: 300 x 6
#>    NAME                  DATE GEOID variable   value  year
#>    <chr>                <dbl> <chr> <chr>      <dbl> <dbl>
#>  1 Pima County, Arizona     3 04019 POP       981620  2010
#>  2 Pima County, Arizona     4 04019 POP       988381  2011
#>  3 Pima County, Arizona     5 04019 POP       993052  2012
#>  4 Pima County, Arizona     6 04019 POP       997127  2013
#>  5 Pima County, Arizona     7 04019 POP      1004229  2014
#>  6 Pima County, Arizona     8 04019 POP      1009103  2015
#>  7 Pima County, Arizona     9 04019 POP      1016707  2016
#>  8 Pima County, Arizona    10 04019 POP      1026391  2017
#>  9 Pima County, Arizona    11 04019 POP      1036554  2018
#> 10 Pima County, Arizona    12 04019 POP      1047279  2019
#> # ... with 290 more rows
The API returns somewhat confusing date codes that I've converted to years. See the date code to year mapping for 2019 population estimates for more information.
For years prior to 2010, the Census API uses a different format that is not accessible via tidycensus. But here is an API call that gives you population by county by year for 2000 to 2010:
https://api.census.gov/data/2000/pep/int_population?get=GEONAME,POP,DATE_DESC&for=county:*&in=state:04
["Graham County, Arizona","33356","7/1/2001 population estimate","04","009"],
["Graham County, Arizona","33224","7/1/2002 population estimate","04","009"],
["Graham County, Arizona","32985","7/1/2003 population estimate","04","009"],
["Graham County, Arizona","32703","7/1/2004 population estimate","04","009"],
["Graham County, Arizona","32964","7/1/2005 population estimate","04","009"],
["Graham County, Arizona","33701","7/1/2006 population estimate","04","009"],
["Graham County, Arizona","35175","7/1/2007 population estimate","04","009"],
["Graham County, Arizona","36639","7/1/2008 population estimate","04","009"],
["Graham County, Arizona","37525","7/1/2009 population estimate","04","009"],
["Graham County, Arizona","37220","4/1/2010 Census 2010 population","04","009"],

Related

Google Sheets: Convert Horizontal Transaction Data into Chronological Statement + Combining Columns of Data

On a sheet named "Performance," I have data concerning stock trades in rows like so:
    A       B                C                  D       E        F           G         H          I            J
1   TICKER  TRADE OPEN DATE  TRADE CLOSED DATE  SHARES  AVG BUY  INVESTMENT  AVG SALE  PROCEEDS   PROFIT/LOSS  ROIC
2   ABC     01/05/22         03/31/22           107     $14.22   -$1,521.54  $15.00    $1,605.00  $83.46       5.49%
3   BCA     01/05/22         03/31/22           344     $14.52   -$4,994.88  $15.00    $5,160.00  $165.12      3.31%
4   CAB     01/05/22         03/31/22           526     $12.55   -$6,601.30  $13.00    $6,838.00  $236.70      3.59%
... and so forth ...
Within the same workbook, but on a separate sheet named "Contributions/Withdrawals," I have a list of contributions and withdrawals like so:
A B
1 DATE AMOUNT
2 01/05/22 $700.00
3 02/05/22 $700.00
4 03/05/22 $400.00
5 03/15/22 -$7,000.00
... and so forth ...
I need to convert the first table of trade transactions into a vertical column format exactly like what is in the Contributions/Withdrawals table. (Note that each trade transaction actually represents two transactions, one for opening with its own date, and one for closing with its date.) Finally, I need to stack both tables of transactions in date order to make a combined chronological list of transactions so that I can run an XIRR formula on it.
The resulting table on a sheet named, "Cash Flows," needs to look like this:
A B
1 DATE AMOUNT
2 01/05/22 -$1,521.54
3 01/05/22 -$4,994.88
4 01/05/22 -$6,601.30
5 01/05/22 $700.00
6 02/05/22 $700.00
7 03/05/22 $700.00
8 03/10/22 $400.00
9 03/15/22 -$7000.00
10 03/31/22 $1,605.00
11 03/31/22 $5,160.00
12 03/31/22 $6,838.00
Using the following in cell A2 and B2...
A2 =SORT({Performance!$B$2:$B;Performance!$C$2:$C;'Contributions/Withdrawals'!$A$2:$A})
B2 =SORT({Performance!$F$2:$F;Performance!$H$2:$H;'Contributions/Withdrawals'!$B$2:$B})
...almost gets me there, but the transactions are not lining up with the correct dates. Google Sheets is ordering the amounts from smallest to largest. What I end up with is this:
A B
1 DATE AMOUNT
2 01/05/22 -$7,000.00
3 01/05/22 -$6,602.72
4 01/05/22 -$6,602.39
5 01/05/22 -$6,601.30
6 01/05/22 -$6,596.40
7 01/05/22 -$6,587.10
8 01/05/22 -$4,994.88
9 01/05/22 -$3,315.26
10 01/05/22 -$3,284.91
11 01/05/22 -$1,521.54
12 02/05/22 $400.00
13 03/05/22 $700.00
14 03/10/22 $700.00
15 03/15/22 $700.00
16 03/31/22 $1,605.00
17 03/31/22 $3,249.00
18 03/31/22 $3,731.00
19 03/31/22 $5,160.00
20 03/31/22 $6,348.00
21 03/31/22 $6,532.00
22 03/31/22 $6,786.00
23 03/31/22 $6,838.00
Any help would be appreciated. Thanks!
You are very close indeed! You need to join the dates and the amounts into a single two-column array, so that SORT() keeps each amount next to its own date while ordering by the first (date) column:
=SORT({Performance!$B$2:$B,Performance!$F$2:$F;Performance!$C$2:$C,Performance!$H$2:$H;'Contributions/Withdrawals'!$A$2:$A,'Contributions/Withdrawals'!$B$2:$B})
(Depending on your locale settings, you may need to swap the commas for backslashes, since some locales use \ as the column separator in array literals.)
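For what it's worth, a quick reminder of the array-literal syntax this relies on (a generic illustration, not from the original answer): commas place ranges side by side as columns and semicolons stack the resulting blocks as rows, so {dates1,amounts1;dates2,amounts2} builds one two-column table. For example, =SORT({A2:A,B2:B;D2:D,E2:E}) stacks two date/amount pairs and sorts them together by the first column.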

Does the XGBoost regressor handle missing timesteps?

I have a dataframe of daily item sales; the goal is to forecast future sales for warehouse supply planning. I'm using XGBoost as the regressor.
date                 qta     prezzo  year  day  dayofyear  month  week  dayofweek  festivo
2014-01-02 00:00:00  6484.8  1       2014  2    2          1      1     3          1
2014-01-03 00:00:00  5300    1       2014  3    3          1      1     4          1
2014-01-04 00:00:00  2614.9  1.1     2014  4    4          1      1     5          1
2014-01-07 00:00:00  114.3   1.1     2014  7    7          1      2     1          0
2014-01-09 00:00:00  11490   1       2014  9    9          1      2     3          0
The date is also the index of my dataframe. qta (Italian for "quantity", the amount sold that day) is the label (the dependent variable), and all the other columns, including prezzo ("price"), are the features.
As you can see it is a daily sampling, but some days are missing (e.g. January 5, 6, and 8).
Could this be a problem when fitting the model and predicting future days?
Am I supposed to fill the missing days with qta = 0?
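For illustration only (not from the original question): a minimal pandas sketch of what filling the gaps could look like, using a hypothetical frame that mirrors the table above. Whether 0 is the right fill value depends on whether a missing day truly means zero sales rather than missing data.
import pandas as pd

# hypothetical frame mirroring the question's data, indexed by date
df = pd.DataFrame(
    {"qta": [6484.8, 5300, 2614.9, 114.3, 11490],
     "prezzo": [1, 1, 1.1, 1.1, 1]},
    index=pd.to_datetime(["2014-01-02", "2014-01-03", "2014-01-04",
                          "2014-01-07", "2014-01-09"]),
)

# reindex to an unbroken daily range; the missing days appear as NaN rows
full = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="D"))

# if a missing day really means "no sales", zero-fill the label,
# then rebuild the calendar features from the new index
full["qta"] = full["qta"].fillna(0)
full["dayofweek"] = full.index.dayofweek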

Bootstrap confidence intervals on mean by group are NA

I'm trying to construct confidence intervals around each group mean for a plot I've made, but the bootstrap method hasn't worked. I'm sure I'm doing this incorrectly, but the best example I found online for estimating confidence intervals around the mean of each group was:
# smean.cl.boot() comes from the Hmisc package
wet_pivot %>%
  select(n_mean, CYR) %>%
  group_by(CYR) %>%
  summarise(data = list(smean.cl.boot(cur_data(), conf.int = .95, B = 1000, na.rm = TRUE))) %>%
  tidyr::unnest_wider(data)
Result:
# A tibble: 13 x 4
CYR Mean Lower Upper
<dbl> <dbl> <dbl> <dbl>
1 2009 0.00697 NA NA
2 2010 0.000650 NA NA
3 2011 0.00288 NA NA
4 2012 0.0114 NA NA
5 2013 0.000536 NA NA
6 2014 0.00350 NA NA
7 2015 0.000483 NA NA
8 2016 0.00245 NA NA
9 2017 0.00292 NA NA
10 2018 0.00253 NA NA
11 2019 0.00196 NA NA
12 2020 0.00502 NA NA
13 2021 0.00132 NA NA
Am I making incorrect assumptions about my data with this method? Even if this worked, is it possible to manually add each confidence interval into a line plot using ggplot?
My data:
> head(wet_pivot)
WYR CYR Season N n_mean n_median sd se
1 2010 2009 WET 59 0.0069680693 0 0.030946706 0.0040289180
2 2011 2010 WET 63 0.0006497308 0 0.002489655 0.0003136671
3 2012 2011 WET 69 0.0028825655 0 0.010097383 0.0012155821
4 2013 2012 WET 70 0.0114108839 0 0.051577935 0.0061647423
5 2014 2013 WET 72 0.0005361741 0 0.003314688 0.0003906397
6 2015 2014 WET 71 0.0034958465 0 0.026606408 0.0031575998
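Not part of the original question, but two observations that may help: head(wet_pivot) shows a single summary row per CYR, so there is nothing for smean.cl.boot() to resample within a group (and it expects a numeric vector, while cur_data() hands it a data frame, which is a likely source of the NA bounds). Since the frame already carries n_mean and se, one hedged alternative is a normal-approximation interval, which also answers the ggplot question:
library(dplyr)
library(ggplot2)

# approximate 95% CI from the per-year mean and standard error already in the frame
wet_ci <- wet_pivot %>%
  mutate(lower = n_mean - 1.96 * se,
         upper = n_mean + 1.96 * se)

# line plot of the yearly means with manually added interval bars
ggplot(wet_ci, aes(x = CYR, y = n_mean)) +
  geom_line() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)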

Data Visualization & Machine Learning

Preprocess the data and compare the results before and after preprocessing (report as accuracy).
Draw the following charts:
Correlation heatmap chart
Missing-values heatmap chart
Line chart / scatter chart for Country vs Purchased, Age vs Purchased, and Salary vs Purchased
Country  Age  Salary  Purchased
France   44   72000   No
Spain    27   48000   Yes
Germany  30   54000   No
Spain    38   61000   No
Germany  40           Yes
France   35   58000   Yes
Spain         52000   No
France   48   79000   Yes
Germany  50   83000   No
France   37           Yes
France        18888   No
Spain    17   67890   Yes
Germany       12000   No
Spain    38   98888   No
Germany  50           Yes
France   35   58000   Yes
Spain         12345   No
France   23           Yes
Germany  55   78456   No
France        43215   Yes
Sometimes a scatter plot like Country vs Purchased is hard to read: all three countries in your list have purchases, so the points overlap. A heatmap can be more helpful here, as sketched after the code below.
import pandas as pd
from matplotlib import pyplot as plt

# read the CSV with pandas
df = pd.read_csv('Data.csv')
copydf = df.copy()  # .copy() keeps an untouched copy; plain assignment would alias the same frame

# before preprocessing
print(copydf)

# fill NaN values with the column means of Age and Salary
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# after preprocessing
print(df)

plt.figure(1)
# Country vs Purchased
plt.subplot(221)
plt.scatter(df['Country'], df['Purchased'])
plt.title('Country vs Purchased')
plt.grid(True)
# Age vs Purchased
plt.subplot(222)
plt.scatter(df['Age'], df['Purchased'])
plt.title('Age vs Purchased')
plt.grid(True)
# Salary vs Purchased
plt.subplot(223)
plt.scatter(df['Salary'], df['Purchased'])
plt.title('Salary vs Purchased')
plt.grid(True)
plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95,
                    hspace=0.75, wspace=0.5)
plt.show()
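And a minimal sketch of the heatmap idea mentioned above (my addition, assuming the same Data.csv): a crosstab of Country vs Purchased rendered with plain matplotlib.
import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv('Data.csv')

# count Yes/No purchases per country and render the crosstab as a heatmap
ct = pd.crosstab(df['Country'], df['Purchased'])
plt.imshow(ct, cmap='Blues')
plt.xticks(range(len(ct.columns)), ct.columns)
plt.yticks(range(len(ct.index)), ct.index)
plt.colorbar(label='count')
plt.title('Country vs Purchased')
plt.show()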

SPSS Calculate percentiles with weighted average

My background is in databases and SQL coding. I've used the CTABLES feature in SPSS a little, mostly for calculating percentiles, which is slow in SQL. But now I have a data set where I need to calculate weighted percentiles and a weighted average, which is not as straightforward, and I can't figure out whether it's possible in SPSS or not.
I have data similar to the following
Country  Region   District   Units  Cost per Unit
USA      Central  DivisionQ  10     3
USA      Central  DivisionQ  12     2.5
USA      Central  DivisionQ  25     1.5
USA      Central  DivisionQ  6      4
USA      Central  DivisionA  3      3.25
USA      Central  DivisionA  76     1.75
USA      Central  DivisionA  42     1.5
USA      Central  DivisionA  1      8
USA      Eastern  DivisionQ  14     3
USA      Eastern  DivisionQ  25     2.5
USA      Eastern  DivisionQ  75     1.5
USA      Eastern  DivisionQ  9      4
USA      Eastern  DivisionA  100    3.25
USA      Eastern  DivisionA  4      1.75
USA      Eastern  DivisionA  33     1.5
USA      Eastern  DivisionA  17     8
Totals:                      452    51
For every possible segmentation (Country, Country-Region, Country-Region-District, Country-District etc.)
I want to get the average Cost per Unit, i.e. Cost per Unit weighted by Units, which is SUM(Units*CostPerUnit)/SUM(Units).
And I need to get the 10th, 25th, 50th, 75th, 90th percentiles for each possible segmentation.
The way I do this part in SQL is to extract all the rows in the segment, sort and rank them by Cost Per Unit, keep a running sum of Units for each row, and take the ratio of that running sum to the total units; that percentage determines which row holds the Cost Per Unit for each percentile. An example, for Country = USA and District = DivisionQ:
Country  Region   District   Units  Cost per Unit  Running Units  Running / Total Units  Percentile
USA      Central  DivisionQ  25     1.5            25             0.14                   10th
USA      Eastern  DivisionQ  75     1.5            100            0.56                   25th/50th
USA      Central  DivisionQ  12     2.5            112            0.63
USA      Eastern  DivisionQ  25     2.5            137            0.77                   75th
USA      Central  DivisionQ  10     3              147            0.83
USA      Eastern  DivisionQ  14     3              161            0.91                   90th
USA      Central  DivisionQ  6      4              167            0.94
USA      Eastern  DivisionQ  9      4              176            1.00
This takes a very long time to do for each segment. Is it possible to leverage SPSS to do the same thing more easily?
Use SPLIT FILE (Data > Split File) to define the groups and then use FREQUENCIES (Analyze > Descriptive Statistics > Frequencies) to calculate the statistics. Suppress the actual frequency tables with /FORMAT=NOTABLE.
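A minimal syntax sketch of that approach (my addition; the variable names Units and CostPerUnit are assumptions). Turning on WEIGHT BY Units first makes both the mean and the percentiles unit-weighted, which mirrors the running-sum logic in the question:
* Weight cases by Units so the statistics are unit-weighted.
WEIGHT BY Units.
SORT CASES BY Country Region District.
SPLIT FILE LAYERED BY Country Region District.
FREQUENCIES VARIABLES=CostPerUnit
  /FORMAT=NOTABLE
  /STATISTICS=MEAN
  /PERCENTILES=10 25 50 75 90.
SPLIT FILE OFF.
WEIGHT OFF.
Each segmentation level (Country alone, Country-Region, and so on) needs its own SPLIT FILE run.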
