fsds_100719.ds package

A shared collection of tools for general use.

fsds_100719.ds.add_dir_to_path(abs_path=None, rel_path=None, verbose=True)[source]

Adds the provided path (or current directory if None provided) to sys.path.

Args:
abs_path (str): absolute folder path to add to sys.path.
rel_path (str): relative folder path to be converted to absolute and added.
verbose (bool): Controls display of success/failure messages. Default=True.
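A minimal sketch of how a path helper like this can work, assuming it falls back to the current directory when no path is given and skips paths already present (the fallback, the duplicate check, and the return value are assumptions, not stated by the docstring):

```python
import os
import sys

def add_dir_to_path(abs_path=None, rel_path=None, verbose=True):
    """Sketch: append a folder to sys.path (current directory if no path given)."""
    if rel_path is not None:
        abs_path = os.path.abspath(rel_path)   # convert relative -> absolute
    if abs_path is None:
        abs_path = os.path.abspath(os.curdir)  # assumed fallback: current dir
    if abs_path not in sys.path:
        sys.path.append(abs_path)
        if verbose:
            print(f"Added {abs_path} to sys.path")
    elif verbose:
        print(f"{abs_path} is already on sys.path")
    return abs_path

added = add_dir_to_path(rel_path=".", verbose=False)
```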
fsds_100719.ds.arr2series(array, series_index=None, series_name='array')[source]

Converts an array into a named series.

Args:

array (numpy array): Array to transform.
series_index (list, optional): List of values to be used as index. Defaults to None (numerical index).
series_name (str, optional): Name for series. Defaults to 'array'.

Returns:
converted_array: Pandas Series with the name and index specified.
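The conversion described above amounts to a thin wrapper around pd.Series; a hedged sketch (defaults taken from the signature, internals assumed):

```python
import numpy as np
import pandas as pd

def arr2series(array, series_index=None, series_name="array"):
    """Sketch: wrap a NumPy array in a named pandas Series."""
    if series_index is None:
        series_index = range(len(array))  # numerical index by default
    return pd.Series(array, index=series_index, name=series_name)

prices = arr2series(np.array([1.0, 2.5, 3.0]), series_name="price")
```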
fsds_100719.ds.capture_text(txt)[source]

Uses StringIO and sys.stdout to capture print statements.

Args:
txt (str): pass string or command to display a string to capture
Returns:
txt_out (str): captured print statement
fsds_100719.ds.check_column(panda_obj, columns=None, nlargest='all')[source]

Prints the column name, dtype, # and % of null values, and value counts for the nlargest number of unique values. If passed a list of columns, it will only print results for those columns.

Params:
panda_obj: pandas DataFrame or Series
columns: list containing names of columns (strings)
nlargest: number of top value counts to show, or 'all' (default)

Returns: None
prints values only
fsds_100719.ds.check_df_for_columns(df, columns=None)[source]

Checks df for presence of columns.

df: pd.DataFrame to find columns in
columns: str or list of str. Column names to check for.

fsds_100719.ds.check_null(df, columns=None, show_df=False)[source]

Iterates through columns, checks for null values, and displays the # and % of nulls per column.

Params:
df: pandas DataFrame
columns: list of columns to check

Returns: displayed dataframe

fsds_100719.ds.check_numeric(df, columns=None, unique_check=False, return_list=False, show_df=False)[source]

Iterates through columns and checks for possible numeric features labeled as objects.

Params:
df: pandas DataFrame
unique_check: bool (default=False)
If True, displays an interactive interface for checking unique values in columns.
return_list: bool (default=False)
If True, returns a list of column names with possible numeric types.

Returns: dataframe displayed (always); list of column names if return_list=True

fsds_100719.ds.check_unique(df, columns=None)[source]

Prints unique values for all columns in the dataframe. If passed a list of columns, it will only print results for those columns.

Params:
df: pandas DataFrame or pd.Series
columns: list containing names of columns (strings)

Returns: None
prints values only
fsds_100719.ds.column_report(df, index_col=None, sort_column='iloc', ascending=True, interactive=False, return_df=False)[source]

Displays a DataFrame summary of each column's: name, iloc, dtype, null value count & %, # of 0's, min, max, med, mean, etc.

Args:
df (DataFrame): df to report.
index_col (str, optional): Column to set as index. Defaults to None.
sort_column (str, optional): [description]. Defaults to 'iloc'.
ascending (bool, optional): [description]. Defaults to True.
interactive (bool, optional): [description]. Defaults to False.
return_df (bool, optional): [description]. Defaults to False.
Returns:
column_report (df): Non-styled version of displayed df report
fsds_100719.ds.column_report_qgrid(df, index_col=None, sort_column='iloc', ascending=True, format_dict=None, as_df=False, as_interactive_df=False, show_and_return=True, as_qgrid=True, qgrid_options=None, qgrid_column_options=None, qgrid_col_defs=None, qgrid_callback=None)[source]

Returns a summary dataframe of the columns (column name and dtypes) and a decision_map dictionary of datatypes. [!] Please note: if qgrid does not display properly, enter the following into your terminal and restart it.

jupyter nbextension enable --py --sys-prefix qgrid  # required for qgrid
jupyter nbextension enable --py --sys-prefix widgetsnbextension  # only required if you have not enabled the ipywidgets nbextension yet
Default qgrid options:
default_grid_options = {
    # SlickGrid options
    'fullWidthRows': True,
    'syncColumnCellResize': True,
    'forceFitColumns': True,
    'defaultColumnWidth': 50,
    'rowHeight': 25,
    'enableColumnReorder': True,
    'enableTextSelectionOnCells': True,
    'editable': True,
    'autoEdit': False,
    'explicitInitialization': True,
    # Qgrid options
    'maxVisibleRows': 30,
    'minVisibleRows': 8,
    'sortable': True,
    'filterable': True,
    'highlightSelectedCell': True,
    'highlightSelectedRow': True,
}

fsds_100719.ds.compare_duplicates(df1, df2, to_drop=True, verbose=True, return_names_list=False)[source]

Compares two dfs for duplicate columns and drops them if to_drop=True. Useful before concatenating when dtypes differ between matching column names and df.drop_duplicates is not an option.

Params:
df1, df2: pandas dataframes suspected of having matching columns
to_drop: bool (default=True)
If True, gives the option of dropping columns one at a time from either dataframe.
verbose: bool (default=True)
If True, prints column names and types; set verbose=False with return_names_list=True if you only want a list of column names and no interactive interface.
return_names_list: bool (default=False)
If True, returns a list of all duplicate column names.

Returns: List of column names if return_names_list=True, else nothing.

fsds_100719.ds.display_side_by_side(*args)[source]

Display all input dataframes side by side. Also accepts captioned Styler objects (df_in = df.style.set_caption('caption')). Modified from source: https://stackoverflow.com/questions/38783027/jupyter-notebook-display-two-pandas-tables-side-by-side

fsds_100719.ds.find_outliers_zscore(col)[source]

Uses scipy to calculate absolute Z-scores and returns a boolean series where True indicates an outlier.

Args:

col (Series): a series/column from your DataFrame
Returns:
idx_outliers (Series): series of True/False for each row in col

Ex:
>> idx_outs = find_outliers_zscore(df['bedrooms'])
>> df_clean = df.loc[idx_outs==False]
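A plausible implementation of the z-score rule, assuming the conventional |z| > 3 cutoff (the threshold is an assumption; the docstring does not state it):

```python
import numpy as np
import pandas as pd
from scipy import stats

def find_outliers_zscore(col):
    """Sketch: True where the absolute z-score exceeds 3 (assumed cutoff)."""
    z = np.abs(stats.zscore(col))
    return pd.Series(z > 3, index=col.index, name=col.name)

# hypothetical data: 18 ordinary values plus one extreme point
s = pd.Series([1, 2, 3] * 6 + [100.0])
idx_outs = find_outliers_zscore(s)
df_clean = s.loc[idx_outs == False]
```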

fsds_100719.ds.get_source_code_markdown(function)[source]

Retrieves the source code as a string and appends the markdown python syntax notation

fsds_100719.ds.ihelp(function_or_mod, show_help=True, show_code=True, return_code=False, markdown=True, file_location=False)[source]

Call on any module or function to display the object's help printout AND/OR its source code rendered as Markdown with Python syntax highlighting.

fsds_100719.ds.ihelp_menu(function_list, box_style='warning', to_embed=False)[source]

Creates a widget menu of the source code and help documentation of the functions in function_list.

Args:
function_list (list): list of function objects or string names of loaded functions.
to_embed (bool, optional): Returns interface (layout, output) if True. Defaults to False.
to_file (bool, optional): Save. Defaults to False.
json_file (str, optional): [description]. Defaults to 'ihelp_output.txt'.
Returns:
full_layout (ipywidgets GridBox): Layout of interface. output ()
fsds_100719.ds.inspect_df(df, n_rows=3, verbose=True)[source]

EDA: Show all pandas inspection tables. Displays df.head(), df.info(), and df.describe(). By default also runs check_null and check_numeric to inspect columns for null values and to check string columns for numeric values (if verbose==True).

Parameters:

df (dataframe):
dataframe to inspect
n_rows:
number of header rows to show (Default=3).
verbose:
If verbose==True (default), also runs check_null and check_numeric.

Ex: inspect_df(df,n_rows=4)

fsds_100719.ds.inspect_variables(local_vars=None, sort_col='size', exclude_funcs_mods=True, top_n=10, return_df=False, always_display=True, show_how_to_delete=False, print_names=False)[source]

Displays a dataframe of all variables and their size in memory, with the largest variables at the top.

Args:
local_vars (locals()): Must call locals() as first argument.
sort_col (str, optional): column to sort by. Defaults to 'size'.
top_n (int, optional): how many vars to show. Defaults to 10.
return_df (bool, optional): If True, return df instead of just showing df. Defaults to False.
always_display (bool, optional): Display df even if returned. Defaults to True.
show_how_to_delete (bool, optional): Prints out code to copy-paste into a cell to del vars. Defaults to False.
print_names (bool, optional): [description]. Defaults to False.
Raises:
Exception: if locals() not passed as first arg

Example Usage:
# Must pass in local variables
>> inspect_variables(locals())
# To see the command to delete a list of vars:
>> inspect_variables(locals(), show_how_to_delete=True)

fsds_100719.ds.is_var(name)[source]
fsds_100719.ds.list2df(list, index_col=None, caption=None, return_df=True, df_kwds={})[source]

Quickly turns an appended list with a header (row[0]) into a pretty dataframe.

Args
list (list of lists):
index_col (string): name of column to set as index; None (Default) gives an integer index.
caption (string):
return_df (bool):

EXAMPLE USE:
>> list_results = [["Test", "N", "p-val"]]

# ... run test and append list of result values ...

>> list_results.append([test_name, len(data), p])

## Displays styled dataframe if caption is given:
>> df = list2df(list_results, index_col="Test", caption="Stat Test for Significance")
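The example above can be reproduced with a small sketch of what list2df likely does (internals assumed; caption/styling via df.style.set_caption is omitted here for brevity):

```python
import pandas as pd

def list2df(results, index_col=None):
    """Sketch: row 0 is the header; remaining rows become the data."""
    df = pd.DataFrame(results[1:], columns=results[0])
    if index_col is not None:
        df = df.set_index(index_col)
    return df

list_results = [["Test", "N", "p-val"]]
list_results.append(["Welch's t", 50, 0.021])
df = list2df(list_results, index_col="Test")
```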
fsds_100719.ds.reload(mod)[source]
Reloads the module from file without restarting kernel.
Args:
mod (loaded module or list of module objects): handle(s) of the package(s) to reload (e.g., [pd, fs, np])
Returns:
None; each module is reloaded in place.

Example:
# You pass in whatever name you imported as.
import my_functions_from_file as mf
# after editing the source file:
mf.reload(mf)
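Under the hood this is presumably importlib.reload applied to each module; a hedged sketch using the stdlib string module as a stand-in for a user module:

```python
import importlib
import string  # stand-in for any imported module

def reload(mod):
    """Sketch: reload one module or a list of modules without restarting the kernel."""
    mods = mod if isinstance(mod, (list, tuple)) else [mod]
    return [importlib.reload(m) for m in mods]

reloaded = reload(string)
```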

fsds_100719.ds.save_ihelp_to_file(function, save_help=False, save_code=True, as_md=False, as_txt=True, folder='readme_resources/ihelp_outputs/', filename=None, file_mode='w')[source]

Saves the string representation of the ihelp source code as markdown. Filename should NOT have an extension. .txt or .md will be added based on as_md/as_txt. If filename is None, function name is used.

fsds_100719.ds.show_del_me_code(called_by_inspect_vars=False)[source]

Prints code to copy and paste into a cell to delete vars using a list of their names. The companion function inspect_variables(locals(), print_names=True) will provide var names to copy/paste.

fsds_100719.ds.show_off_vs_code()[source]

Submodules

fsds_100719.ds.flatiron_stats module

fsds_100719.ds.flatiron_stats.Cohen_d(group1, group2, correction=False)[source]

Compute Cohen's d: d = (group1.mean() - group2.mean()) / sqrt(pooled_variance), where pooled_variance = (n1 * var1 + n2 * var2) / (n1 + n2).

Args:

group1 (Series or NumPy array): group 1 for calculating d group2 (Series or NumPy array): group 2 for calculating d correction (bool): Apply equation correction if N<50. Default is False.

Returns:
d (float): calculated d value

INTERPRETATION OF COHEN'S d:
> Small effect = 0.2
> Medium effect = 0.5
> Large effect = 0.8
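The formula above can be written directly in NumPy; a sketch without the N<50 correction (using sample variances with ddof=1 is an assumption):

```python
import numpy as np

def cohen_d(group1, group2):
    """Sketch: Cohen's d using the pooled variance from the docstring."""
    g1, g2 = np.asarray(group1, float), np.asarray(group2, float)
    n1, n2 = len(g1), len(g2)
    var1, var2 = g1.var(ddof=1), g2.var(ddof=1)  # sample variances (assumed)
    pooled_variance = (n1 * var1 + n2 * var2) / (n1 + n2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_variance)

d = cohen_d([2, 4, 6, 8], [1, 3, 5, 7])
```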

fsds_100719.ds.flatiron_stats.evaluate_PDF(rv, x=4)[source]

Input: a random variable object and a standard deviation. Output: x and y values for the normal distribution.

fsds_100719.ds.flatiron_stats.find_outliers(df, col=None, report=True)[source]
Uses Tukey's Interquartile Range Method to find outliers.
  • threshold = 1.5 * IQR
    • Lower threshold = [25% quartile] - threshold
    • Upper threshold = [75% quartile] + threshold
  • Outliers are below the lower or above the upper threshold.

Returns a series of T/F for each row for slicing outliers: df[idx_out]

EXAMPLE USE:
>> idx_outs = find_outliers(df, col='AdjustedCompensation')
>> good_data = df[~idx_outs].copy()
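The bullet list above maps directly onto pandas quantiles; a hedged sketch of the Tukey rule (operating on a Series for simplicity):

```python
import pandas as pd

def find_outliers(ser):
    """Sketch: Tukey's IQR rule; True = outside 1.5*IQR beyond the quartiles."""
    q1, q3 = ser.quantile([0.25, 0.75])
    threshold = 1.5 * (q3 - q1)
    return (ser < q1 - threshold) | (ser > q3 + threshold)

# hypothetical data with one extreme value
data = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 95])
idx_outs = find_outliers(data)
good_data = data[~idx_outs].copy()
```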

fsds_100719.ds.flatiron_stats.overlap_superiority(group1, group2, n=1000)[source]

Estimates overlap and superiority based on a sample.

group1: scipy.stats rv object
group2: scipy.stats rv object
n: sample size

fsds_100719.ds.flatiron_stats.p_value_welch_ttest(a, b, two_sided=False)[source]

Calculates the p-value for Welch’s t-test given two samples. By default, the returned p-value is for a one-sided t-test. Set the two-sided parameter to True if you wish to perform a two-sided t-test instead.
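The halving behavior described above can be sketched with scipy, whose equal_var=False option gives Welch's test (the internals of the package's version are assumed):

```python
import numpy as np
from scipy import stats

def p_value_welch_ttest(a, b, two_sided=False):
    """Sketch: Welch's t-test p-value, naively halved for the one-sided default."""
    t, p = stats.ttest_ind(a, b, equal_var=False)  # equal_var=False -> Welch
    return p if two_sided else p / 2

# hypothetical samples with clearly different means
rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, 200)
b = rng.normal(2.0, 2.0, 200)
p_one = p_value_welch_ttest(a, b)
p_two = p_value_welch_ttest(a, b, two_sided=True)
```

Note that the naive halving assumes the observed difference lies in the hypothesized direction.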

fsds_100719.ds.flatiron_stats.plot_pdfs(cohen_d=2)[source]

Plot PDFs for distributions that differ by some number of stds.

cohen_d: number of standard deviations between the means

fsds_100719.ds.flatiron_stats.welch_df(a, b)[source]
fsds_100719.ds.flatiron_stats.welch_t(a, b)[source]

fsds_100719.ds.regression_project module

A Collection of functions from ft study group for section 25.

fsds_100719.ds.regression_project.diagnose_model(model)[source]

Displays the QQplot and residuals of the model. Args:

model (statsmodels ols): A fit statsmodels ols model.
Returns:
fig (Figure): Figure object for output figure.
ax (list): List of axes for subplots.
fsds_100719.ds.regression_project.find_outliers_IQR(df, col)[source]

Uses Tukey's Method of outlier removal, AKA the InterQuartile-Range Rule, and returns a boolean series where True indicates an outlier. - Calculates the range between the 75% and 25% quartiles. - Outliers fall outside the upper and lower limits, using a threshold of 1.5*IQR beyond the 75% and 25% quartiles.

IQR Range Calculation:
res = df[col].describe()
IQR = res['75%'] - res['25%']
lower_limit = res['25%'] - 1.5*IQR
upper_limit = res['75%'] + 1.5*IQR
Args:
df ([type]): [description] col ([type]): [description]
Returns:
[type]: [description]
fsds_100719.ds.regression_project.find_outliers_Z(df, col)[source]

Uses scipy to calculate absolute Z-scores and returns a boolean series where True indicates an outlier

Args:
df (Frame): DataFrame containing column to analyze
col (str): Name of column to test.
Returns:
idx_outliers (Series): series of True/False for each row in col

Ex:
>> idx_outs = find_outliers_Z(df, 'bedrooms')
>> df_clean = df.loc[idx_outs==False]

fsds_100719.ds.regression_project.make_ols_f(df, target='price', cat_cols=[], col_list=None, show_summary=True, exclude_cols=[])[source]

Uses the formula api of Statsmodels for ordinary least squares regression.

Args:

df (Frame): data
target (str, optional): Column to predict. Defaults to 'price'.
cat_cols (list, optional): Columns to treat as categorical (and one-hot).
col_list (list, optional): List of columns to use. Defaults to all columns besides exclude_cols.
show_summary (bool, optional): Display model.summary() before returning the model. Defaults to True.
exclude_cols (list, optional): List of column names to exclude. Defaults to [].

  • Note: if a column name doesn't appear in the dataframe, there will be no error nor warning message.
Returns:
model: The fit statsmodels OLS model
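A hedged sketch of a formula-API call like the one described, on hypothetical data (column names and values are invented for illustration; C() is how the statsmodels formula API marks a column as categorical for one-hot encoding):

```python
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data (names and values invented for illustration)
df = pd.DataFrame({
    "price": [200, 250, 300, 320, 400, 410],
    "sqft":  [1000, 1200, 1500, 1600, 2000, 2050],
    "zone":  ["a", "a", "b", "b", "c", "c"],
})
# C(zone) one-hot encodes the categorical column inside the formula
model = smf.ols("price ~ sqft + C(zone)", data=df).fit()
r2 = model.rsquared
```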
fsds_100719.ds.regression_project.vif_ols(df, exclude_col=None, cat_cols=[])[source]

Performs variance inflation factor analysis on all columns in the dataframe to identify multicollinear data. The target column (indicated by the exclude_col parameter) is left out of the VIF calculations.

Args:
df (Frame): data
exclude_col (str): Column to exclude from the OLS model (for VIF calculations).
cat_cols (list, optional): List of columns to treat as categories for make_ols_f
Returns:
res (Frame): DataFrame with results of VIF modeling (VIF and R2 score for each feature)
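VIF itself is simple to state: regress each feature on all the others and take VIF_j = 1 / (1 - R2_j). A self-contained sketch of that formula using plain NumPy least squares (the package's version fits OLS models instead; this only illustrates the calculation):

```python
import numpy as np
import pandas as pd

def vif(df):
    """Sketch: VIF_j = 1 / (1 - R2_j), regressing column j on the rest."""
    X = df.to_numpy(dtype=float)
    out = {}
    for j, name in enumerate(df.columns):
        y = X[:, j]
        # design matrix: intercept plus all other columns
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out[name] = 1.0 / (1.0 - r2)
    return pd.Series(out, name="VIF")

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)                  # independent
vifs = vif(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
```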

fsds_100719.ds.tsa module

fsds_100719.ds.tsa.calc_bollinger_bands(ts, window=20, col=None)[source]

Calculates Bollinger Bands for time series. If ts is a dataframe, col specifies data. Normally used for financial/stock market data and uses 20 days for rolling calculations.
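Bollinger Bands are a rolling mean with bands at plus/minus two rolling standard deviations; a hedged sketch (the 2-sigma width and the output column names are conventional assumptions):

```python
import numpy as np
import pandas as pd

def calc_bollinger_bands(ts, window=20):
    """Sketch: rolling mean with bands at +/- 2 rolling standard deviations."""
    mid = ts.rolling(window).mean()
    std = ts.rolling(window).std()
    return pd.DataFrame({"lower": mid - 2 * std, "mid": mid, "upper": mid + 2 * std})

prices = pd.Series(np.linspace(100, 120, 60))  # hypothetical price series
bands = calc_bollinger_bands(prices, window=20)
```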

fsds_100719.ds.tsa.calc_bollinger_bands_plot(ts, window=20, col=None, figsize=(10, 6), set_kws={'title': 'Bollinger Bands', 'ylabel': 'House Price ($)'})[source]

Calculates Bollinger Bands for time series. If ts is a dataframe, col specifies data. Normally used for financial/stock market data and uses 20 days for rolling calculations.

fsds_100719.ds.tsa.stationarity_check(TS, plot=True, col=None, rollwindow=8)[source]

Performs the Augmented Dickey-Fuller unit root test on a time series.

  • The null hypothesis of the Augmented Dickey-Fuller test is that there is a unit root, with the alternative that there is no unit root.
    • A unit root (also called a unit root process or a difference-stationary process) is a stochastic trend in a time series, sometimes called a "random walk with drift".
    • If a time series has a unit root, it shows a systematic pattern that is unpredictable and non-stationary.

From: https://learn.co/tracks/data-science-career-v2/module-4-a-complete-data-science-project-using-multiple-regression/working-with-time-series-data/time-series-decomposition