GSoC/2019/StatusReports/DevanshuAgarwal

Project Overview

Project Name: Statistical Analysis in LabPlot
Purpose: Adding statistically relevant features to LabPlot.

List of Added Features

I have added the following features for the first evaluation:

  • TTest: Two-Sample Independent, Two-Sample Paired, One-Sample.
  • ZTest: Two-Sample Independent, Two-Sample Paired, One-Sample.
  • ANOVA: One Way ANOVA.
  • Levene Test: Checks the assumption of homogeneity of variance between populations.

Hypothesis Dock for the Above-Listed Features

I have added a dock for the hypothesis test, where the user can select the type of test to perform and then gets a list of options based on the chosen test.
Common options, depending on the selected test, are:

  • Selecting the columns on which to perform the test. The column names are shown in combo-boxes, which are populated according to the type of values present in the columns, so that the user cannot run a test on unsuitable columns. For example, if the user selects "Two Sample Independent T-Test", the first combo-box (for choosing the independent variable) is populated with columns that contain only numerical values or exactly two categorical values, and the second combo-box is populated with columns that contain only numerical values. Sometimes a column has numerical values that actually represent classes, for example 0 and 1; in such cases the user can tick the "Categorical Variable" checkbox.
  • The user can change the significance level (α) in the box provided in the dock. The default value is 0.05.
  • The user can change the population mean for the One Sample TTest and ZTest. The default value is 0.
  • The user can perform Levene's Test to check for homogeneity of variance between populations. This test is currently used to check the assumption of the "Two Sample Independent TTest" and the "One Way ANOVA Test". It is run by clicking the "Levene's Test" push button.
  • The user can select the alternate hypothesis, and the null hypothesis options change correspondingly. Here the user chooses between a Two-Tail Test and a One-Tail Test (positive tail or negative tail). By default, two-tail is selected.
  • The "Do Test" push button is enabled only when all mandatory options are selected. For example, if the user wants to perform a One Way ANOVA but there are no columns with two or more categorical values, the "Do Test" button is disabled. This prevents the program from crashing.
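The column filtering and the "Do Test" enabling rule described above can be sketched roughly as follows. This is a minimal illustration in standard C++, not the LabPlot implementation: the `Column` struct and all function names are hypothetical, and cells are modelled as plain strings instead of LabPlot's column types.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a spreadsheet column: a name plus raw cell text.
struct Column {
    std::string name;
    std::vector<std::string> cells;
};

// True if the whole cell parses as a floating-point number.
bool isNumber(const std::string& cell) {
    if (cell.empty()) return false;
    std::size_t pos = 0;
    try {
        std::stod(cell, &pos);
    } catch (...) {
        return false;  // not a number, or out of range
    }
    return pos == cell.size();
}

// True if the column is non-empty and every cell is numeric.
bool isNumericColumn(const Column& column) {
    if (column.cells.empty()) return false;
    for (const auto& cell : column.cells)
        if (!isNumber(cell)) return false;
    return true;
}

// Number of distinct cell values, i.e. the number of classes.
std::size_t distinctValues(const Column& column) {
    return std::set<std::string>(column.cells.begin(), column.cells.end()).size();
}

// First combo-box: only numerical values, or exactly two categorical values.
bool eligibleAsIndependent(const Column& column) {
    return isNumericColumn(column) || distinctValues(column) == 2;
}

// Second combo-box: only numerical values.
bool eligibleAsDependent(const Column& column) {
    return isNumericColumn(column);
}

// "Do Test" is enabled only if two different columns can fill both combo-boxes.
bool canRunTwoSampleTTest(const std::vector<Column>& columns) {
    for (std::size_t i = 0; i < columns.size(); ++i)
        for (std::size_t j = 0; j < columns.size(); ++j)
            if (i != j && eligibleAsIndependent(columns[i]) && eligibleAsDependent(columns[j]))
                return true;
    return false;
}
```

With a column of classes "a"/"b" and a column of numbers, `canRunTwoSampleTTest` returns true; if the second column holds free text instead, it returns false and the button would stay disabled.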

Summary and Results in Hypothesis Test View

A window named "Hypothesis Test for Spreadsheet" is opened when the user selects the "hypothesis test" option. This window shows the result and summary table and is divided into three sections.

  1. First Section: It displays the type of test (title) being performed.
  2. Second Section: It displays summary statistics in a table, where the common columns are mean, sum, number of values and standard deviation.
  3. Third Section: It displays the final result of the test. Commonly it displays the t-value / f-value, p-value, Null Hypothesis, Alternate Hypothesis, Significance Level and Degrees of Freedom. This section also shows a tip explaining what a value means when the user hovers the mouse over it. Currently, it explains the p-value: based on the p-value and the significance level (α), it says whether the null hypothesis can be rejected or whether it remains plausible that the null hypothesis is true.
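The p-value tip boils down to a comparison between the p-value and α. A minimal sketch is shown here; the function name `pValueHint`, the wording of the messages and the choice of rejecting at p ≤ α (rather than p < α) are assumptions made for illustration, not taken from the LabPlot source.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: turns a p-value and a significance level into a hint text.
// Assumption: the null hypothesis is rejected when pValue <= alpha.
std::string pValueHint(double pValue, double alpha = 0.05) {
    assert(alpha > 0.0 && alpha < 1.0);
    if (pValue <= alpha)
        return "p-value <= alpha: the null hypothesis can be rejected";
    return "p-value > alpha: the null hypothesis remains plausible";
}
```

For example, with the default α of 0.05, a p-value of 0.01 yields the "can be rejected" hint and a p-value of 0.2 yields the "remains plausible" hint.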

The text in this window is formatted (using HTML) for better user readability.
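As an illustration of the HTML formatting, a table can be generated from a flat, row-major list of cells. The sketch below uses `std::string` instead of QString so that it is self-contained; the function name `htmlTable` and the exact markup are assumptions, and no HTML escaping is performed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds an HTML table from row-major cells; the first row is treated as the header.
// Hypothetical sketch: cell text is inserted as-is, without escaping.
std::string htmlTable(std::size_t rows, std::size_t cols, const std::vector<std::string>& cells) {
    assert(cells.size() == rows * cols);
    std::string html = "<table>";
    for (std::size_t r = 0; r < rows; ++r) {
        const std::string tag = (r == 0) ? "th" : "td";
        html += "<tr>";
        for (std::size_t c = 0; c < cols; ++c)
            html += "<" + tag + ">" + cells[r * cols + c] + "</" + tag + ">";
        html += "</tr>";
    }
    html += "</table>";
    return html;
}
```

Calling `htmlTable(2, 2, {"Mean", "Sum", "2.5", "10"})` produces one header row followed by one data row.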

Backend Programming and Source Code

The backend is written with the expectation that these tests will be run on large data sets; hence, the computation time for each test is O(n), where n is the maximum number of rows among the selected columns.
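One way to obtain the summary statistics in a single O(n) pass is Welford's running update, sketched below together with a one-sample t statistic. This is an illustrative technique chosen for the example and not necessarily what the LabPlot backend does; `Stats`, `computeStats` and `oneSampleT` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Running summary of a column, filled in one pass over the data.
struct Stats {
    std::size_t count = 0;
    double sum = 0.0;
    double mean = 0.0;
    double m2 = 0.0;  // sum of squared deviations from the running mean

    // Sample variance (divides by count - 1).
    double variance() const { return count > 1 ? m2 / static_cast<double>(count - 1) : 0.0; }
    double stdDev() const { return std::sqrt(variance()); }
};

// Welford's algorithm: each value is visited exactly once.
Stats computeStats(const std::vector<double>& values) {
    Stats s;
    for (double x : values) {
        ++s.count;
        s.sum += x;
        const double delta = x - s.mean;
        s.mean += delta / static_cast<double>(s.count);
        s.m2 += delta * (x - s.mean);
    }
    return s;
}

// One-sample t statistic: (mean - populationMean) / (stdDev / sqrt(n)).
double oneSampleT(const Stats& s, double populationMean) {
    assert(s.count > 1 && s.variance() > 0.0);
    return (s.mean - populationMean) / (s.stdDev() / std::sqrt(static_cast<double>(s.count)));
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 this gives a sum of 40 and a mean of 5, so the t statistic against a population mean of 5 is 0.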

In the third section of the results and summary view, I have used Qt's tooltip feature to give hints about the results. The problem is that the tooltip is set for the whole widget and not for a specific point within the widget, so multiple tooltips cannot be placed in the same widget. There are two possible solutions:

  • Subclass the widget and change its behaviour.
  • Use multiple widgets (this is the solution I have used). I have created an array of QLabels (I have not used QLineEdit as it is not HTML-aware), and each result line is a separate QLabel.

I have added some private helper functions to the "HypothesisTest.cpp" file. These functions help avoid mistakes and increase the reusability of the code. They are:

  • findStats: Given a pointer to a column, it returns common statistics such as mean, sum, number of values and standard deviation.
  • countParitions: It returns the number of classes in the given independent variable column.
  • findStatsCategorical: It also returns common statistics, but takes one independent variable (containing categorical classes) and one dependent variable. It was created to ensure O(n) complexity.
  • getPValue: Given the test_type (T-Test or F-Test), the tail_type and the corresponding t-value or F-value, it returns the p-value. It also prints the Alternate and Null Hypothesis in the results.
  • getHtmlTable: It takes the number of rows, the number of columns and a list containing the data and header values in row-major order, and returns the corresponding HTML table (as a QString). It was also created to keep the summary and results view uniform.
  • getLine: It takes a line colour and a message as arguments and returns a QString containing the formatted HTML line. The default colour is "black" (if the caller does not pass a colour value).
  • printLine: It prints the message in the given colour (default: "black", if the caller does not pass a colour value) at the given index.
  • printError: It prints the error message.
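To illustrate the idea behind a helper like findStatsCategorical, the sketch below collects per-class sums in a single pass over the independent and dependent columns and then derives the One Way ANOVA F statistic from them. It is a simplified example under stated assumptions: the names `GroupSums`, `groupSums` and `oneWayAnova` are hypothetical, `std::map` makes the pass O(n log k) for k classes rather than strictly O(n), and the sum-of-squares shortcut is numerically less robust than a running-mean update.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Per-class accumulators gathered in one pass.
struct GroupSums {
    std::size_t n = 0;
    double sum = 0.0;
    double sumSq = 0.0;
};

// Groups the dependent values by the class label in the same row.
std::map<std::string, GroupSums> groupSums(const std::vector<std::string>& classes,
                                           const std::vector<double>& values) {
    assert(classes.size() == values.size());
    std::map<std::string, GroupSums> groups;
    for (std::size_t i = 0; i < values.size(); ++i) {
        GroupSums& g = groups[classes[i]];
        ++g.n;
        g.sum += values[i];
        g.sumSq += values[i] * values[i];
    }
    return groups;
}

struct AnovaResult {
    double f;
    std::size_t dfBetween;
    std::size_t dfWithin;
};

// One-way ANOVA: F = (SS_between / (k - 1)) / (SS_within / (n - k)).
// If all values within every class are identical, SS_within is 0 and F is not finite.
AnovaResult oneWayAnova(const std::map<std::string, GroupSums>& groups) {
    std::size_t n = 0;
    double total = 0.0;
    for (const auto& entry : groups) {
        n += entry.second.n;
        total += entry.second.sum;
    }
    const std::size_t k = groups.size();
    assert(k >= 2 && n > k);

    const double grandMean = total / static_cast<double>(n);
    double ssBetween = 0.0, ssWithin = 0.0;
    for (const auto& entry : groups) {
        const GroupSums& g = entry.second;
        const double mean = g.sum / static_cast<double>(g.n);
        ssBetween += static_cast<double>(g.n) * (mean - grandMean) * (mean - grandMean);
        ssWithin += g.sumSq - static_cast<double>(g.n) * mean * mean;
    }
    const std::size_t dfBetween = k - 1;
    const std::size_t dfWithin = n - k;
    const double f = (ssBetween / static_cast<double>(dfBetween)) /
                     (ssWithin / static_cast<double>(dfWithin));
    return {f, dfBetween, dfWithin};
}
```

For three classes with values (1, 2, 3), (2, 3, 4) and (3, 4, 5), both sums of squares equal 6, the degrees of freedom are 2 and 6, and F is 3.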

TODO

The following features are yet to be implemented.

  • All these tests can currently be performed on a spreadsheet. The same functionality has to be added for database sources so that the user doesn't have to copy the database into LabPlot.
  • Backend for the ZTest. The backend structure is ready, but the computation of the p-value from a given z-value is still missing.
  • Automatic resizing of HypothesisTestDock and adding vertical and horizontal scroll bars (on small screens, most of the dock gets truncated).
  • Adding more tips in the results section.
  • Automatic testing. Currently, I have tested my work manually and verified the results using JASP and an online calculator.
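For the pending ZTest item, the p-value can be obtained from the z-value with the standard normal distribution, which standard C++ exposes through `std::erfc`. The sketch below is one possible approach and not the planned LabPlot code; the `Tail` enum, the function name `zTestPValue`, and the mapping of "positive tail" to the alternate hypothesis "greater than" are assumptions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical tail selection mirroring the dock options.
enum class Tail { Two, Positive, Negative };

// p-value of a z statistic under the standard normal distribution.
// Uses Phi(z) = 0.5 * erfc(-z / sqrt(2)).
double zTestPValue(double z, Tail tail) {
    const double root2 = std::sqrt(2.0);
    switch (tail) {
    case Tail::Positive:  // P(Z >= z)
        return 0.5 * std::erfc(z / root2);
    case Tail::Negative:  // P(Z <= z)
        return 0.5 * std::erfc(-z / root2);
    case Tail::Two:
        break;
    }
    return std::erfc(std::fabs(z) / root2);  // P(|Z| >= |z|)
}
```

As a sanity check, a z-value of 1.96 gives a two-tail p-value of about 0.05, and a z-value of 0 gives a two-tail p-value of 1.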

Commits

Currently, my commits are on the gsoc2019_stats branch. These commits are reviewed on Phabricator by my mentor Stefan Gerlach. The Levene test and One Way ANOVA test are yet to be reviewed on Phabricator. Here is the list of my review requests for ANOVA and Levene Test.

The whole list of my review requests can be found here.

About Me

Screen Shots