GSoC/2019/StatusReports/DevanshuAgarwal: Difference between revisions

From KDE Community Wiki
Line 5: Line 5:
== List of Added Features ==
== List of Added Features ==
I have added the following features for the first evaluation:
I have added the following features for the first evaluation:
* '''TTest:''' Two-Sample Independent, Two Sample Paired, One Sample.
* '''TTest'''
* '''ZTest:''' Two-Sample Independent, Two Sample Paired, One Sample.
**Two-Sample Independent
* '''ANOVA:''' One Way ANOVA.
**Two Sample Paired
**One Sample  
 
* '''ZTest'''
**Two-Sample Independent
**Two Sample Paired
**One Sample
 
* '''ANOVA'''
**One Way ANOVA
**Two Way ANOVA
* '''Levene Test:''' To check for the assumption of homogeneity of variance between populations
* '''Levene Test:''' To check for the assumption of homogeneity of variance between populations
* '''Correlation Coefficient'''
**Pearson's R
**Kendall's Tau
**Spearman Rank
**Chi-Square Test for Independence
== Frontend ==
Since a major part of the project aims to make the user comfortable while finding statistics, special attention was given to the frontend. The two major components of the frontend are Dock Widgets (Hypothesis Dock and Correlation Coefficient Dock) and Test Views (Hypothesis Test View and Correlation Coefficient Test View).


== Hypothesis Dock For the above-listed features. ==
== Dock Widgets ==  
I have added a dock for the hypothesis test, from where the user can select the type of test s/he wants to perform and will then get the list of options based on the test chosen.<br />
Dock Widgets provide an interface for the user to select the options required to perform a test. Two such dock widgets were created during the project (Hypothesis Test Dock and Correlation Coefficient Dock); they appear on the right side of the window according to the test chosen by the user. Some common and important elements of both dock widgets are listed below with a brief description:
Some common list of options based on selected tests are:
* '''Name:''' The name of the dock widget. It can be edited by the user.
* Selecting the columns on which the user wants to perform the test. The column names are shown in combo-boxes, which are populated according to the type of values present in the columns. This ensures that the user does not perform a test on the wrong columns. For example, if the user selects "Two Sample Independent T-Test", the first combo-box (for choosing the independent variable) is populated with columns that contain only numerical values or exactly two categorical values, and the second combo-box is populated with columns that contain only numerical values. Sometimes a column has numerical values that represent classes, for example 0 and 1; the user can tick the "Categorical Variable" checkbox for such cases.
* '''Comment:''' Any comment which the user wants to add for future reference.
* The user can change the significance level (α) from the box provided in the dock. The default value is 0.05.
* '''Data'''
* The user can change the population mean for the One Sample TTest and ZTest. The default value is 0.
** '''Source:''' Data source type: Spreadsheet or Database. Currently, only Spreadsheet is supported.
* User can perform Levene Test to check for homogeneity of variance between populations. This Test is currently used to check for the assumption of "Two Sample Independent TTest" and "One Way ANOVA Test". The Test can be performed by clicking on "Levene's Test" push button.
** '''Spreadsheet:''' This option appears when the "Spreadsheet" is chosen in "Source". It provides the name of the spreadsheet on which the chosen test will be performed. The user can change the spreadsheet chosen.
* The user can select the alternate hypothesis, and the null hypothesis changes correspondingly. Here the user chooses between a Two-Tail Test and a One-Tail Test (Positive Tail or Negative Tail). By default, two-tail is selected.
* '''Test'''
* The "Do Test" pushbutton is enabled only when all the mandatory options are selected. For example, if the user wants to perform One Way ANOVA, but there are no columns with two or more categorical values, the "Do Test" button will be disabled. This prevents the program from crashing.
** '''Type:''' Type of Test the user wants to perform
** '''Sub-type:''' Subtype of the test chosen from "Type" option. It will be shown only when there are subtypes for the test-type chosen.  
** '''Calculate Statistics for Spreadsheet:''' This checkbox is useful when the user does not have access to the whole data but has a summary of it, such as the number of elements, the mean, the standard deviation or a contingency table. Uncheck this checkbox (default: checked) to perform a test on summary statistics or on a contingency table. On unchecking the box, a [[GSoC/2019/StatusReports/DevanshuAgarwal#Statistic_Table|Statistic Table]] will appear. For now, this feature is supported for the Two-Sample Independent Z-Test and the Chi-Square Test for Independence.
** '''Number of Rows:''' This option is shown when "Calculate Statistics from Spreadsheet" checkbox is unchecked and the user has data in the form of the contingency table. The user can change its default value and dynamically the statistic table will change the number of rows in it.
** '''Number of Columns:'''  Similar to "Number of Rows" option for changing the number of columns.
* '''Variable:''' It is visible when "Calculate Statistics from Spreadsheet" is checked.
** '''Independent Var. 1, Independent Var. 2:''' Here the user selects the spreadsheet columns on which the test is to be performed. Each combo-box only shows columns that are valid for the selected test. The labels and the number of these combo-boxes change automatically according to the columns and options chosen. For example, whenever "Independent Var. 1" is intended to contain categorical labels, the label "Independent Var. 2" changes to "Dependent Var. 1".
** '''Independent Var. 1 Categorical:''' This checkbox appears when there is a possibility for column selected in Independent Var. 1 to be categorical such that the "Independent Var. 2" can act as the dependent variable. Check this checkbox for such a case.  
** '''Recalculate:''' The user should press this push-button after selecting all the preferred options from the dock. After clicking on this button Test View Widget will get populated by results and statistics. This push-button is disabled when no column is selected in at least one of  "Independent Vars" combo boxes.
<br>
There are a few more options which are specific to the Test chosen. These are listed below with a very brief description:
* '''Levene's Test:''' The user can perform Levene's test by clicking on it. It is visible for Two-Sample Independent T-Test and One-Way ANOVA. It shows similar behaviour as "Recalculate".  
* '''Equal Variance:''' This checkbox should be checked when the user wants the program to assume homogeneity of variance between populations. This assumption can be checked by performing the Levene Test on the data. Hence it is visible in all the cases where the "Levene's Test" button is visible.
* '''Hypothesis'''
** '''Null:''' The user can see the null hypothesis for T-Test and Z-Test. User can't change it directly but can only do so using the "Alternate" hypothesis option.
** '''Alternate:''' The user can select the alternate hypothesis for T-Test and Z-Test. The changes will be reflected in the "Null" hypothesis dynamically.
** '''μₒ:''' The user can set population mean for One-Sample T-Test and Z-Test. The default value set is 0.
** '''α:''' The user can set significance level for all Hypothesis Tests. The default value set is 0.05.
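To make the "Levene's Test" and "Equal Variance" options above more concrete, here is a minimal sketch of how a Levene statistic can be computed. It is not taken from the LabPlot source: the function name <code>leveneW</code> is hypothetical, and it assumes the mean-based variant of the test (absolute deviations from each group mean). The p-value would then come from an F distribution with (k - 1, N - k) degrees of freedom, which is not computed here.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not LabPlot's API): mean-based Levene statistic W.
// W = ((N - k) / (k - 1)) * sum_i n_i (Zbar_i - Zbar)^2 / sum_i sum_j (Z_ij - Zbar_i)^2
// where Z_ij = |Y_ij - mean of group i|.
double leveneW(const std::vector<std::vector<double>>& groups) {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    const std::size_t k = groups.size();
    if (k < 2) return nan; // at least two groups are needed

    std::vector<std::vector<double>> z(k); // absolute deviations per group
    std::vector<double> zMean(k, 0.0);     // mean absolute deviation per group
    double zGrand = 0.0;                   // mean of all absolute deviations
    std::size_t total = 0;

    for (std::size_t i = 0; i < k; ++i) {
        const std::vector<double>& g = groups[i];
        if (g.empty()) return nan;
        double mean = 0.0;
        for (double y : g) mean += y;
        mean /= static_cast<double>(g.size());
        for (double y : g) {
            const double d = std::fabs(y - mean);
            z[i].push_back(d);
            zMean[i] += d;
            zGrand += d;
        }
        zMean[i] /= static_cast<double>(g.size());
        total += g.size();
    }
    if (total <= k) return nan; // no within-group degrees of freedom
    zGrand /= static_cast<double>(total);

    double between = 0.0, within = 0.0;
    for (std::size_t i = 0; i < k; ++i) {
        const double diff = zMean[i] - zGrand;
        between += static_cast<double>(z[i].size()) * diff * diff;
        for (double d : z[i]) within += (d - zMean[i]) * (d - zMean[i]);
    }
    if (within == 0.0) return nan; // statistic is undefined
    return (static_cast<double>(total - k) / static_cast<double>(k - 1)) * between / within;
}
```

A large W (small p-value) suggests that the variances differ, in which case leaving "Equal Variance" unchecked would be the safer choice.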
<br>
Screenshots for dock widgets:
<gallery>
TwoSampleIndependentTTest dock.png| Hypothesis Test Dock (Test: Two-Sample Independent T-Test)
TwoWayAnova dock.png| Hypothesis Test Dock (Test: Two Way ANOVA)
ChiSquareIndependeceTest dock.png| Correlation Coefficient Dock (Test: Chi-Square Test for Independence)
PearsonR dock.png| Correlation Coefficient Dock (Test: Pearson's R)
</gallery>


== Summary and Results in Hypothesis Test View ==  
== Test View ==
A window by name "Hypothesis Test for Spreadsheet" is opened when the user selects the "hypothesis test" option. This window shows the result and summary table.
''' Content under this section is yet to be added '''
This whole window is divided into three sections.
# First Section: It displays the type of test (title) being performed.
# Second Section: It displays summary statistics in the form of a table, where common columns are mean, sum, number of values, standard deviation.
# Third Section: It displays the final result of the test, commonly the t-value / f-value, p-value, Null Hypothesis, Alternate Hypothesis, Significance Level and Degrees of Freedom. This section also shows a tip explaining a value when the user hovers the mouse over it. Currently, it explains the p-value: based on the p-value and the significance level (α), it says whether the null hypothesis can be rejected or whether it remains plausible.


The text in this window is formatted (using HTML) for better user readability.
<translate><span id="Statistic Table"></span> </translate>


== Backend Programming and Source Code ==  
== Statistic Table ==
Backend Programming is done keeping in mind that these tests will be performed on huge data and hence, the computation time for each test is O(n) (where n is the maximum number of rows among selected columns). <br />
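As an illustration of the O(n) goal stated above, the following sketch computes the count, sum, mean and sample standard deviation of a column in a single pass, using Welford's running-update method. This is an assumption about one way to meet that goal, not the code in "HypothesisTest.cpp"; the names <code>ColumnStats</code> and <code>computeStats</code> are hypothetical, and treating NaN as an empty cell is also an assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary record; not the structure used in LabPlot.
struct ColumnStats {
    std::size_t count = 0;
    double sum = 0.0;
    double mean = 0.0;
    double stdDev = 0.0; // sample standard deviation (n - 1 in the denominator)
};

// Single pass over the data (Welford's algorithm), so the cost is O(n).
ColumnStats computeStats(const std::vector<double>& values) {
    ColumnStats s;
    double m2 = 0.0; // running sum of squared deviations from the current mean
    for (double v : values) {
        if (std::isnan(v)) continue; // assumption: NaN marks an empty cell
        ++s.count;
        s.sum += v;
        const double delta = v - s.mean;
        s.mean += delta / static_cast<double>(s.count);
        m2 += delta * (v - s.mean);
    }
    if (s.count > 1)
        s.stdDev = std::sqrt(m2 / static_cast<double>(s.count - 1));
    return s;
}
```

The running update avoids a second pass for the variance and is numerically more stable than accumulating the sum of squares directly.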
''' Content under this section is yet to be added '''


In the third section of results and summary view, I have used the tooltip feature of Qt, to give the hints of results. The problem is the provided tooltip feature gives tooltip for the whole widget and not at the corresponding point in the widget. So multiple tooltips can't get placed in the same widget. The solution is:
== Backend ==
* Either subclass the widget and change the functionality.
''' Content under this section is yet to be added '''
* Have multiple widgets (I have used this solution). I have created the array of QLabels (I have not used QLineEdit as these are not HTML aware), and each result line is the separate QLabel. <br />


I have added some helper private functions in "HypothesisTest.cpp" file. These functions avoid silly mistakes and increase the reusability of the code. These functions are:
== Demonstrations ==
* '''findStats:''' Given a pointer to a column, it returns common statistics such as the mean, sum, number of values and standard deviation.
''' Content under this section is yet to be added '''
* '''countParitions:''' It gives the number of classes in the given Independent Variable Column.
* '''findStatsCategorical:''' It also gives common statistics, but here one independent variable (containing categorical classes) and one dependent variable are passed. This is created to ensure O(n) complexity.
* '''getPValue:''' It gives the p-value on giving test_type (T-Test or F-Test), tail_type and corresponding T-value and F-value. It also prints the Alternate and Null Hypothesis in results.
* '''getHtmlTable:''' It takes the number of rows, number of columns and list in row-major fashion containing the data and header values and returns the corresponding HTML Table (in form of QString). It is also created to maintain uniformity in the summary and results in view.
* '''getLine:''' It takes line colour and message as arguments and returns QString containing the formatted HTML line. Default colour used is "black" (if the user does not pass any colour value).
* '''printLine:''' It prints the message in a given colour (default: "black", if the user doesn't pass any colour value), in the given index.
* '''printError:''' It prints the error message.
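The behaviour described for '''getHtmlTable''' above (row count, column count and a row-major list in, HTML table out) can be sketched as follows. This is a simplified stand-in rather than the actual helper: it uses <code>std::string</code> instead of QString, the name <code>htmlTable</code> is hypothetical, it assumes that the first row holds the header values, and it does not escape HTML special characters.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for getHtmlTable. "cells" holds rows * cols entries in
// row-major order; the first row is assumed to be the header row.
std::string htmlTable(std::size_t rows, std::size_t cols,
                      const std::vector<std::string>& cells) {
    if (cells.size() != rows * cols)
        return std::string(); // size mismatch: return an empty string

    std::string html = "<table>";
    for (std::size_t r = 0; r < rows; ++r) {
        const std::string tag = (r == 0) ? "th" : "td"; // header cells in row 0
        html += "<tr>";
        for (std::size_t c = 0; c < cols; ++c)
            html += "<" + tag + ">" + cells[r * cols + c] + "</" + tag + ">";
        html += "</tr>";
    }
    html += "</table>";
    return html;
}
```

Building every table through one function like this is what keeps the summary and result views uniform.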


== TODO ==  
== TODO ==  
Currently, these features are yet to be implemented.
* Add more tooltips to Result View
* All these tests can be performed on the spreadsheet. We have to add the same functionality for database source so that the user doesn't have to copy the database to labplot.  
* Check for assumptions using various tests (like Levene's Test).
* Backend For ZTest. Backend Structure is ready but yet to compute p-value on given z-value.
* Reimplement above features when data source type is Database.  
* Automatic resizing of HypothesisTestDock and also adding vertical and horizontal scroll bars (In small screens, most of the dock is getting truncated).
* Integrate various tests in one workbook to show a summary to the user in a few clicks.
* Adding more tips in the results section.
* All other minor TODOs are already written as comments in the source code itself.
* Automatic Testing. Currently, I have tested my work manually and verified the results using JASP and [https://www.socscistatistics.com/ online calculator].
 
== Commits ==  
== Commits ==  
Currently, my commits are on [https://cgit.kde.org/labplot.git/log/?h=gsoc2019_stats gsoc2019_stats branch].
My Commits: [https://cgit.kde.org/labplot.git/log/?h=gsoc2019_stats&qt=author&q=Devanshu+Agarwal https://cgit.kde.org/labplot.git/log/?h=gsoc2019_stats&qt=author&q=Devanshu+Agarwal] <br>
These commits are reviewed on phabricator by my mentor Stefan Gerlach. Currently, the Levene test and One Way ANOVA test are yet to be reviewed on phabricator. Here is the list of my review requests for ANOVA and Levene Test.  
These commits are reviewed on phabricator by my mentors Stefan Gerlach and Alexander Semke. <br>
* [https://phabricator.kde.org/D21977 One Way ANOVA Test]
<br>
* [https://phabricator.kde.org/D21857 Levene Test]
Review Request: [https://phabricator.kde.org/p/devanshuagarwal/ https://phabricator.kde.org/p/devanshuagarwal/].
 
The whole list of my review requests can be found [https://phabricator.kde.org/p/devanshuagarwal/ here].


== About Me ==  
== About Me ==  
* '''Name:''' Devanshu Agarwal
* '''Name:''' Devanshu Agarwal
* '''Mentors:''' Stefan Gerlach, Alexander Semke
* '''Mentors:''' Stefan Gerlach, Alexander Semke
* '''Email:''' ​[email protected], ​ [email protected]
* '''Email:''' ​[email protected], ​ [email protected]
* '''Github Id:​''' ​https://github.com/agdeva8
* '''Github Id:​''' ​https://github.com/agdeva8
* '''IRC nickname:''' agdeva8
* '''IRC nickname:''' agdeva8
== Screen Shots ==
<gallery>
hypothesis_test_dock.png|Hypothesis Test Dock
two_sample_independent_ttest_1.png|Two Sample Independent T-Test Result
levene_test_1.png|Levene Test Result
two_sample_independent_ttest_full.png|Two Sample Independent T-Test Full Page
one_way_anova_1.png|One Way ANOVA Result
</gallery>

Revision as of 19:52, 20 August 2019

Project Overview

Project Name: Statistical Analysis in Labplot
Purpose: Adding statistically relevant features in labplot.

List of Added Features

I have added the following features for the first evaluation:

  • TTest
    • Two-Sample Independent
    • Two Sample Paired
    • One Sample
  • ZTest
    • Two-Sample Independent
    • Two Sample Paired
    • One Sample
  • ANOVA
    • One Way ANOVA
    • Two Way ANOVA
  • Levene Test: To check for the assumption of homogeneity of variance between populations
  • Correlation Coefficient
    • Pearson's R
    • Kendall's Tau
    • Spearman Rank
    • Chi-Square Test for Independence

Frontend

Since a major part of the project aims to make the user comfortable while finding statistics, special attention was given to the frontend. The two major components of the frontend are Dock Widgets (Hypothesis Dock and Correlation Coefficient Dock) and Test Views (Hypothesis Test View and Correlation Coefficient Test View).

Dock Widgets

Dock Widgets provide an interface for the user to select the options required to perform a test. Two such dock widgets were created during the project (Hypothesis Test Dock and Correlation Coefficient Dock); they appear on the right side of the window according to the test chosen by the user. Some common and important elements of both dock widgets are listed below with a brief description:

  • Name: The name of the dock widget. It can be edited by the user.
  • Comment: Any comment which the user wants to add for future reference.
  • Data
    • Source: Data source type: Spreadsheet or Database. Currently, only Spreadsheet is supported.
    • Spreadsheet: This option appears when the "Spreadsheet" is chosen in "Source". It provides the name of the spreadsheet on which the chosen test will be performed. The user can change the spreadsheet chosen.
  • Test
    • Type: Type of Test the user wants to perform
    • Sub-type: Subtype of the test chosen from "Type" option. It will be shown only when there are subtypes for the test-type chosen.
    • Calculate Statistics for Spreadsheet: This checkbox is useful when the user does not have access to the whole data but has a summary of it, such as the number of elements, the mean, the standard deviation or a contingency table. Uncheck this checkbox (default: checked) to perform a test on summary statistics or on a contingency table. On unchecking the box, a Statistic Table will appear. For now, this feature is supported for the Two-Sample Independent Z-Test and the Chi-Square Test for Independence.
    • Number of Rows: This option is shown when "Calculate Statistics from Spreadsheet" checkbox is unchecked and the user has data in the form of the contingency table. The user can change its default value and dynamically the statistic table will change the number of rows in it.
    • Number of Columns: Similar to "Number of Rows" option for changing the number of columns.
  • Variable: It is visible when "Calculate Statistics from Spreadsheet" is checked.
    • Independent Var. 1, Independent Var. 2: Here the user selects the spreadsheet columns on which the test is to be performed. Each combo-box only shows columns that are valid for the selected test. The labels and the number of these combo-boxes change automatically according to the columns and options chosen. For example, whenever "Independent Var. 1" is intended to contain categorical labels, the label "Independent Var. 2" changes to "Dependent Var. 1".
    • Independent Var. 1 Categorical: This checkbox appears when there is a possibility for column selected in Independent Var. 1 to be categorical such that the "Independent Var. 2" can act as the dependent variable. Check this checkbox for such a case.
    • Recalculate: The user should press this push-button after selecting all the preferred options from the dock. After clicking on this button Test View Widget will get populated by results and statistics. This push-button is disabled when no column is selected in at least one of "Independent Vars" combo boxes.


There are a few more options which are specific to the Test chosen. These are listed below with a very brief description:

  • Levene's Test: The user can perform Levene's test by clicking on it. It is visible for Two-Sample Independent T-Test and One-Way ANOVA. It shows similar behaviour as "Recalculate".
  • Equal Variance: This checkbox should be checked when the user wants the program to assume homogeneity of variance between populations. This assumption can be checked by performing the Levene Test on the data. Hence it is visible in all the cases where the "Levene's Test" button is visible.
  • Hypothesis
    • Null: The user can see the null hypothesis for T-Test and Z-Test. User can't change it directly but can only do so using the "Alternate" hypothesis option.
    • Alternate: The user can select the alternate hypothesis for T-Test and Z-Test. The changes will be reflected in the "Null" hypothesis dynamically.
    • μₒ: The user can set population mean for One-Sample T-Test and Z-Test. The default value set is 0.
    • α: The user can set significance level for all Hypothesis Tests. The default value set is 0.05.
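The link between the "Alternate" and "Null" options above, and the role of α, can be sketched for a one-sample test as follows. This is only an illustration under stated assumptions: the names <code>Tail</code>, <code>hypotheses</code> and <code>rejectNull</code> are hypothetical, the "positive tail" is taken to mean the alternate hypothesis mu > mu0, and the null hypothesis is rejected when the p-value is less than or equal to α (some texts use a strict inequality).

```cpp
#include <string>
#include <utility>

// Hypothetical tail selection; mirrors the Two-Tail / One-Tail choice in the dock.
enum class Tail { Two, Positive, Negative };

// Returns {null hypothesis, alternate hypothesis} for a one-sample test.
// Choosing the alternate fixes the null, which is why the dock only lets the
// user edit the alternate.
std::pair<std::string, std::string> hypotheses(Tail tail) {
    switch (tail) {
    case Tail::Positive:
        return {"mu <= mu0", "mu > mu0"};
    case Tail::Negative:
        return {"mu >= mu0", "mu < mu0"};
    case Tail::Two:
        break;
    }
    return {"mu = mu0", "mu != mu0"};
}

// Decision rule at significance level alpha (default 0.05, as in the dock).
bool rejectNull(double pValue, double alpha = 0.05) {
    return pValue <= alpha;
}
```

For example, with the default α a p-value of 0.03 leads to rejecting the null hypothesis, while a p-value of 0.2 does not.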


Screenshots for dock widgets:

Test View

Content under this section is yet to be added


Statistic Table

Content under this section is yet to be added

Backend

Content under this section is yet to be added

Demonstrations

Content under this section is yet to be added

TODO

  • Add more tooltips to Result View
  • Check for assumptions using various tests (like Levene's Test).
  • Reimplement above features when data source type is Database.
  • Integrate various tests in one workbook to show a summary to the user in a few clicks.
  • All other minor TODOs are already written as comments in the source code itself.

Commits

My Commits: https://cgit.kde.org/labplot.git/log/?h=gsoc2019_stats&qt=author&q=Devanshu+Agarwal
These commits are reviewed on phabricator by my mentors Stefan Gerlach and Alexander Semke.

Review Request: https://phabricator.kde.org/p/devanshuagarwal/.

About Me

  • Name: Devanshu Agarwal
  • Mentors: Stefan Gerlach, Alexander Semke