GSoC/2019/StatusReports/DevanshuAgarwal: Difference between revisions

From KDE Community Wiki
== Project Overview ==
'''Project Name:''' Statistical Analysis in Labplot<br />
'''Purpose:''' Adding statistically relevant features to LabPlot.


== List of Added Features ==
'''Project Name:'''
I have added the following features for the first evaluation:
Statistical Analysis in Labplot
* '''TTest:''' Two-Sample Independent, Two-Sample Paired, One-Sample.
* '''ZTest:''' Two-Sample Independent, Two-Sample Paired, One-Sample.
* '''ANOVA:''' One Way ANOVA.
* '''Levene Test:''' To check for the assumption of homogeneity of variance between populations


== Hypothesis Dock for the Above-Listed Features ==
'''Abstract:'''
I have added a dock for the hypothesis test, where the user can select the type of test to perform and then gets a list of options based on the chosen test.<br />
We aimed to add statistically relevant features to LabPlot. These features should be able to give the correlation between data points and perform various hypothesis tests along with assumption checking. Our target audience includes both scientists and engineers, so we aimed to present results in a form that is detailed enough for a non-statistician to use, yet unobtrusive for someone who is only interested in the numbers.
Some common options, depending on the selected test, are:
* Selecting the columns on which to perform the test. The column names are shown in combo boxes, which are populated according to the type of values present in each column so that the user cannot run a test on unsuitable columns. For example, if the user selects "Two Sample Independent T-Test", the first combo box (for choosing the independent variable) is populated with columns that contain only numerical values or exactly two categorical values, and the second combo box is populated with columns that contain only numerical values. Sometimes a column holds numerical values that actually represent classes (for example, 0 and 1); in such cases the user can tick the "Categorical Variable" checkbox.
* The user can change the significance level (α) in the box provided in the dock. The default value is 0.05.
* The user can change the population mean for the One Sample TTest and ZTest. The default value is 0.
* The user can perform Levene's Test to check for homogeneity of variance between populations. This test is currently used to check the assumption of the "Two Sample Independent TTest" and the "One Way ANOVA Test". It is run by clicking the "Levene's Test" push button.
* The user can select the alternate hypothesis, and the null hypothesis options change correspondingly. Here the user chooses between a Two-Tail Test and a One-Tail Test (positive tail or negative tail). By default, the two-tail test is selected.
* The "Do Test" push button is enabled only when all the mandatory options are selected. For example, if the user wants to perform One Way ANOVA but there are no columns with two or more categorical values, the "Do Test" button is disabled. This prevents the program from crashing.
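The homogeneity-of-variance check mentioned in the options above can be sketched in plain C++. This is a minimal illustration of the textbook Levene statistic (mean-centred variant), not the code used in LabPlot; the function names <code>leveneW</code> and <code>mean</code> are hypothetical. The resulting W would then be compared against an F distribution with (k-1, N-k) degrees of freedom, which needs a statistics library and is left out here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of a non-empty sample.
static double mean(const std::vector<double>& xs) {
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum / static_cast<double>(xs.size());
}

// Levene's W statistic for k >= 2 non-empty groups (hypothetical helper).
// W = (N-k)/(k-1) * sum_i n_i (Zbar_i - Zbar)^2 / sum_i sum_j (Z_ij - Zbar_i)^2
// with Z_ij = |Y_ij - mean_i|. Large W suggests unequal variances.
// Note: if every group has zero spread, the denominator is 0 and W is undefined.
double leveneW(const std::vector<std::vector<double>>& groups) {
    const std::size_t k = groups.size();
    assert(k >= 2);

    std::vector<std::vector<double>> z(k); // absolute deviations per group
    std::size_t total = 0;
    double grandSum = 0.0;
    for (std::size_t i = 0; i < k; ++i) {
        assert(!groups[i].empty());
        const double m = mean(groups[i]);
        for (double y : groups[i]) {
            const double d = std::fabs(y - m);
            z[i].push_back(d);
            grandSum += d;
        }
        total += groups[i].size();
    }
    assert(total > k);
    const double n = static_cast<double>(total);
    const double kd = static_cast<double>(k);
    const double grandMean = grandSum / n;

    double between = 0.0; // spread of group means of Z around the grand mean
    double within = 0.0;  // spread of Z inside each group
    for (std::size_t i = 0; i < k; ++i) {
        const double zi = mean(z[i]);
        between += static_cast<double>(z[i].size()) * (zi - grandMean) * (zi - grandMean);
        for (double d : z[i]) within += (d - zi) * (d - zi);
    }
    return (n - kd) / (kd - 1.0) * between / within;
}
```

For example, the groups {1, 2, 3} and {0, 2, 4} give W = 0.8, while two groups with identical spread such as {1, 2, 3} and {11, 12, 13} give W = 0.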


== Summary and Results in Hypothesis Test View ==  
== Proposal ==
A window named "Hypothesis Test for Spreadsheet" is opened when the user selects the "hypothesis test" option. This window shows the result and summary table.
You can find my GSoC proposal here:
This whole window is divided into three sections.
[https://docs.google.com/document/d/1aoibrQXcpJwP8tGdaNrDwoP2LiTqkj9HwJ3gAqA361U/edit https://docs.google.com/document/d/1aoibrQXcpJwP8tGdaNrDwoP2LiTqkj9HwJ3gAqA361U/edit]
# First Section: It displays the type of test (title) being performed.  
# Second Section: It displays summary statistics in the form of a table, where the common columns are mean, sum, number of values and standard deviation.
# Third Section: It displays the final result of the test. It commonly shows the t-value / f-value, p-value, Null Hypothesis, Alternate Hypothesis, Significance Level and Degrees of Freedom. This section also shows a tip explaining a value when the user hovers the mouse over it. Currently, it explains the p-value: based on the p-value and the significance level (α), it says whether the null hypothesis can be rejected or whether it remains plausible.
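The relation between tail type, p-value and significance level described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard normal distribution (a z statistic) because the C++ standard library provides <code>std::erfc</code>, whereas t and F statistics would need a statistics library. The names <code>Tail</code>, <code>pValueFromZ</code> and <code>conclusion</code> are hypothetical, and rejecting when p ≤ α is the usual convention rather than something confirmed from the source code.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Which alternate hypothesis is being tested (hypothetical enum).
enum class Tail { Two, Positive, Negative };

// p-value of a z statistic under the standard normal distribution.
double pValueFromZ(double z, Tail tail) {
    const double s = std::sqrt(2.0);
    switch (tail) {
    case Tail::Positive: return 0.5 * std::erfc(z / s);   // P(Z > z)
    case Tail::Negative: return 0.5 * std::erfc(-z / s);  // P(Z < z)
    case Tail::Two:      return std::erfc(std::fabs(z) / s); // P(|Z| > |z|)
    }
    return 1.0; // not reached; keeps compilers quiet
}

// Text of the kind a p-value tooltip could show (assumed wording).
std::string conclusion(double p, double alpha = 0.05) {
    assert(alpha > 0.0 && alpha < 1.0);
    return p <= alpha
        ? "The null hypothesis can be rejected."
        : "The null hypothesis remains plausible.";
}
```

With the default α = 0.05, a two-tail test with z = 1.96 sits right at the rejection boundary (p is approximately 0.05), and z = 0 gives p = 1 for the two-tail case and p = 0.5 for either one-tail case.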


The text in this window is formatted (using HTML) for better user readability.
== List of Added Features ==
I have added the following features for the first evaluation:
* '''TTest'''
**Two-Sample Independent
**Two Sample Paired
**One Sample


== Backend Programming and Source Code ==
* '''ZTest'''
The backend is written with large data sets in mind, so the computation time for each test is O(n), where n is the maximum number of rows among the selected columns. <br />
**Two-Sample Independent

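One way to keep per-group summary statistics at O(n), as claimed for the backend above, is to update running statistics for each category in a single pass over the rows. The sketch below uses Welford's update and a hash map keyed by category; it is a standalone illustration, not LabPlot's implementation, and the names <code>GroupStats</code> and <code>statsByCategory</code> are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Running statistics for one category (hypothetical type).
struct GroupStats {
    std::size_t count = 0;
    double sum = 0.0;
    double mean = 0.0;
    double m2 = 0.0; // sum of squared deviations from the running mean

    // Sample standard deviation; 0 when fewer than two values were seen.
    double stdDev() const {
        return count > 1 ? std::sqrt(m2 / static_cast<double>(count - 1)) : 0.0;
    }
};

// One pass over the rows: expected O(n) with a hash map.
// The size of the returned map is the number of classes in the
// independent variable column.
std::unordered_map<std::string, GroupStats>
statsByCategory(const std::vector<std::string>& categories,
                const std::vector<double>& values) {
    assert(categories.size() == values.size());
    std::unordered_map<std::string, GroupStats> result;
    for (std::size_t i = 0; i < values.size(); ++i) {
        GroupStats& g = result[categories[i]];
        const double x = values[i];
        ++g.count;
        g.sum += x;
        // Welford's update keeps mean and m2 numerically stable.
        const double delta = x - g.mean;
        g.mean += delta / static_cast<double>(g.count);
        g.m2 += delta * (x - g.mean);
    }
    return result;
}
```

For categories a, b, a, b with values 1, 10, 3, 20, group a ends up with count 2, sum 4, mean 2 and standard deviation √2, without a second pass over the data.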

In the third section of the results and summary view, I have used Qt's tooltip feature to give hints about the results. The problem is that a tooltip applies to the whole widget rather than to a particular point in the widget, so multiple tooltips cannot be placed in the same widget. The possible solutions are:
* '''ANOVA'''
* Either subclass the widget and change its behaviour.
**One Way ANOVA
* Or use multiple widgets (the solution I chose). I have created an array of QLabels (not QLineEdit, as these are not HTML-aware), and each result line is a separate QLabel. <br />
**Two Way ANOVA
* '''Levene Test:''' To check for the assumption of homogeneity of variance between populations
* '''Correlation Coefficient'''
**Pearson's R
**Kendall's Tau
**Spearman Rank
**Chi-Square Test for Independence


I have added some private helper functions in the "HypothesisTest.cpp" file. These functions help avoid mistakes and make the code more reusable. They are:
== Status Reports ==
* '''findStats:''' Given a pointer to a column, it returns common statistics such as mean, sum, number of values and standard deviation.
'''First Evaluation:'''<br>
* '''countPartitions:''' It gives the number of classes in the given independent variable column.
[https://docs.google.com/document/d/1JxA569fFTcrDUTHdInvKJPz9rXmVYM7DuYT54f7C38U/edit?usp=sharing https://docs.google.com/document/d/1JxA569fFTcrDUTHdInvKJPz9rXmVYM7DuYT54f7C38U/edit?usp=sharing]
* '''findStatsCategorical:''' It also gives common statistics, but takes one independent variable (containing categorical classes) and one dependent variable. It exists to keep the complexity at O(n).
<br><br>
* '''getPValue:''' Given the test_type (T-Test or F-Test), the tail_type and the corresponding T-value or F-value, it returns the p-value. It also prints the Alternate and Null Hypothesis in the results.
'''Second Evaluation:'''<br>
* '''getHtmlTable:''' It takes the number of rows, the number of columns and a row-major list containing the header and data values, and returns the corresponding HTML table (as a QString). It also keeps the summary and results view uniform.
[https://docs.google.com/document/d/1qgss0AssIb3HJIDeAYIos2ig37tk_8UWqDsn4OwDPrQ/edit?usp=sharing https://docs.google.com/document/d/1qgss0AssIb3HJIDeAYIos2ig37tk_8UWqDsn4OwDPrQ/edit?usp=sharing]
* '''getLine:''' It takes the line colour and the message as arguments and returns a QString containing the formatted HTML line. The default colour is "black" (used if no colour is passed).
<br><br>
* '''printLine:''' It prints the message in the given colour (default: "black") at the given index.
'''Final Report:''' <br>
* '''printError:''' It prints the error message.  
I have included all my work with screenshots and demos in the final post of my blog.
Here is the link: [https://agdeva8labplot.blogspot.com/2019/08/final-days-of-gsoc-2019.html https://agdeva8labplot.blogspot.com/2019/08/final-days-of-gsoc-2019.html]
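The getHtmlTable and getLine helpers described above can be illustrated with a small standalone sketch. It uses <code>std::string</code> instead of QString so that it compiles without Qt, and the function names, parameter order and exact markup (<code>htmlTable</code>, <code>htmlLine</code>, a <code>&lt;p&gt;</code> element with an inline colour) are assumptions for illustration, not the signatures in "HypothesisTest.cpp". The first row is treated as the header row, and cell text is not HTML-escaped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds an HTML table from a row-major list of cells (hypothetical helper).
// Row 0 is emitted as header cells (<th>), all other rows as data cells (<td>).
std::string htmlTable(std::size_t rows, std::size_t cols,
                      const std::vector<std::string>& cells) {
    assert(cells.size() == rows * cols);
    std::string html = "<table>";
    for (std::size_t r = 0; r < rows; ++r) {
        const std::string tag = (r == 0) ? "th" : "td";
        html += "<tr>";
        for (std::size_t c = 0; c < cols; ++c) {
            // Row-major indexing: element (r, c) lives at r * cols + c.
            html += "<" + tag + ">" + cells[r * cols + c] + "</" + tag + ">";
        }
        html += "</tr>";
    }
    html += "</table>";
    return html;
}

// Wraps a message in a coloured HTML paragraph; the colour defaults to black.
std::string htmlLine(const std::string& message,
                     const std::string& colour = "black") {
    return "<p style=\"color:" + colour + ";\">" + message + "</p>";
}
```

Calling <code>htmlTable(2, 2, {"Mean", "Sum", "2.5", "10"})</code> yields a table with one header row and one data row. Generating every table through one function is what keeps the layout uniform across tests.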


== TODO ==  
Currently, these features are yet to be implemented.
* Add more tooltips to Result View
* All these tests can be performed on the spreadsheet. The same functionality has to be added for database sources so that the user doesn't have to copy the database into LabPlot.
* Check for assumptions using various tests (like Levene's Test).
* Backend for ZTest. The backend structure is ready, but the p-value for a given z-value is yet to be computed.
* Reimplement the above features for the case where the data source type is Database.
* Automatic resizing of HypothesisTestDock and adding vertical and horizontal scroll bars (on small screens, most of the dock gets truncated).
* Integrate various tests in one workbook to show a summary to the user in a few clicks.
* Adding more tips in the results section.
* All other minor TODOs are already written as comments in the source code itself.
* Automatic testing. Currently, I have tested my work manually and verified the results using JASP and an [https://www.socscistatistics.com/ online calculator].
 
== Future Goals ==
We aim to generate a single self-contained report for the data currently being analysed by the user. This report will show the statistical analysis summary and graphs in one place, at a single click, without the user having to explicitly select or instruct anything unless he/she feels the need to do so. The idea is to make data analysis easy for the user and give him/her the freedom to play around with the data while keeping track of the changes in the different statistical parameters.


== Commits ==  
Currently, my commits are on [https://cgit.kde.org/labplot.git/log/?h=gsoc2019_stats gsoc2019_stats branch].
My Commits: [https://cgit.kde.org/labplot.git/log/?h=gsoc2019_stats&qt=author&q=Devanshu+Agarwal https://cgit.kde.org/labplot.git/log/?h=gsoc2019_stats&qt=author&q=Devanshu+Agarwal] <br>
These commits are reviewed on Phabricator by my mentor Stefan Gerlach. Currently, the Levene test and the One Way ANOVA test are yet to be reviewed on Phabricator. Here is the list of my review requests for the ANOVA and Levene tests.
These commits are reviewed on phabricator by my mentors Stefan Gerlach and Alexander Semke. <br>
* [https://phabricator.kde.org/D21977 One Way ANOVA Test]
<br>
* [https://phabricator.kde.org/D21857 Levene Test]
Review Request: [https://phabricator.kde.org/p/devanshuagarwal/ https://phabricator.kde.org/p/devanshuagarwal/].


The whole list of my review requests can be found [https://phabricator.kde.org/p/devanshuagarwal/ here].
== My Blog ==
[https://agdeva8labplot.blogspot.com/ https://agdeva8labplot.blogspot.com/]


== About Me ==  
* '''Name:''' Devanshu Agarwal
* '''Mentors:''' Alexander Semke, Stefan Gerlach.
* '''Mentors:''' Stefan Gerlach, Alexander Semke
* '''Email:''' ​[email protected], ​ [email protected]
* '''Github Id:​''' ​https://github.com/agdeva8
* '''IRC nickname:''' agdeva8

Latest revision as of 14:45, 24 August 2019

Project Overview

Project Name: Statistical Analysis in Labplot

Abstract: We aimed to add statistically relevant features to LabPlot. These features should be able to give the correlation between data points and perform various hypothesis tests along with assumption checking. Our target audience includes both scientists and engineers, so we aimed to present results in a form that is detailed enough for a non-statistician to use, yet unobtrusive for someone who is only interested in the numbers.

Proposal

You can find my GSoC proposal here: https://docs.google.com/document/d/1aoibrQXcpJwP8tGdaNrDwoP2LiTqkj9HwJ3gAqA361U/edit

List of Added Features

I have added the following features for the first evaluation:

  • TTest
    • Two-Sample Independent
    • Two Sample Paired
    • One Sample
  • ZTest
    • Two-Sample Independent
  • ANOVA
    • One Way ANOVA
    • Two Way ANOVA
  • Levene Test: To check for the assumption of homogeneity of variance between populations
  • Correlation Coefficient
    • Pearson's R
    • Kendall's Tau
    • Spearman Rank
    • Chi-Square Test for Independence

Status Reports

First Evaluation:
https://docs.google.com/document/d/1JxA569fFTcrDUTHdInvKJPz9rXmVYM7DuYT54f7C38U/edit?usp=sharing

Second Evaluation:
https://docs.google.com/document/d/1qgss0AssIb3HJIDeAYIos2ig37tk_8UWqDsn4OwDPrQ/edit?usp=sharing

Final Report:
I have included all my work with screenshots and demos in the final post of my blog. Here is the link: https://agdeva8labplot.blogspot.com/2019/08/final-days-of-gsoc-2019.html

TODO

  • Add more tooltips to Result View
  • Check for assumptions using various tests (like Levene's Test).
  • Reimplement the above features for the case where the data source type is Database.
  • Integrate various tests in one workbook to show a summary to the user in a few clicks.
  • All other minor TODOs are already written as comments in the source code itself.

Future Goals

We aim to generate a single self-contained report for the data currently being analysed by the user. This report will show the statistical analysis summary and graphs in one place, at a single click, without the user having to explicitly select or instruct anything unless he/she feels the need to do so. The idea is to make data analysis easy for the user and give him/her the freedom to play around with the data while keeping track of the changes in the different statistical parameters.

Commits

My Commits: https://cgit.kde.org/labplot.git/log/?h=gsoc2019_stats&qt=author&q=Devanshu+Agarwal
These commits are reviewed on phabricator by my mentors Stefan Gerlach and Alexander Semke.

Review Request: https://phabricator.kde.org/p/devanshuagarwal/.

My Blog

https://agdeva8labplot.blogspot.com/

About Me