NAME

Statistics::Covid::Analysis::Model::Simple - Fits the data to various models

VERSION

Version 0.23

DESCRIPTION

This package contains routine(s) for modelling 2D data. It can be used to model how markers in Statistics::Covid::Datum, such as confirmed, vary with time by fitting the series of (time, value) pairs to a polynomial (c0+c1*x+c2*x^2+...+cn*x^n) or an exponential (c0 * c1^x) model.
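
As a minimal sketch of the two built-in model shapes (this is not the module's own code), the following evaluates a polynomial and an exponential for given coefficients. The subs eval_polynomial and eval_exponential are hypothetical helpers introduced here for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: c0 + c1*x + c2*x^2 + ... + cn*x^n
# Coefficients are passed lowest power first; Horner's scheme is used.
sub eval_polynomial {
    my ($coeffs, $x) = @_;
    my $y = 0;
    $y = $y * $x + $_ for reverse @$coeffs;
    return $y;
}

# Hypothetical helper: c0 * c1^x
# Note that Perl's power operator is **, whereas formulas use ^.
sub eval_exponential {
    my ($c0, $c1, $x) = @_;
    return $c0 * $c1 ** $x;
}

print eval_polynomial([1, 2, 3], 2), "\n";   # 1 + 2*2 + 3*2^2
print eval_exponential(2, 3, 4), "\n";       # 2 * 3^4
```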

SYNOPSIS

        use Statistics::Covid;
        use Statistics::Covid::Datum;
        use Statistics::Covid::Utils;
        use Statistics::Covid::Analysis::Model::Simple;

        # read data from db
        my $covid = Statistics::Covid->new({
                'config-file' => 't/config-for-t.json',
                'debug' => 2,
        }) or die "Statistics::Covid->new() failed";
        # retrieve data from DB for selected locations (in the UK)
        # data will come out as an array of Datum objects sorted wrt time
        # (the 'datetimeUnixEpoch' field)
        my $objs = $covid->select_datums_from_db_for_specific_location_time_ascending(
                #{'like' => 'Ha%'}, # the location (wildcard)
                ['Halton', 'Havering'],
                #{'like' => 'Halton'}, # the location (wildcard)
                #{'like' => 'Havering'}, # the location (wildcard)
                'UK', # the belongsto (could have been wildcarded)
        );
        # create a dataframe
        my $df = Statistics::Covid::Utils::datums2dataframe({
                'datum-objs' => $objs,
                'groupby' => ['name'],
                'content' => ['confirmed', 'datetimeUnixEpoch'],
        });
        # convert all 'datetimeUnixEpoch' data to hours, the oldest will be hour 0
        for(sort keys %$df){
                Statistics::Covid::Utils::discretise_increasing_sequence_of_seconds(
                        $df->{$_}->{'datetimeUnixEpoch'}, # in-place modification
                        3600 # seconds->hours
                )
        }

        # do an exponential fit
        my $ret = Statistics::Covid::Analysis::Model::Simple::fit({
                'dataframe' => $df,
                'X' => 'datetimeUnixEpoch', # our X is this field from the dataframe
                'Y' => 'confirmed', # our Y is this field
                'initial-guess' => {'c1'=>1, 'c2'=>1}, # initial values guess
                'exponential-fit' => 1,
                'fit-params' => {
                        'maximum_iterations' => 100000
                }
        });

        # fit to a polynomial of degree 10 (max power of x is 10)
        $ret = Statistics::Covid::Analysis::Model::Simple::fit({
                'dataframe' => $df,
                'X' => 'datetimeUnixEpoch', # our X is this field from the dataframe
                'Y' => 'confirmed', # our Y is this field
                # initial values guess (here ONLY for some coefficients)
                'initial-guess' => {'c1'=>1, 'c2'=>1},
                'polynomial-fit' => 10, # max power of x is 10
                'fit-params' => {
                        'maximum_iterations' => 100000
                }
        });

        # fit to an ad-hoc formula in 'x'
        # (see L<Math::Symbolic::Operator> for supported operators)
        $ret = Statistics::Covid::Analysis::Model::Simple::fit({
                'dataframe' => $df,
                'X' => 'datetimeUnixEpoch', # our X is this field from the dataframe
                'Y' => 'confirmed', # our Y is this field
                # initial values guess (here ONLY for some coefficients)
                'initial-guess' => {'c1'=>1, 'c2'=>1},
                'formula' => 'c1*sin(x) + c2*cos(x)',
                'fit-params' => {
                        'maximum_iterations' => 100000
                }
        });

        # this is what fit() returns

        # $ret is a hashref where key=group-name, and
        # value=[ 3.4,  # <<<< mean squared error of the fit
        #  [
        #     ['c1', 0.123, 0.0005], # <<< coefficient c1=0.123, accuracy 0.0005 (ignore that)
        #     ['c2', 1.444, 0.0005]  # <<< coefficient c2=1.444
        #  ]
        # ]
        # and group-name in our example refers to each of the locations selected from DB
        # in this case data from 'Halton' in 'UK' was fitted on 0.123*1.444^time with an m.s.e=3.4

        # This is what the dataframe looks like:
        #  {
        #  Halton   => {
        #               confirmed => [0, 0, 3, 4, 4, 5, 7, 7, 7, 8, 8, 8],
        #               datetimeUnixEpoch => [
        #                 1584262800,
        #                 1584349200,
        #                 1584435600,
        #                 1584522000,
        #                 1584637200,
        #                 1584694800,
        #                 1584781200,
        #                 1584867600,
        #                 1584954000,
        #                 1585040400,
        #                 1585126800,
        #                 1585213200,
        #               ],
        #             },
        #  Havering => {
        #               confirmed => [5, 5, 7, 7, 14, 19, 30, 35, 39, 44, 47, 70],
        #               datetimeUnixEpoch => [
        #                 1584262800,
        #                 1584349200,
        #                 1584435600,
        #                 1584522000,
        #                 1584637200,
        #                 1584694800,
        #                 1584781200,
        #                 1584867600,
        #                 1584954000,
        #                 1585040400,
        #                 1585126800,
        #                 1585213200,
        #               ],
        #             },
        #  }

        # and after converting the datetimeUnixEpoch values to hours and setting the oldest to t=0
        #  {
        #  Halton   => {
        #                confirmed => [0, 0, 3, 4, 4, 5, 7, 7, 7, 8, 8, 8],
        #                datetimeUnixEpoch => [0, 24, 48, 72, 104, 120, 144, 168, 192, 216, 240, 264],
        #              },
        #  Havering => {
        #                confirmed => [5, 5, 7, 7, 14, 19, 30, 35, 39, 44, 47, 70],
        #                datetimeUnixEpoch => [0, 24, 48, 72, 104, 120, 144, 168, 192, 216, 240, 264],
        #              },
        #  }

fit

Tries to fit a model to some 2D data using Algorithm::CurveFit. It knows how to do an exponential fit (c0 * c1^x), a polynomial fit (c0+c1*x+c2*x^2+...+cn*x^n) or a fit to any other formula Math::Symbolic supports.

It takes a hashref of parameters:

dataframe, this is a hashref where each key is a separate piece of data that needs to be fitted. For example, key can be 'China' and/or 'Italy' etc. The value for each key is a hashref. The keys of this are data names and the values are arrayrefs of the corresponding values for that key. Here is an example:
        $df = {
                'China' => {
                        'confirmed' => [1,2,3],
                        'datetimeUnixEpoch' => [1584262800, 1584264800, 1584266800],
                },
                'Italy' => {
                        'confirmed' => [5,6,7],
                        'datetimeUnixEpoch' => [1584265800, 1584267800, 1584269800],
                },
        };

'China' and 'Italy' are completely independent; their datetimeUnixEpoch values need not be the same. Such a dataframe can hold any type of data. In our example it holds confirmed-case counts over time. The number of 1st-level and 2nd-level keys can be 1 or more (not just 2 as in the above example). Such a dataframe can be created from an array of Statistics::Covid::Datum objects using Statistics::Covid::Utils::datums2dataframe. An example of creating it is in the SYNOPSIS, above.
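
Since each group supplies its own X and Y columns, it can be useful to check a hand-built dataframe before fitting. The following is a minimal sketch, assuming only the dataframe layout described above; bad_groups is a hypothetical helper and not part of this package.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): returns the names of
# groups whose X or Y column is missing, or whose columns differ in length.
sub bad_groups {
    my ($df, $X, $Y) = @_;
    my @bad;
    for my $g (sort keys %$df) {
        my ($xs, $ys) = @{ $df->{$g} }{ $X, $Y };
        push @bad, $g
            unless ref($xs) eq 'ARRAY' && ref($ys) eq 'ARRAY' && @$xs == @$ys;
    }
    return @bad;
}

my $df = {
    'China' => {
        'confirmed'         => [1, 2, 3],
        'datetimeUnixEpoch' => [1584262800, 1584264800, 1584266800],
    },
    'Italy' => {
        'confirmed'         => [5, 6],   # deliberately one value short
        'datetimeUnixEpoch' => [1584265800, 1584267800, 1584269800],
    },
};

# Lists the groups that would need fixing before a fit is attempted.
print join(',', bad_groups($df, 'datetimeUnixEpoch', 'confirmed')), "\n";
```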

exponential-fit, if this key exists and is not zero, an exponential fit will be done. It is optional. It cannot exist at the same time as the polynomial-fit key.
polynomial-fit, if this key exists and is not zero, a polynomial fit will be done and the value is the degree of the polynomial (the maximum power of x). It is optional. It cannot exist at the same time as the exponential-fit key.
formula, must exist if neither exponential-fit nor polynomial-fit exists. It is a string with a mathematical formula of a function in x whose coefficients (the constants, the parameters, etc.) will be found using Algorithm::CurveFit. An example: c1*x + c2*x^2 or a*sin(x) + b*cos(x). Only a few operators are supported (see Math::Symbolic::Operator for what is supported). The power (exponentiation) operator is ^ (and not Perl's **).
X, a string of a field name (one of the 2nd-level keys of the input dataframe) which will supply the x-data (in the above example, one of confirmed or datetimeUnixEpoch). Since, by convention, X is the independent variable, it makes sense to use datetimeUnixEpoch, i.e. time.
Y, a string of a field name (one of the 2nd-level keys of the input dataframe). Y denotes the dependent variable and therefore confirmed (which is a function of datetimeUnixEpoch) can be used in our case.
accuracy, this can either be a scalar or a hashref and represents the accuracy for each coefficient we seek to fit. If it's a scalar (a number in our case) it will be used for all coefficients in the formula. If it is a hashref, it will hold accuracy values for one or more or all coefficients in the formula.
initial-guess, this can either be a scalar or a hashref and represents the initial (guessed) value for each coefficient we seek to fit. If it's a scalar (a number in our case) it will be used for all coefficients in the formula. If it is a hashref, it will hold initial values for one or more or all coefficients in the formula. Initial conditions are crucial in some cases. In other cases they can be omitted. On rare occasions and for complex functions (not for exponential or polynomial fits) Algorithm::CurveFit can stall or break if this guess is not right; it complains that its matrices are filled with Inf.
groups, this is an arrayref of one or more or all of the 1st-level keys. Each key mentioned will be fitted. If groups is omitted then all 1st-level keys in the dataframe will be fitted.

On failure it returns undef. On success it returns a hashref where each key is a group name and each value has this form:

        [ 3.4,  # <<<< mean squared error of the fit
          [
             ['c1', 0.123, 0.0005], # <<< coefficient c1=0.123, accuracy 0.0005 (ignore that)
             ['c2', 1.444, 0.0005]  # <<< coefficient c2=1.444
             # ... for all the coefficients in the input formula (or polynomial)
          ]
        ]

The group name in our example refers to each of the locations selected from the DB. In this case, data from 'Halton' in 'UK' was fitted on 0.123*1.444^time with an m.s.e. of 3.4.
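
As a rough sketch of how such a return value might be consumed, the following walks a result of that shape (values copied from the example), collects the coefficients by name and evaluates the fitted model at one point. The helper coefficients is hypothetical, and writing the exponential model as c1 * c2^x is an assumption taken from the example, not from the module's source.

```perl
use strict;
use warnings;

# A result shaped like the one described above (values copied from the example).
my $ret = {
    'Halton' => [
        3.4,                          # mean squared error of the fit
        [
            ['c1', 0.123, 0.0005],    # [name, value, accuracy]
            ['c2', 1.444, 0.0005],
        ],
    ],
};

# Hypothetical helper: flatten the list of [name, value, accuracy]
# triplets into (name => value) pairs.
sub coefficients {
    my ($fit) = @_;
    return map { ($_->[0], $_->[1]) } @{ $fit->[1] };
}

for my $group (sort keys %$ret) {
    my $mse = $ret->{$group}->[0];
    my %c   = coefficients($ret->{$group});
    printf "%s: mse=%s %s\n", $group, $mse,
        join(' ', map { "$_=$c{$_}" } sort keys %c);
    # Assumption: exponential model written as c1 * c2^x, as in the example.
    my $x = 2;
    printf "  model at x=%d: %.4f\n", $x, $c{c1} * $c{c2} ** $x;
}
```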

EXPORT

None by default. Statistics::Covid::Analysis::Model::Simple::fit() is the sub to call. Also, $DEBUG can be set to 1 or more for more verbose output, like $Statistics::Covid::Analysis::Model::Simple::DEBUG=1;

SEE ALSO

This package relies heavily on Algorithm::CurveFit. The formula notation is exactly the one used by Math::Symbolic.

Statistics::Regression and Statistics::LineFit can be used to do linear regression, which is a far simpler method than the symbolic approach we take in this package. However, the benefit of our approach is that it can try to fit data with any formula, any model. The cost is that it is slower (for complex cases) and may lack robustness.

AUTHOR

Andreas Hadjiprocopis, <bliako at cpan.org>, <andreashad2 at gmail.com>

BUGS

This module has been put together very quickly and under pressure. There must be quite a few bugs.

Please report any bugs or feature requests to bug-statistics-Covid at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Statistics-Covid. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Statistics::Covid::Analysis::Model::Simple

DEDICATIONS

Almaz

ACKNOWLEDGEMENTS

Perlmonks for supporting the world with answers and programming enlightenment
DBIx::Class
the data providers:
Johns Hopkins University,
UK government,
https://www.bbc.co.uk (for disseminating official results)

LICENSE AND COPYRIGHT

Copyright 2020 Andreas Hadjiprocopis.

This program is free software; you can redistribute it and/or modify it under the terms of the Artistic License (2.0). You may obtain a copy of the full license at:

http://www.perlfoundation.org/artistic_license_2_0

Any use, modification, and distribution of the Standard or Modified Versions is governed by this Artistic License. By using, modifying or distributing the Package, you accept this license. Do not use, modify, or distribute the Package, if you do not accept this license.

If your Modified Version has been derived from a Modified Version made by someone other than you, you are nevertheless required to ensure that your Modified Version complies with the requirements of this license.

This license does not grant you the right to use any trademark, service mark, tradename, or logo of the Copyright Holder.

This license includes the non-exclusive, worldwide, free-of-charge patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Package with respect to any patent claims licensable by the Copyright Holder that are necessarily infringed by the Package. If you institute patent litigation (including a cross-claim or counterclaim) against any party alleging that the Package constitutes direct or contributory patent infringement, then this Artistic License to you shall terminate on the date that such litigation is filed.

Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.