NAME

Statistics::Covid - Fetch, store in DB, retrieve and analyse Covid-19 statistics from data providers

VERSION

Version 0.23

DESCRIPTION

This module fetches Covid-19 statistics from online or offline data providers, stores them in a database, retrieves them from the database and analyses them. One such provider is the Johns Hopkins University, which I hope I am not obstructing (please send an email to the author if that is the case).

After specifying one or more data providers (each as a URL and a header for the data and, optionally, for the metadata), this module will attempt to fetch the latest data and store it in a database (SQLite or MySQL; only SQLite has been tested so far). Each batch of data should ideally contain information about one or more locations at a given point in time. All items in the batch are extracted and stored in the database, each with its location name and time (the time it was published, not the time it was fetched) as primary keys. Each such data item (Datum) is described in Statistics::Covid::Datum::Table and the relevant class is Statistics::Covid::Datum. It contains fields such as: population, confirmed, unconfirmed, terminal, recovered.

The focus was on creating a very high-level interface which distances the user as much as possible from the nitty-gritty details of fetching data using LWP::UserAgent and dealing with the database using DBI and DBIx::Class.

This is an early release until the functionality and the table schemata solidify.

SYNOPSIS

    use Statistics::Covid;
    use Statistics::Covid::Datum;

    my $covid = Statistics::Covid->new({
            'config-file' => 't/config-for-t.json',
            'providers' => ['UK::BBC', 'UK::GOVUK', 'World::JHU'],
            'save-to-file' => 1,
            'save-to-db' => 1,
            'debug' => 2,
    }) or die "Statistics::Covid->new() failed";
    # fetch all the data available (possibly json), process it,
    # create Datum objects, store them in DB and return an array
    # of the Datum objects just fetched (and not what is already in DB).
    my $newobjs = $covid->fetch_and_store();

    print $_->toString() for (@$newobjs);

    print "Confirmed cases for ".$_->name()
            ." on ".$_->date()
            ." are: ".$_->confirmed()
            ."\n"
    for (@$newobjs);

    my $someObjs = $covid->select_datums_from_db({
            'conditions' => {
                    belongsto=>'UK',
                    name=>'Hackney'
            }
    });

    print "Confirmed cases for ".$_->name()
            ." on ".$_->date()
            ." are: ".$_->confirmed()
            ."\n"
    for (@$someObjs);

    # or for a single place (this sub sorts results wrt publication time)
    my $timelineObjs = $covid->select_datums_from_db_for_specific_location_time_ascending('Hackney');
    # or for a wildcard match
    # $covid->select_datums_from_db_for_specific_location_time_ascending({'like'=>'Hack%'});
    # and maybe specifying max rows
    # $covid->select_datums_from_db_for_specific_location_time_ascending({'like'=>'Hack%'}, {'rows'=>10});
    for my $anobj (@$timelineObjs){
            print $anobj->toString()."\n";
    }

    print "datum rows in DB: ".$covid->db_count_datums()."\n";

    use Statistics::Covid;
    use Statistics::Covid::Datum;
    use Statistics::Covid::Utils;
    use Statistics::Covid::Analysis::Plot::Simple;

    # now read some data from DB and do things with it
    $covid = Statistics::Covid->new({   
            'config-file' => 't/config-for-t.json',
            'debug' => 2,
    }) or die "Statistics::Covid->new() failed";
    # retrieve data from DB for selected locations (in the UK)
    # data will come out as an array of Datum objects sorted wrt time
    # (the 'datetimeUnixEpoch' field)
    $objs = $covid->select_datums_from_db_for_specific_location_time_ascending(
            #{'like' => 'Ha%'}, # the location (wildcard)
            ['Halton', 'Havering'],
            #{'like' => 'Halton'}, # the location (wildcard)
            #{'like' => 'Havering'}, # the location (wildcard)
            'UK', # the belongsto (could have been wildcarded)
    );
    # create a dataframe
    $df = Statistics::Covid::Utils::datums2dataframe({
            'datum-objs' => $objs,
            # collect data from all those with same 'name' and same 'belongsto'
            # and maybe plot this data as a single curve (or fit or whatever)
            'groupby' => ['name','belongsto'],
            # put only these values of the datum object into the dataframe
            # one of them will be X, another will be Y
            # if you want to plot multiple Y, then add here more dependent columns
            # like ('unconfirmed').
            'content' => ['confirmed', 'unconfirmed', 'datetimeUnixEpoch'],
    });

    # plot confirmed vs time
    $ret = Statistics::Covid::Analysis::Plot::Simple::plot({
            'dataframe' => $df,
            # saves to this file:
            'outfile' => 'confirmed-over-time.png',
            # plot this column against X
            # (X is not present, so the default is time ('datetimeUnixEpoch'))
            'Y' => 'confirmed',
    });

    # plot confirmed vs unconfirmed
    # if you see a vertical line it means that your data has no 'unconfirmed'
    $ret = Statistics::Covid::Analysis::Plot::Simple::plot({
            'dataframe' => $df,
            # saves to this file:
            'outfile' => 'confirmed-vs-unconfirmed.png',
            'X' => 'unconfirmed',
            # plot this column against X
            'Y' => 'confirmed',
    });

    # plot using an array of datum objects as they came
    # out of the DB. A dataframe is created internally to the plot()
    # but this is not recommended if you are going to make several
    # plots because equally many dataframes must be created and destroyed
    # internally instead of recycling them like we do here...
    $ret = Statistics::Covid::Analysis::Plot::Simple::plot({
            'datum-objs' => $objs,
            # saves to this file:
            'outfile' => 'confirmed-over-time.png',
            # plot this column as Y
            'Y' => 'confirmed', 
            # X is not present so default is time ('datetimeUnixEpoch')
            # and make several plots, each group must have 'name' common
            'GroupBy' => ['name', 'belongsto'],
            'date-format-x' => {
                    # see Chart::Clicker::Axis::DateTime for all the options:
                    format => '%m', ##<<< specify timeformat for X axis, only months
                    position => 'bottom',
                    orientation => 'horizontal'
            },
    });

    use Statistics::Covid;
    use Statistics::Covid::Datum;
    use Statistics::Covid::Utils;
    use Statistics::Covid::Analysis::Model::Simple;

    # create a dataframe
    my $df = Statistics::Covid::Utils::datums2dataframe({
            'datum-objs' => $objs,
            'groupby' => ['name'],
            'content' => ['confirmed', 'datetimeUnixEpoch'],
    });
    # convert all 'datetimeUnixEpoch' data to hours, the oldest will be hour 0
    for(sort keys %$df){
            Statistics::Covid::Utils::discretise_increasing_sequence_of_seconds(
                    $df->{$_}->{'datetimeUnixEpoch'}, # in-place modification
                    3600 # seconds->hours
            )
    }

    # do an exponential fit
    my $ret = Statistics::Covid::Analysis::Model::Simple::fit({
            'dataframe' => $df,
            'X' => 'datetimeUnixEpoch', # our X is this field from the dataframe
            'Y' => 'confirmed', # our Y is this field
            'initial-guess' => {'c1'=>1, 'c2'=>1}, # initial values guess
            'exponential-fit' => 1,
            'fit-params' => {
                    'maximum_iterations' => 100000
            }
    });

    # fit to a polynomial of degree 10 (max power of x is 10)
    $ret = Statistics::Covid::Analysis::Model::Simple::fit({
            'dataframe' => $df,
            'X' => 'datetimeUnixEpoch', # our X is this field from the dataframe
            'Y' => 'confirmed', # our Y is this field
            # initial values guess (here ONLY for some coefficients)
            'initial-guess' => {'c1'=>1, 'c2'=>1},
            'polynomial-fit' => 10, # max power of x is 10
            'fit-params' => {
                    'maximum_iterations' => 100000
            }
    });

    # fit to an ad-hoc formula in 'x'
    # (see L<Math::Symbolic::Operator> for supported operators)
    $ret = Statistics::Covid::Analysis::Model::Simple::fit({
            'dataframe' => $df,
            'X' => 'datetimeUnixEpoch', # our X is this field from the dataframe
            'Y' => 'confirmed', # our Y is this field
            # initial values guess (here ONLY for some coefficients)
            'initial-guess' => {'c1'=>1, 'c2'=>1},
            'formula' => 'c1*sin(x) + c2*cos(x)',
            'fit-params' => {
                    'maximum_iterations' => 100000
            }
    });

    # this is what fit() returns

    # $ret is a hashref where key=group-name, and
    # value=[ 3.4,  # <<<< mean squared error of the fit
    #  [
    #     ['c1', 0.123, 0.0005], # <<< coefficient c1=0.123, accuracy 0.0005 (ignore that)
    #     ['c2', 1.444, 0.0005]  # <<< coefficient c2=1.444
    #  ]
    # ]
    # and group-name in our example refers to each of the locations selected from DB
    # in this case data from 'Halton' in 'UK' was fitted on 0.123*1.444^time with an m.s.e=3.4

    # This is what the dataframe looks like:
    #  {
    #  Halton   => {
    #               confirmed => [0, 0, 3, 4, 4, 5, 7, 7, 7, 8, 8, 8],
    #               datetimeUnixEpoch => [
    #                 1584262800,
    #                 1584349200,
    #                 1584435600,
    #                 1584522000,
    #                 1584637200,
    #                 1584694800,
    #                 1584781200,
    #                 1584867600,
    #                 1584954000,
    #                 1585040400,
    #                 1585126800,
    #                 1585213200,
    #               ],
    #             },
    #  Havering => {
    #               confirmed => [5, 5, 7, 7, 14, 19, 30, 35, 39, 44, 47, 70],
    #               datetimeUnixEpoch => [
    #                 1584262800,
    #                 1584349200,
    #                 1584435600,
    #                 1584522000,
    #                 1584637200,
    #                 1584694800,
    #                 1584781200,
    #                 1584867600,
    #                 1584954000,
    #                 1585040400,
    #                 1585126800,
    #                 1585213200,
    #               ],
    #             },
    #  }

    # and after converting the datetimeUnixEpoch values to hours and setting the oldest to t=0
    #  {
    #  Halton   => {
    #                confirmed => [0, 0, 3, 4, 4, 5, 7, 7, 7, 8, 8, 8],
    #                datetimeUnixEpoch => [0, 24, 48, 72, 104, 120, 144, 168, 192, 216, 240, 264],
    #              },
    #  Havering => {
    #                confirmed => [5, 5, 7, 7, 14, 19, 30, 35, 39, 44, 47, 70],
    #                datetimeUnixEpoch => [0, 24, 48, 72, 104, 120, 144, 168, 192, 216, 240, 264],
    #              },
    #  }
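
The conversion from epoch seconds to elapsed hours shown in the comments above can be reproduced with a few lines of plain Perl. This is only a sketch: the helper seconds_to_elapsed_units() is hypothetical and is not the module's discretise_increasing_sequence_of_seconds(), which may round or validate its input differently. It assumes the input is already sorted in ascending order.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Statistics::Covid: converts an
# increasing list of epoch seconds, in place, into whole units
# elapsed since the first (oldest) element.
sub seconds_to_elapsed_units {
    my ($seconds, $unit) = @_;   # arrayref of epochs, unit length in seconds
    return unless @$seconds;
    my $t0 = $seconds->[0];      # the oldest timestamp becomes time zero
    # 'for' aliases $_ to each element, so the array is modified in place
    $_ = int( ($_ - $t0) / $unit ) for @$seconds;
    return;
}

# the first five timestamps from the dataframe listing above
my @epochs = (1584262800, 1584349200, 1584435600, 1584522000, 1584637200);
seconds_to_elapsed_units(\@epochs, 3600);   # seconds -> hours
print join(', ', @epochs), "\n";            # prints "0, 24, 48, 72, 104"
```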



    use Statistics::Covid::Analysis::Plot::Simple;

    # plot something ($io is a database IO object created elsewhere, not shown here)
    my $objs = $io->db_select({
            conditions => {belongsto=>'UK', name=>{'like' => 'Ha%'}}
    });
    my $outfile = 'chartclicker.png';
    my $ret = Statistics::Covid::Analysis::Plot::Simple::plot({
            'datum-objs' => $objs,
            # saves to this file:
            'outfile' => $outfile,
            # plot this column (x-axis is time always)
            'Y' => 'confirmed', 
            # and make several plots, each group must have 'name' common
            'GroupBy' => ['name']
    });
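
The structure returned by fit(), as described in the synopsis comments, can be unpacked and used to evaluate the fitted curve. The sketch below hard-codes a sample return value copied from those comments and assumes the exponential model has the form c1 * c2^x, as the 'Halton' example suggests. The helpers coefficients_of() and predict_exponential() are hypothetical and are not provided by the module.

```perl
use strict;
use warnings;

# Sample return value, in the shape shown in the synopsis comments.
my $ret = {
    'Halton' => [
        3.4,                        # mean squared error of the fit
        [
            ['c1', 0.123, 0.0005],  # [name, value, accuracy]
            ['c2', 1.444, 0.0005],
        ],
    ],
};

# Hypothetical helper: map coefficient names to values for one group.
sub coefficients_of {
    my ($fit) = @_;
    my %coeff = map { $_->[0] => $_->[1] } @{ $fit->[1] };
    return \%coeff;
}

# Hypothetical helper: evaluate the assumed model c1 * c2**x at x.
sub predict_exponential {
    my ($fit, $x) = @_;
    my $c = coefficients_of($fit);
    return $c->{c1} * $c->{c2} ** $x;
}

for my $group (sort keys %$ret) {
    my $fit = $ret->{$group};
    printf "%s: mse=%.2f, model value at x=1: %.4f\n",
        $group, $fit->[0], predict_exponential($fit, 1);
}
```

Here x is in whatever unit the 'X' column of the dataframe was converted to before fitting (hours in the synopsis).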

EXAMPLE SCRIPT

script/statistics-covid-fetch-data-and-store.pl is a script which accompanies this distribution. It can be used to fetch any data from specified providers using a specified configuration file.

For a quick start:

    cp t/config-for-t.json config.json
    # optionally modify config.json to change the destination data dirs
    # now fetch data from some default data providers:
    script/statistics-covid-fetch-data-and-store.pl --config-file config.json

The above will fetch the latest data and insert it into an SQLite database at data/db/covid19.sqlite. Each time the script is called, it fetches the data again and saves it to a file timestamped with the publication date. So, if that data was already fetched, the file is simply overwritten with the same data.

As far as updating the database is concerned, only newer, up-to-date data will be inserted. So, calling this script once or twice a day will make sure you have the latest data without accumulating it redundantly.

But please call this script AT MOST once or twice per day so as not to obstruct public resources. Please, Please.
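
How a duplicate record (same primary key) is treated depends on the "replace-existing-db-record" setting described in CONFIGURATION FILE. The sketch below illustrates the three policies under stated assumptions. The helper should_replace() is hypothetical, the list of markers is a guess, and 'only-better' is read as "no marker decreased and at least one increased". The authoritative rules are in Statistics::Covid::Datum.

```perl
use strict;
use warnings;

# Markers compared under 'only-better'; this particular list is an assumption.
my @MARKERS = qw(confirmed terminal);

# Hypothetical helper: should $incoming replace $existing (same primary key)?
sub should_replace {
    my ($policy, $existing, $incoming) = @_;
    return 1 if $policy eq 'replace';   # always overwrite
    return 0 if $policy eq 'ignore';    # never touch an existing record
    die "unknown policy '$policy'" unless $policy eq 'only-better';
    # 'only-better': replace only when the stored record is outdated,
    # read here as: no marker decreased and at least one increased.
    my $increased = 0;
    for my $m (@MARKERS) {
        return 0 if $incoming->{$m} < $existing->{$m};
        $increased = 1 if $incoming->{$m} > $existing->{$m};
    }
    return $increased;
}

my $in_db = { confirmed => 7, terminal => 0 };
my $fresh = { confirmed => 8, terminal => 0 };
print should_replace('only-better', $in_db, $fresh) ? "replace\n" : "keep\n";
```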

When the database is up-to-date, analysis of data is the next step.

In the SYNOPSIS it is shown how to select records from the database as an array of Statistics::Covid::Datum objects. Feel free to share any modules you create for analysing this data, either under this namespace (for example Statistics::Covid::Analysis::XYZ) or any other you see fit.

CONFIGURATION FILE

Below is an example configuration file which is essentially JSON with comments. It can be found in t/config-for-t.json relative to the root directory of this distribution.

    # comments are allowed, otherwise it is json
    # this file does not get eval'ed, it is parsed
    # only double quotes! and no excess commas
    {
            # fileparams options
            "fileparams" : {
                    # dir to store datafiles, each DataProvider class
                    # then has its own path to append
                    "datafiles-dir" : "datazz/files"
            },
            # database IO options
            "dbparams" : {
                    # which DB to use: SQLite, MySQL (case sensitive)
                    "dbtype" : "SQLite",
                    # the name of DB
                    # in the case of SQLite, this is a filepath
                    # all non-existing dirs will be created (by module, not by DBI)
                    "dbdir" : "datazz/db",
                    "dbname" : "covid19.sqlite",
                    # how to handle duplicates in DB? (duplicate=have same PrimaryKey)
                    # only-better : replace records in DB if outdated (meaning number of markers is less, e.g. terminal or confirmed)
                    # replace     : force replace irrespective of markers
                    # ignore      : if there is a duplicate in DB DONT REPLACE/DONT INSERT
                    # (see also Statistics::Covid::Datum for up-to-date info)
                    "replace-existing-db-record" : "only-better",
                    # username and password if needed
                    # unfortunately this is in plain text
                    # BE WARNED: do not store your main DB password here!!!!
                    # perhaps create a new user or use SQLite
                    # there is no need for these when using SQLite
                    "hostname" : "", # must be a string (MySQL-related)
                    "port"     : "", # must be a string (MySQL-related)
                    "username" : "", # must be a string
                    "password" : "", # must be a string
                    # options to pass to DBI::connect
                    # see https://metacpan.org/pod/DBI for all options
                    "dbi-connect-params" : {
                            "RaiseError" : 1, # die on error
                            "PrintError" : 0  # do not print errors or warnings
                    }
            }
    }
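
For inspecting such a file outside of this module, one option is the core JSON::PP module, whose relaxed mode accepts shell-style '#' comments. This is a sketch only and is not necessarily how Statistics::Covid parses its configuration. Note that relaxed mode also tolerates trailing commas, which the comments above say are not allowed. The embedded text is a shortened stand-in for the real file.

```perl
use strict;
use warnings;
use JSON::PP;

# A shortened stand-in for the configuration file contents.
my $text = <<'EOC';
# comments are allowed, otherwise it is json
{
        "dbparams" : {
                "dbtype" : "SQLite", # SQLite or MySQL
                "dbdir"  : "datazz/db",
                "dbname" : "covid19.sqlite"
        }
}
EOC

# relaxed() makes the decoder accept '#' comments
my $config = JSON::PP->new->relaxed->decode($text);
print "DB type: $config->{dbparams}{dbtype}\n";   # prints "DB type: SQLite"
```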

DATABASE SUPPORT

SQLite and MySQL database types are supported through the abstraction offered by DBI and DBIx::Class.

However, only the SQLite support has been tested.

Support for MySQL is totally untested.
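
To illustrate how the "dbparams" of the configuration file relate to a DBI connection, here is a minimal sketch that assembles a data source name (DSN) for each database type. The helper dsn_from_dbparams() is hypothetical; how the module itself builds its DSN is not documented here. The DSN formats used are those of DBD::SQLite and DBD::mysql.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: build a DBI data source name from 'dbparams'.
sub dsn_from_dbparams {
    my ($p) = @_;
    if ($p->{dbtype} eq 'SQLite') {
        # for SQLite the database is a file inside 'dbdir'
        my $file = File::Spec->catfile($p->{dbdir}, $p->{dbname});
        return "dbi:SQLite:dbname=$file";
    }
    if ($p->{dbtype} eq 'MySQL') {
        my $dsn = "dbi:mysql:database=$p->{dbname}";
        # host and port are optional, empty strings are skipped
        $dsn .= ";host=$p->{hostname}" if length $p->{hostname};
        $dsn .= ";port=$p->{port}"     if length $p->{port};
        return $dsn;
    }
    die "unsupported dbtype '$p->{dbtype}'";
}

my $dsn = dsn_from_dbparams({
    dbtype => 'SQLite', dbdir => 'datazz/db', dbname => 'covid19.sqlite',
});
print "$dsn\n";
# With DBI and the matching driver installed, one could then connect with:
# my $dbh = DBI->connect($dsn, $username, $password,
#                        { RaiseError => 1, PrintError => 0 });
```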

AUTHOR

Andreas Hadjiprocopis, <bliako at cpan.org>, <andreashad2 at gmail.com>

BENCHMARKS

There are some benchmark tests to time database insertion and retrieval performance. These are optional and will not be run unless explicitly requested via make bench.

These tests do not hit the online data providers at all, and they should not; see ADDITIONAL TESTING for more information on this. They only time the creation of objects and their insertion into the database.

ADDITIONAL TESTING

The DataProviders are not tested by default because that requires network access and hits the providers, which is not fair to them. However, there are targets in the Makefile for initiating the "network" tests by doing make network.

CAVEATS

This module has been put together very quickly and under pressure. There must be quite a few bugs. In addition, the database schema, the class functionality and the attributes are bound to change. A database migration script may accompany new versions so that previously collected and stored data can still be used.

Support for MySQL is totally untested. Please use SQLite for now or test the MySQL interface.

Support for Postgres has been somehow missed but is underway.

BUGS

This module has been put together very quickly and under pressure. There must be quite a few bugs.

Please report any bugs or feature requests to bug-statistics-Covid at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Statistics-Covid. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Statistics::Covid

You can also look for information at:

DEDICATIONS

Almaz

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

Copyright 2020 Andreas Hadjiprocopis.

This program is free software; you can redistribute it and/or modify it under the terms of the Artistic License (2.0). You may obtain a copy of the full license at:

http://www.perlfoundation.org/artistic_license_2_0

Any use, modification, and distribution of the Standard or Modified Versions is governed by this Artistic License. By using, modifying or distributing the Package, you accept this license. Do not use, modify, or distribute the Package, if you do not accept this license.

If your Modified Version has been derived from a Modified Version made by someone other than you, you are nevertheless required to ensure that your Modified Version complies with the requirements of this license.

This license does not grant you the right to use any trademark, service mark, tradename, or logo of the Copyright Holder.

This license includes the non-exclusive, worldwide, free-of-charge patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Package with respect to any patent claims licensable by the Copyright Holder that are necessarily infringed by the Package. If you institute patent litigation (including a cross-claim or counterclaim) against any party alleging that the Package constitutes direct or contributory patent infringement, then this Artistic License to you shall terminate on the date that such litigation is filed.

Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.