| Title: | Data Sets for Courses at the Münster School of Business |
|---|---|
| Description: | Provides sample data sets that are used in statistics and data science courses at the Münster School of Business. The datasets cover various business topics as well as other domains such as sports and traffic. |
| Authors: | Michael Bücker [aut, cre] (ORCID: <https://orcid.org/0000-0003-0045-8460>), Niels Schlüsener [aut] |
| Maintainer: | Michael Bücker <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.0 |
| Built: | 2026-05-14 05:13:51 UTC |
| Source: | https://github.com/mchlbckr/msbstatsdata |
Official finisher results for the Berlin Marathon from 1999 to 2019, including cumulative split times at 5 km intervals and at the half-marathon mark.
berlin_marathon
A tibble with 678,711 rows and 18 variables:
Race year.
Race name.
Overall finishing place in the published result list.
Runner first name.
Runner last name.
Three-letter nationality code.
Club or country/team label as published in the source data.
Runner gender ("M" or "W" in the source data).
Official marathon finishing time.
Cumulative time at 5 km.
Cumulative time at 10 km.
Cumulative time at 15 km.
Cumulative time at 20 km.
Cumulative half-marathon time.
Cumulative time at 25 km.
Cumulative time at 30 km.
Cumulative time at 35 km.
Cumulative time at 40 km.
Data files from Andrew Miller's marathon-results repository:
https://github.com/AndrewMillerOnline/marathon-results.
Berlin Marathon files used for this dataset: https://github.com/AndrewMillerOnline/marathon-results/tree/main/Berlin.
Annual revenues of two beverage manufacturers in thousand EUR.
beverage_revenues
A tibble with 4 rows and 3 variables:
Calendar year.
Revenue of Spritzi in thousand EUR.
Revenue of Prickli in thousand EUR.
The dataset contains Borussia Dortmund's final Bundesliga ranking for each season from 1988 to 2022.
bvb_rankings
A tibble with 35 rows and 2 variables:
Calendar year of the season endpoint.
Final Bundesliga ranking (1 = best rank).
The dataset contains counts of preferred car brands by occupation.
car_occupation
A tibble with 4 rows and 6 variables:
Occupation category.
Count of Audi drivers.
Count of BMW drivers.
Count of Opel drivers.
Count of VW drivers.
Row total.
The dataset contains one row per surveyed person with occupation and car brand.
car_occupation_ind
A tibble with 500 rows and 2 variables:
Occupation category.
Car brand category.
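The link between an individual-level table such as car_occupation_ind and a count table such as car_occupation can be illustrated with a small cross-tabulation. The sketch below is written in Python and uses a handful of made-up (occupation, brand) pairs; the category labels, the values, and the function name `cross_tabulate` are placeholders for illustration and are not taken from the package.

```python
from collections import Counter

# Hypothetical individual-level records: one (occupation, brand) pair per person.
# These values are illustrative only and do not come from the package data.
records = [
    ("employee", "Audi"),
    ("employee", "VW"),
    ("worker", "Opel"),
    ("worker", "VW"),
    ("employee", "VW"),
]


def cross_tabulate(pairs):
    """Return {row: {col: count, ..., "total": row_total}} for (row, col) pairs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = {}
    for r in rows:
        # Counter returns 0 for combinations that never occur.
        row = {c: counts[(r, c)] for c in cols}
        row["total"] = sum(row.values())
        table[r] = row
    return table


table = cross_tabulate(records)
for occupation, row in table.items():
    print(occupation, row)
```

Combinations that never occur still appear with a count of zero, which keeps the table rectangular.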
The dataset contains service times (in seconds) of 50 consecutive customers at a supermarket checkout.
checkout_times
A tibble with 50 rows and 1 variable:
Observed service time in seconds.
Probability function for group size in cinema visits.
cinema_group_size_pmf
A tibble with 5 rows and 2 variables:
Number of persons per group.
Probability mass.
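A probability function of this shape (one row per value, one probability per row) is typically used to compute an expected value and a variance. The following Python sketch shows the calculation under the assumption of made-up probabilities; the numbers and the helper name `pmf_mean_variance` are illustrative and are not the values stored in the dataset.

```python
# Hypothetical probability function: group size -> probability.
# The probabilities are invented for illustration and must sum to 1.
pmf = {1: 0.1, 2: 0.4, 3: 0.2, 4: 0.2, 5: 0.1}


def pmf_mean_variance(pmf, tol=1e-9):
    """Return (expected value, variance) of a discrete probability function."""
    total = sum(pmf.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"probabilities sum to {total}, expected 1")
    mean = sum(x * p for x, p in pmf.items())
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, variance


mean, variance = pmf_mean_variance(pmf)
print(f"E[X] = {mean:.2f}, Var[X] = {variance:.2f}")
```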
The dataset contains an absolute frequency distribution of cinema visitors per day recorded over 100 days.
cinema_visitors
A tibble with 11 rows and 2 variables:
Number of visitors counted on a day.
Number of days with the respective visitor count.
Expanded version of cinema_visitors with one row per observed day.
cinema_visitors_ind
A tibble with 100 rows and 1 variable:
Number of visitors counted on the day.
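The expansion from a frequency table to one row per observation can be sketched as follows. The example is in Python and uses a tiny invented frequency table; the pairs and the function name `expand_frequencies` are assumptions for illustration, not the contents of cinema_visitors.

```python
from collections import Counter

# Hypothetical frequency table: (observed value, absolute frequency).
frequency_table = [(0, 2), (1, 3), (2, 1)]


def expand_frequencies(table):
    """Repeat each value as often as its absolute frequency indicates."""
    return [value for value, count in table for _ in range(count)]


expanded = expand_frequencies(frequency_table)
print(expanded)

# Counting the expanded values recovers the original frequency table.
recovered = sorted(Counter(expanded).items())
print(recovered)
```

The number of expanded rows equals the sum of the frequencies, which is why a table recorded over 100 days expands to 100 rows.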
Ten observed monthly cold rents in EUR from one neighborhood.
cold_rents
A tibble with 10 rows and 1 variable:
Monthly cold rent in EUR.
Sample of four companies used in introductory tasks on variables and scales.
company_financials
A tibble with 4 rows and 5 variables:
Company name.
Number of employees.
Annual revenue in million EUR.
Equity share in percent.
Rating category.
Two discrete probability distributions for monthly sales volumes.
computer_sales_country
A tibble with 7 rows and 3 variables:
Monthly number of sold units.
Probability in country A.
Probability in country B.
Two independent reviewers rated seven companies on a 1-10 scale.
credit_ratings_two_reviewers
A tibble with 7 rows and 3 variables:
Company identifier.
Score from reviewer A.
Score from reviewer B.
Party-level second-vote results by German federal state, taken from the official result table of the 2021 Bundestag election.
de_elections
A tibble with 377 rows and 6 variables:
Election year.
Two-digit federal state code.
Federal state name.
Party label in the official result file.
Number of second votes.
Second-vote share in percent.
Monthly Harmonised Index of Consumer Prices (HICP) sub-indices for selected energy categories in Germany (index base 2015 = 100).
de_energy_prices
A tibble with 480 rows and 4 variables:
Month of observation.
Energy category series label.
COICOP classification code.
Price index value (2015 = 100).
The decathlon for men is a combined event in athletics consisting of 10 track and field events: 100 metres, 400 metres, 1500 metres, 110 metre hurdles, long jump, high jump, pole vault, discus throw, javelin throw, and shot put.
decathlon
A tibble with 7,958 rows and 10 variables:
dbl result of 100m race in seconds
dbl result of long jump in meters
dbl result of shot put in meters
dbl result of high jump in meters
dbl result of 400m race in seconds
dbl result of 110m hurdles race in seconds
dbl result of discus throw in meters
dbl result of pole vault in meters
dbl result of javelin throw in meters
dbl result of 1500m race in seconds
Covariances between the same three disciplines used in decathlon_3disc_summary.
decathlon_3disc_covariances
A tibble with 3 rows and 2 variables:
Discipline pair label.
Covariance for the pair.
Mean and variance for 100m, long jump, and shot put from a large decathlon sample.
decathlon_3disc_summary
A tibble with 3 rows and 3 variables:
Discipline name.
Arithmetic mean of results.
Variance of results.
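Variances from decathlon_3disc_summary and a covariance from decathlon_3disc_covariances are the ingredients of a Pearson correlation, r = cov(X, Y) / sqrt(var(X) · var(Y)). A minimal Python sketch of that formula is shown below; the numeric inputs are invented for illustration and are not the values stored in either dataset.

```python
import math


def correlation_from_moments(cov_xy, var_x, var_y):
    """Pearson correlation computed from a covariance and two variances."""
    if var_x <= 0 or var_y <= 0:
        raise ValueError("variances must be positive")
    return cov_xy / math.sqrt(var_x * var_y)


# Hypothetical moments for two disciplines (not taken from the package).
var_100m = 0.04        # variance of 100m times
var_long_jump = 0.09   # variance of long jump distances
cov_pair = -0.03       # covariance of the two disciplines

r = correlation_from_moments(cov_pair, var_100m, var_long_jump)
print(f"r = {r:.2f}")
```

A negative sign would be plausible for this pair, since faster sprinters (lower times) tend to jump farther, but the actual sign and size depend on the real data.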
The dataset contains responses of 10 sampled persons from an EU-wide survey on environmental protection.
env_survey_eu10
A tibble with 10 rows and 4 variables:
Age of respondent.
Gender of respondent.
Monthly net income in EUR.
Attitude towards environmental protection.
Study days and exam points of five participants.
exam_study_time
A tibble with 5 rows and 3 variables:
Study time in days.
Achieved points.
Maximum achievable points.
The dataset contains a contingency table on sentencing outcomes by defendant skin color.
florida_sentencing
A tibble with 3 rows and 4 variables:
Defendant skin color and total row.
Count of death penalty sentences.
Count of other sentences.
Row total.
The dataset contains one row per case with defendant skin color and sentence category.
florida_sentencing_ind
A tibble with 326 rows and 2 variables:
Defendant skin color.
Sentence category.
Ten observed prices for regular gasoline (EUR per liter) from a motorway section.
gas_prices
A tibble with 10 rows and 1 variable:
Observed price in EUR per liter.
The dataset contains human resources data of a footwear company. Each row represents one employee, described by six attributes.
hr_data
A tibble with 1,200 rows and 6 variables:
Position of the employee in the company.
Contracted working hours of the employee per week.
Monthly salary of the employee.
Hourly compensation of the employee.
Department in which the employee is employed.
Sick days of the employee in the period observed.
Symbolic probability model for overdue weeks.
library_overdue_model
A tibble with 5 rows and 2 variables:
Overdue weeks.
Probability expression as text.
Frequency sample of overdue weeks used for method-of-moments exercises.
library_overdue_sample
A tibble with 5 rows and 2 variables:
Overdue weeks.
Observed frequency.
The dataset contains product, marketing, and sales data for 235 shoes of a footwear company. Each row represents one shoe, described by a total of 14 variables.
marketing_expenses
A tibble with 235 rows and 14 variables:
Expenses for marketing activities for the shoe.
Estimated number of customers reached by the marketing activities for the shoe.
Number of negative reactions to the marketing activities of the shoe.
Retail price of the shoe.
Price segment of the shoe.
Number of sizes in which the shoe is available.
Gender the shoe is intended for.
Average product rating of the test customers for the shoe.
Average product rating of the real customers for the shoe.
Color in which the shoe is sold most often.
Rate at which the shoe is returned by customers.
Number of sales for the shoe.
Ranks of the attribute 'rating_customers', used to calculate Spearman's rank correlation coefficient.
Ranks of the attribute 'price_segment', used to calculate Spearman's rank correlation coefficient.
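Rank variables like the two above exist because Spearman's coefficient is the Pearson correlation of the ranks, with tied values sharing their average rank. The Python sketch below illustrates that idea on invented numbers; the sample values and the helper names `average_ranks`, `pearson`, and `spearman` are assumptions for illustration and do not reproduce how the package computed its rank columns.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: an ordinal price segment code and a customer rating.
price_segment = [1, 2, 2, 3, 1]
rating = [4.5, 3.8, 4.0, 3.1, 4.2]

print(average_ranks(price_segment))
print(round(spearman(price_segment, rating), 3))
```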
Hourly totals derived from 15-minute open data published by the City of Muenster for available bike counting stations in 2025.
ms_bike_hourly_2025
A tibble with 210,240 rows and 4 variables:
Station identifier.
Readable station name.
Hour timestamp (local time, Europe/Berlin).
Total counted bikes in the hour.
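Deriving hourly totals from 15-minute counts amounts to truncating each timestamp to the full hour and summing per station and hour. The Python sketch below shows one way this could be done; the station identifiers, readings, and the function name `hourly_totals` are invented for illustration, and the example deliberately ignores time zone and daylight saving handling, which the real data would require.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical 15-minute readings: (station id, interval start, bikes counted).
readings = [
    ("station_a", datetime(2025, 1, 1, 8, 0), 3),
    ("station_a", datetime(2025, 1, 1, 8, 15), 5),
    ("station_a", datetime(2025, 1, 1, 8, 30), 4),
    ("station_a", datetime(2025, 1, 1, 8, 45), 6),
    ("station_a", datetime(2025, 1, 1, 9, 0), 2),
    ("station_b", datetime(2025, 1, 1, 8, 0), 1),
]


def hourly_totals(rows):
    """Sum counts per (station, hour), where hour is the truncated timestamp."""
    totals = defaultdict(int)
    for station, timestamp, count in rows:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[(station, hour)] += count
    return dict(totals)


totals = hourly_totals(readings)
for (station, hour), count in sorted(totals.items()):
    print(station, hour.isoformat(), count)
```

Hours without any reading do not appear in the result of this sketch; a complete hourly grid would need those gaps filled explicitly.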
Citywide hourly totals aggregated across available bike counting stations in Muenster for 2025.
ms_bike_hourly_2025_city
A tibble with 8,760 rows and 2 variables:
Hour timestamp (local time, Europe/Berlin).
Total counted bikes across all available stations.
Metadata for selected public bike counting stations in Muenster.
ms_bike_sites
A tibble with 24 rows and 2 variables:
Station identifier used in the source repository.
Readable station name.
Annual population counts for NUTS-3 districts in North Rhine-Westphalia (NRW), Germany.
nrw_population
A tibble with 318 rows and 5 variables:
Reference year.
NUTS-3 district code.
District name.
District type (urban or rural district).
Total resident population.
Order quantities per transaction from a small ordering sample.
order_quantities
A tibble with 20 rows and 1 variable:
Ordered units per order.
Two small populations used for comparing dispersion measures.
populations_i_ii
A tibble with 4 rows and 2 variables:
Values of population I.
Values of population II.
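A comparison of dispersion measures for two small populations might look like the Python sketch below. The two value lists are invented so that both populations share the same mean but differ in spread; they are not the values stored in populations_i_ii, and the function name `dispersion` is a placeholder.

```python
import statistics

# Hypothetical populations with equal means but different spread.
population_i = [4, 5, 5, 6]
population_ii = [1, 3, 7, 9]


def dispersion(values):
    """Return common dispersion measures, treating values as a full population."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),  # divides by n, not n - 1
        "std_dev": statistics.pstdev(values),
    }


summary_i = dispersion(population_i)
summary_ii = dispersion(population_ii)
print("I: ", summary_i)
print("II:", summary_ii)
```

The population variants `pvariance` and `pstdev` are used because the values are treated as complete populations rather than samples.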
Demand measured across test phases with varying selling price and product quality.
product_demand_testphases
A tibble with 6 rows and 3 variables:
Selling price in EUR.
Product quality score.
Observed demand.
Property prices per square meter and distance to the nearest suburban rail station.
property_prices_distance
A tibble with 5 rows and 2 variables:
Distance to station in km.
Price in EUR per square meter.
Monthly cold rent and living area for five apartments.
rent_living_area
A tibble with 5 rows and 2 variables:
Living area in square meters.
Monthly cold rent in EUR.
The dataset contains monthly spending of a company on research and advertising (in thousand EUR).
research_ads
A tibble with 12 rows and 3 variables:
Month abbreviation.
Research spending in thousand EUR.
Advertising spending in thousand EUR.
Dataset containing the average salary of a footwear company's employees over 10 years.
salary_trends
A tibble with 10 rows and 2 variables:
Year of record.
Average salary in the corresponding year.
Five annual closing prices for security A and security B.
security_prices_ab
A tibble with 5 rows and 3 variables:
Time index (1 to 5).
Closing price of security A in EUR.
Closing price of security B in EUR (year 5 unknown = NA).
Dataset recording whether an error occurred during shipment and the delivery type requested by the customer.
shipping_errors
A tibble with 576 rows and 2 variables:
Indicator whether or not an error appeared during shipment.
Shipment method requested by the customer.
The dataset contains throughput times of a machine recorded in order to compare them to the manufacturer's specifications.
sorting_times
A tibble with 60 rows and 1 variable:
Recorded time (in seconds) required by the machine for sorting individual parts.
The dataset contains counts of sport activity frequencies by occupation.
sport_occupation
A tibble with 5 rows and 4 variables:
Occupation category.
Count with no sport activity.
Count with occasional sport activity.
Count with regular sport activity.
The dataset contains one row per surveyed person with occupation and sport activity frequency.
sport_occupation_ind
A tibble with 1,000 rows and 2 variables:
Occupation category.
Sport activity frequency category.
The dataset contains grouped TV viewing counts by gender.
tv_gender
A tibble with 3 rows and 4 variables:
Grouped weekly TV viewing time and total row.
Count of men.
Count of women.
Row total.
The dataset contains one row per person with gender and grouped weekly TV viewing time.
tv_gender_ind
A tibble with 320 rows and 2 variables:
Gender of the surveyed person.
Grouped weekly TV viewing time.
The dataset contains a contingency table on member satisfaction with a new agreement for two unions (plus totals).
union_satisfaction
A tibble with 3 rows and 6 variables:
Union identifier and total row.
Count of very satisfied respondents.
Count of satisfied respondents.
Count of dissatisfied respondents.
Count of very dissatisfied respondents.
Row total.
The dataset contains the same information as union_satisfaction, expanded to individual-level observations with one row per surveyed person.
union_satisfaction_ind
A tibble with 1,000 rows and 2 variables:
Union identifier.
Satisfaction category for the agreement.
Probability function for weekly units sold of utility vehicles.
utility_vehicle_sales_weekly
A tibble with 6 rows and 2 variables:
Number of vehicles sold per week.
Probability mass.
Estimated coefficients from a multiple linear regression for vehicle range.
vehicle_range_coefficients
A tibble with 4 rows and 2 variables:
Regressor name.
Estimated coefficient.
The dataset contains grouped dwell-time frequencies from a sample of 100 website users.
website_dwell
A tibble with 5 rows and 2 variables:
Interval of dwell time in minutes.
Number of persons in the interval.
Expanded version of website_dwell with one row per sampled person.
website_dwell_ind
A tibble with 100 rows and 1 variable:
Interval of dwell time in minutes for the person.
Probability function for hourly number of cars serviced in a repair shop.
workshop_cars_per_hour
A tibble with 3 rows and 2 variables:
Number of cars serviced per hour.
Probability mass.
Seven observations with a symmetric quadratic relationship between x and y.
xy_quadratic
A tibble with 7 rows and 3 variables:
Observation index.
Value of X.
Value of Y.
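A symmetric quadratic relationship is a standard illustration of why a Pearson correlation near zero does not imply independence. The Python sketch below uses seven invented points with y = x² on a grid symmetric around zero; these values are an assumption for illustration and may differ from the values stored in xy_quadratic.

```python
import math

# Hypothetical observations: x symmetric around zero, y fully determined by x.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]


def pearson(a, b):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


r = pearson(x, y)
print(f"Pearson r = {r:.3f}")
```

The positive and negative halves of the parabola cancel in the covariance, so the linear correlation vanishes even though y is an exact function of x.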