MegaRhyme's Wikinews Dataset

11/30/2022
[Wikinews dataset logo]

Overview

We extracted data from 21,403 articles on the English Wikinews site. As of September 30, 2022, that is roughly 98% of all English Wikinews articles. The data has been placed in a single JSON file, which can be downloaded from this page.

Structure of the data

The JSON file contains an array; each item in the array holds the data for one article. Each item follows the structure shown below.

{
    "title": "Example title",
    "text": "Example article body\n\nParagraphs are separated by two newline characters.",
    "date": "yyyy-mm-dd",
    "categories": ["example_categories", "are all", "lower case"]
}
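As a quick sketch of how records in this format can be consumed with Python's standard library, here is a self-contained example. The inline sample record is illustrative only; just its structure matches the dataset, and the real file is the one on the Download section of this page.

```python
import json

# A minimal sample mimicking one element of the dataset's array.
# In practice you would use json.load() on the unzipped download instead.
sample = """[
  {
    "title": "Example title",
    "text": "Example article body\\n\\nParagraphs are separated by two newline characters.",
    "date": "2022-09-30",
    "categories": ["example_categories", "are all", "lower case"]
  }
]"""

articles = json.loads(sample)
article = articles[0]

# Paragraphs are separated by two newline characters, so splitting
# on "\n\n" recovers them individually.
paragraphs = article["text"].split("\n\n")

print(article["title"])        # Example title
print(len(paragraphs))         # 2
print(article["categories"])   # ['example_categories', 'are all', 'lower case']
```

Since the whole dataset is one top-level array, a single `json.load()` call returns the complete list of article dicts.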

Download

megarhyme-wikinews.json.zip

download size: 16.5 MB
unzipped size: 47.4 MB

The data at a glance

Extracting information from unstructured data doesn't always work perfectly. In some cases, data points are missing from the source altogether. For the vast majority of articles, however, data points were obtained for all fields. Below is a rough overview.
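A minimal sketch of how such a coverage overview could be computed yourself, assuming the JSON file has been loaded into a list of dicts. The two inline records are illustrative stand-ins, not real dataset entries:

```python
from collections import Counter

# The four fields every complete record should carry.
EXPECTED_FIELDS = ("title", "text", "date", "categories")

# Stand-in records; in practice this list comes from json.load()
# on the unzipped megarhyme-wikinews.json file.
articles = [
    {"title": "A", "text": "body", "date": "2022-01-01", "categories": ["x"]},
    {"title": "B", "text": "body"},  # date and categories missing
]

# Count, per field, how many records are missing or empty.
missing = Counter()
for item in articles:
    for field in EXPECTED_FIELDS:
        if not item.get(field):
            missing[field] += 1

print(dict(missing))  # {'date': 1, 'categories': 1}
```

Dividing each count by `len(articles)` turns the tallies into the per-field coverage percentages described above.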

License

The Wikinews Dataset

licensed: Creative Commons Attribution 2.5 License

source: Wikimedia Dumps

author: various


The Wikinews Dataset Logo

licensed: Attribution-Share Alike 3.0 Unported License

source: Wikimedia Commons

author: Odder