med-mastodon.com is one of the many independent Mastodon servers you can use to participate in the fediverse.
Medical community on Mastodon

#datasets

#PublicDomain #books #datasets #Harvard #AI

"The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data... To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006."

dash.harvard.edu/entities/publ

dash.harvard.edu · Institutional Books 1.0: A 242B Token Dataset from Harvard Library's Collections, Refined for Accuracy and Usability
Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project's goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.

Call For Manuscript Submissions - Real-Time GIS For Disaster Management
--
nature.com/collections/bjdhbfi <-- shared link to submission details
--
[note that I have NO affiliation with this journal, the guest editors, etc]
[I wonder if anybody from FEMA has compiled use case / effectiveness / robustness on/of the #WaffleHouseIndex in the southern USA, especially related to hurricanes?]
#GIS #paper #mapping #spatial #manuscripts #callforpapers #callformanuscripts #submissions #callforsubmissions #realtime #disaster #management #mitigation #prevention #preparedness #response #recovery #risk #hazard #naturalhazard #emergency #remotesensing #earthobservation #satellite #drone #sensor #socialmedia #WaffleHouseIndex #datasets #AI #InternetOfThings #research #monitoring #evacuation #planning #resourceallocation #hazardmapping #realworld #global

Ready to supercharge your #OpenScience profile?

With #OpenAIREEXPLORE + @ORCID_Org, you can seamlessly complete your #ORCID record with all your research outputs, from papers & #datasets to #software tools.

Backed by the @OpenAIREGraph, EXPLORE identifies and matches your work, including:

-Journal articles
-Research data
-Software & more

Log in with your ORCID → check what’s missing → sync it to your profile in just a few clicks.

Read the article: explore.openaire.eu

From the Data Rescue Project: the Data Rescue Tracker. “The Data Rescue Tracker is a collaborative tool built to catalog existing public data rescue efforts so that we can coordinate better across initiatives. At this stage, you can use the tool to help reduce duplication of rescue efforts. The Data Rescue Tracker aims to provide a consolidated overview of who is backing up which dataset from […]

https://rbfirehose.com/2025/04/13/the-data-rescue-tracker/

ResearchBuzz: Firehose | Individual posts from ResearchBuzz · The Data Rescue Tracker | ResearchBuzz: Firehose

"Almost two dozen repositories of research and public health data supported by the National Institutes of Health are marked for “review” under the Trump administration’s direction, and researchers and archivists say the data is at risk of being lost forever if the repositories go down.

“The problem with archiving this data is that we can’t,” Lisa Chinn, Head of Research Data Services at the University of Chicago, told 404 Media. Unlike other government datasets or web pages, downloading or otherwise archiving NIH data often requires a Data Use Agreement between a researcher institution and the agency, and those agreements are carefully administered through a disclosure risk review process.

A message appeared at the top of multiple NIH websites last week that says: “This repository is under review for potential modification in compliance with Administration directives.”
Repositories with the message include archives of cancer imagery, Alzheimer’s disease research, sleep studies, HIV databases, and COVID-19 vaccination and mortality data."

404media.co/nih-archives-repos

404 Media · Massive, Unarchivable Datasets of Cancer, Covid, and Alzheimer's Research Could Be Lost Forever
Days before Robert F. Kennedy Jr. announced that 10,000 HHS staffers would lose their jobs, a message appeared on NIH research repository sites saying they were "under review."
#USA #Trump #Datasets

Axios: NOAA research websites slated to go dark get a reprieve. “NOAA has averted the early cancellation of an Amazon Web Services contract that would have caused a slew of agency websites to go dark beginning at midnight, the agency said Friday. Why it matters: The outages mainly would have affected NOAA’s research division, and would have made numerous websites and data sets inaccessible to […]

https://rbfirehose.com/2025/04/06/axios-noaa-research-websites-slated-to-go-dark-get-a-reprieve/

Massive, Unarchivable #Datasets of #Cancer, #Covid, #HIV and #Alzheimer's Research Could Be Lost Forever
Days before RFK announced 10,000 #HHS staffers would lose their jobs, a message appeared on #NIH research repository sites saying they were "under review." Unlike other government datasets or web pages, downloading or otherwise archiving NIH data often requires a Data Use Agreement between a researcher institution and the agency.
404media.co/nih-archives-repos
archive.ph/Y8asq

#ListenBrainz / #MetaBrainz I'm confused. Aren't sponsors the true customer? Why use this? 🤔

On one hand #Music: "Listen together", "Ethical forever"

On the other: #DATASETS

"Some of the world’s biggest platforms such as Google and Amazon, use our data"

"We ask commercial supporters to support us in order to help fund the creation and maintenance of these datasets."

"The following organizations make use of the data-sets published by MetaBrainz"

"Unicorn tier: #Google, #Amazon, #Spotify"

STAT: Gold-standard maternal mortality database in limbo as CDC staff placed on leave. “As part of the sweeping layoffs that rocked the Department of Health and Human Services on Tuesday, the entire staff that oversaw an annual survey to better understand infant and maternal health — and that was considered the gold standard in the field — was placed on administrative leave. The Pregnancy […]

https://rbfirehose.com/2025/04/02/stat-gold-standard-maternal-mortality-database-in-limbo-as-cdc-staff-placed-on-leave/

From handling massive #DataSets to streamlining delivery, UC Berkeley #Library is ensuring that #ResearchData is well-managed, accessible, and compliant with licensing agreements through #Dataverse, so resources are discoverable and usable by the entire university community. #RDM #DataManagement youtu.be/XVBUna3wzgk?si=c_Ixa-

This data may vanish under Trump, so we charted it
Some of most valuable #datasets in human history vanished from #US #government websites, felt like watching Library of Alexandria go up in smoke
Many have gone on record describing #Census Bureau’s #American Community Survey as wonder of modern world
Another loss? #HouseholdPulse survey, online survey that provided week-by-week data on income losses, economic struggles and precarious mental health
washingtonpost.com/business/20
archive.ph/mB512

The Washington Post · This data may vanish under Trump, so we charted it
By Andrew Van Dam

"On Friday, numerous essential #datasets were #purged from federal agency websites, including #data from #CDC PLACES (Population Level Analysis and Community Estimates), the Social Vulnerability Index (SVI), and the Climate and Economic Justice Screening Tool (CEJST)—to name just a few. While we don’t know when or if this data will return, we want to assure you that they are still accessible on our platform." policymap.com/blog/purged-fede #PolicyMap #PublicHealth #USPol #Project2025 #CivilRights

PolicyMap · Purged Federal Agency Data Available on PolicyMap
On Friday, numerous essential datasets were purged from federal agency websites, including data from CDC PLACES (Population Level Analysis and Community Estimates), the Social Vulnerability Index (SVI), and the Climate...

The Physical Sciences Data Infrastructure (PSDI) aims to simplify data management for researchers by integrating and enhancing existing #infrastructures.

It will enable seamless access to high-quality #data from both commercial and open sources, allowing researchers to combine #datasets, share software, models, and experimental or simulation data.