Rob and Ralph
Thx for the explanation. The list is regarded as the most important member benefit. Few of our colleagues realize the work in it, but they see the value added. Great job. Thank you.
It's great to know that folks are still paying close attention to this ongoing effort. Just as background on this, see http://www.sciencedirect.com/science/article/pii/S1871553213006026.

Let me address the first quibble. The MD versus NJ tag was auto-generated by the logic routine I wrote to scan the article and try to guess where the incident may have happened. This is an exceedingly difficult task to achieve from an artificial intelligence (and hack programmer) perspective. Consider:

1. Many on-line news sources do not use bylines (and they call themselves news!). And even when they do, they often omit the state, as this one clearly did.
2. In fact, many times, the news site does not clearly identify which state the source itself is even in. In this case it does, but it's not immediately obvious.
3. Neither "New Jersey" nor "NJ" appears anywhere in the article itself.
4. "Maryland" appears twice in the article.

The logic routine starts out looking at the byline. If a state is found, the article is assigned to that state (although other possible states will appear as alternates for Ralph to select). Then we look for the state in the text itself (it only looks at the text that Ralph has highlighted to include with the article headline). We also check to see where the domain name is from (relying on a built-in list of over 7,000 web/print outlets and 2,200 TV/radio outlets in the US plus a selection of influential international media outlets that I've collected over the years). If that's no good, we rely on the country name found in the text itself (if there is one, meaning it's an international incident).
If that's no good, it looks for a state abbreviation in the domain name, such as http://www.calepa.ca.gov (that .ca is not to be confused with Canada's country code). If that's no good, we check the top-level domain (TLD) for a country reference such as .ca for Canada or .fr for France; there are 230 TLDs.

It turns out that the Burlington County Times was not among the 122 media outlets listed for NJ. It is now, and the routine now auto-tags the article with New Jersey. I can also add city names to force state tags. For example, Chicago will usually force a tag to IL unless there is something better in the hierarchy. You have to be careful with that one, as some city names are fairly common - it would be pointless to try to use Springfield as a location tag trigger, for example. I have Dublin in there and it will trigger both California and Pennsylvania as well as Ireland as location suggestions. "Joint Base McGuire" is pretty definitive so I threw it into the list of city name triggers, too.

So when Ralph runs the script (with the addition of the new web site and city name) it returns the following location assignments. Note that #2 and #3 would not have come up without the changes I just made:

1. State: Maryland
2. Joint Base McGuire suggests us_NJ
3. The domain name is from New Jersey
4. The text contains Netherlands (indeed it does!)

As Ralph had two choices of Maryland and Netherlands, I'll give him a pass on the mis-assignment here. He has to manually select the portions of each article he tags and does thousands upon thousands of these, so yeah, once in a while his caffeine-deprived eyes are not going to catch that. Which prompts a really big shout-out to Ralph for his thrice-weekly scanning through dozens of articles while most of us are still in bed so we have these summaries to read on MWF mornings!

BTW, characterizing articles using algorithms also takes a lot of careful thought.
For example, if you have to classify the article into Discovery/Fire/Explosion/Release/Followup, you can't just look for the keyword "explosion" because the article might actually say "however, there was no explosion". To handle that I use regular expressions (regex); this one construction takes out most references that are false positives for an actual explosion. As you read this mess, recognize that ? means the character (or parenthetical construction) before the ? is optional, and | is a logical OR operator. The expressions for injury, death, and fire are 2-3x longer than this one!

noWant = new RegExp("(not the result of|did not result in) an explosion|not explode|kept from exploding|(possibly|potentially|potential) explosive|fears of an explosion|risk of (an )?explosion|potential (for|to cause an) explosion|like an explosion|(could|might) cause an explosion|in case of explosion|(can|could|might) explode|(prone|susceptible|vulnerable) to explosion|in the event of (an )?explosion|controlled (explosion|detonation)|safely (exploded|detonated)|not (high |concentrated )?enough to cause an explosion|(can|could) (have )?(lead|led) to (an )?explo|blast phone message|explosive (chemical|material)|can be detonat|could have caused an explosion|rule(d)? out a(n| chemical) explosion|prevent(ed)? an explosion|blasting cap(s)?","gi");

I'll mention one more short trigger there for followup because that's a hard one with false positives. First we look for language such as "last year", but that's usually not enough.
So then we look for text containing a month and year that is not current (for example "March 2015 accident") because that kind of construction means it's probably a report on an earlier incident; other triggers are "Chemical Safety Board", "in the wake of", etc.:

if (textSelected.search(/last (week|month|year)|(weeks|months|years) ago|(\bin|last|early|late|the) (January|February|March|April|May|June|July|August|September|October|November|December)( 20\d\d)?|\bin (2007|2008|2009|2010|2011|2012|2013|2014|2015)|the (\w)*( year)? anniversary|month investigation|investigatory findings|in the wake of the|final report|Chemical Safety Board/i) != -1) {
  typeResult = 4;
}

All that said, the system works remarkably well. I add minor tweaks every few months at most now, and Ralph reports that the system has been working surprisingly well considering how little maintenance is now required. Then again, with so many articles over the years now, we've tweaked the algorithm enough times that it recognizes most stuff we throw at it. But it still takes Ralph to scan the articles, highlight the pertinent text for us, and manually review/tweak the tag suggestions. Uploading to Pinboard and composing the emails is essentially automated, but he still has to deal with Listserv problems, bounce messages and more. I cannot overstate Ralph's time and mental commitment to this project, as well as to the DCHAS community (and beyond).

As to quibble 2 - that's the news media. Par for the course.

Rob Toreki

======================================================
Safety Emporium - Lab & Safety Supplies featuring brand names
you know and trust. Visit us at http://www.SafetyEmporium.com
esales**At_Symbol_Here**safetyemporium.com or toll-free: (866) 326-5412
Fax: (856) 553-6154, PO Box 1003, Blackwood, NJ 08012
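
For readers who want to see the shape of the two ideas in the message above, here is a minimal sketch: an ordered list of location checks, and a keyword test that first strips false-positive phrases. It is an illustration under stated assumptions, not the actual script. The function names, the tiny lookup tables, and the domain nj-outlet.example are all hypothetical. The position of the city-name trigger in the order is a guess based on the sample output, and the final TLD check is omitted.

```javascript
// Hypothetical sketch: every table, name, and domain below is an assumption
// made for illustration, not the original script's data.
const STATE_NAMES = { "Maryland": "us_MD", "New Jersey": "us_NJ" };
const CITY_TRIGGERS = { "Joint Base McGuire": "us_NJ", "Chicago": "us_IL" };
const OUTLET_DOMAINS = new Map([["nj-outlet.example", "us_NJ"]]); // made-up domain
const COUNTRY_NAMES = ["Netherlands", "Ireland"];

// Collect location suggestions, strongest evidence first, without duplicates.
function suggestLocations(byline, text, domain) {
  const tags = [];
  const add = (tag) => { if (tag && !tags.includes(tag)) tags.push(tag); };

  // 1. A state named in the byline.
  for (const [name, tag] of Object.entries(STATE_NAMES)) {
    if (byline.includes(name)) add(tag);
  }
  // 2. A state named in the highlighted text.
  for (const [name, tag] of Object.entries(STATE_NAMES)) {
    if (text.includes(name)) add(tag);
  }
  // 3. City or place-name triggers (their place in the order is assumed).
  for (const [name, tag] of Object.entries(CITY_TRIGGERS)) {
    if (text.includes(name)) add(tag);
  }
  // 4. The outlet's domain, looked up in the list of known media outlets.
  add(OUTLET_DOMAINS.get(domain));
  // 5. A country named in the text.
  for (const name of COUNTRY_NAMES) {
    if (text.includes(name)) add(name);
  }
  // 6. A state abbreviation inside a .gov domain, e.g. calepa.ca.gov.
  const gov = domain.match(/\.([a-z]{2})\.gov$/i);
  if (gov) add("us_" + gov[1].toUpperCase());
  return tags;
}

// Shortened exclusion pattern using a few alternatives from the message above.
const noWantShort = /(not the result of|did not result in) an explosion|not explode|risk of (an )?explosion|controlled (explosion|detonation)|prevent(ed)? an explosion/gi;

// Strip known false-positive phrases first, then look for the keyword.
function mentionsExplosion(text) {
  return /explosion|explode|detonat/i.test(text.replace(noWantShort, " "));
}
```

With no byline, a text that mentions Maryland, Joint Base McGuire, and the Netherlands still puts Maryland first in the suggestions, which is how the original mis-tag could arise; the place-name trigger is what adds New Jersey as an alternate.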
On Jul 8, 2016, at 6:46 PM, rosera**At_Symbol_Here**COMCAST.NET wrote:

A couple of quibbles regarding the classifications associated with two of the Chemical Safety Headlines:

FINAL PREPARATIONS UNDERWAY TO DESTROY CHEMICAL MUNITIONS FOUND ON JOINT BASE
Tags: us_MD, industrial, discovery, environmental, mustard_gas, phosgene

Note that while the destruction & disposal team is from Maryland, Fort Dix/Lakehurst where the chemical munitions were discovered & will be destroyed prior to disposal is most definitely in New Jersey.

AIRBORNE HAZMAT LEAK AT EDISON PLANT THURSDAY MORNING
Tags: us_NJ, industrial, release, injury, dye

First, the article states it was a "whitening pigment", not a dye. From my personal knowledge of operations at this plant, it is undoubtedly titanium dioxide. The real chemical released, however, was almost certainly titanium tetrachloride, which when airborne forms titanium dioxide (visible) and hydrogen chloride gas (not so visible, but MUCH more hazardous). Interesting that the article (and the company spokesman?) conveniently neglects to mention this!

Richard Rosera
Rosearray EHS Services LLC

From: "Ralph Stuart" <ras2047**At_Symbol_Here**MED.CORNELL.EDU>
To: DCHAS-L**At_Symbol_Here**MED.CORNELL.EDU
Sent: Friday, July 8, 2016 10:12:19 AM
Subject: [DCHAS-L] Chemical Safety headlines from Google (14 articles)

Chemical Safety Headlines From Google
Friday, July 8, 2016 at 8:06:24 AM

A membership benefit of the ACS Division of Chemical Health and Safety
All article summaries and tags are archived at http://pinboard.in/u:dchas

Table of Contents (14 articles)

MH FIREFIGHTERS CLEAN UP CHEMICAL SPILL THURSDAY
Tags: us_AR, industrial, release, response, hydrochloric_acid

HAZMAT CREWS RESPOND TO STRONG ODOR ON TEMPLE UNIVERSITY CAMPUS
Tags: us_PA, laboratory, release, response, unknown_chemical

AIRBORNE HAZMAT LEAK AT EDISON PLANT THURSDAY MORNING
Tags: us_NJ, industrial, release, injury, dye

HAZMAT INCIDENT IN MONROE CONTAINED
Tags: us_NY, public, discovery, response, nitric_acid, sodium

FINAL PREPARATIONS UNDERWAY TO DESTROY CHEMICAL MUNITIONS FOUND ON JOINT BASE
Tags: us_MD, industrial, discovery, environmental, mustard_gas, phosgene

DUPONT ORDERED TO PAY MILLIONS OVER TOXIC CHEMICAL EXPOSURE
Tags: us_OH, public, discovery, environmental, toxics

EXPLOSION INJURES THREE IN HARTFORD
Tags: us_VT, public, explosion, injury, other_chemical

NSF EMPLOYEE ASKS FOR INVESTIGATION OF ROTATING WORKER SYSTEM
Tags: public, discovery, environmental

FDA REQUESTS SAFETY DATA ON HAND SANITIZERS
Tags: industrial, discovery, environmental, drugs, ethanol

REPORT: CHEMICAL SAFETY BOARD MUST CONDUCT MORE INVESTIGATIONS
Tags: industrial, discovery, environmental

6 NEW JERSEY FIREFIGHTERS EXPOSED TO CHEMICALS, HOSPITALIZED
Tags: us_NJ, industrial, release, injury, unknown_chemical

SPRAIN REOPENS AFTER CHEMICAL SPILL
Tags: us_NY, transportation, release, injury, hydrochloric_acid

NO ONE INJURED IN EXPLOSION AT INDUSTRIAL BUILDING
Tags: us_FL, industrial, explosion, response, unknown_chemical

HAWAII LAB EXPLOSION CAUSED BY STATIC DISCHARGE
Tags: us_HI, laboratory, follow-up, injury, biodiesel, hydrogen

(snip)