[Images: A2009129180500_Composited_map.hdf (left) | A2009130184500L2_map.hdf (right)]
These images show an estimate of chlorophyll levels off the coast of the southeastern U.S., on a log scale from 0.1 to 10, at a resolution of 1 km per pixel. Yellow is high, blue is low. About half of my job description involves making and sorting satellite images like these. I have several thousand of them, dating as far back as late 1997, showing all sorts of things about the ocean just off the coast (another example: sea-surface temperature (SST)). (And just to be clear, these images aren't coming off of military spy satellites or anything impressive like that; they're not classified, and most of the data is freely available off the Internet. Nobody dangerous cares about this kind of scientific data. :) )
Then one day my boss said: "Can you get an estimate of cloud cover?"
And I thought: cloud cover. All I have to do is count up all the pixels representing the clouds, divide by the total number of pixels, and voila - percent cloud cover! Should be simple!
Yeah.
1. Grabbing the pixels to be counted
The first thing I did was to draw a line around the area I wanted to count. In this case, I figured he wanted basically the entire South Atlantic Bight, from Cape Canaveral to Cape Hatteras, and from the shoreline to the edge of the continental shelf. (Aside: the name is a bit misleading, since the area isn't actually in the southern Atlantic Ocean; it just happens to be south of the Mid-Atlantic Bight, which runs from Cape Hatteras to Cape Cod. Ahh, Northeastern USian centrism.)
Getting the offshore side was easy enough, since I had a dotted yellow contour line for 500m depth already, and I could just follow that. Land, however, was a bit trickier, since I wasn't about to try following the shoreline pixel-by-pixel by hand. And inconveniently, the (usually large) negative number representing clouds is the same as the negative number representing land, which means there's no way to separate the two.
On the other hand, the most recent versions of seadas like to guess wildly at what sea surface temperature might be under the clouds. This is officially called "interpolation", but what actually happens is a mess of processing artifacts that look like oddly shaped cold spots in strange places. (There are a few in my SST example above, in the lower right of the image; I picked that particular image at the time for its lack of artifacts, though.) Most of the time this is annoying, since we like nice, solid black clouds that look like obvious clouds, like in the second chlorophyll image up top. But for land-vs.-cloud separation purposes, having clouds pretending to be strangely cold patches of sea worked in my favor.
I picked a reasonably clear SST image, overlapped my line onto land, let seadas fill the whole thing in, and ended up with a blotch of 141,290 coordinate points like so:
(as plotted on a bathymetry image)
Then I fed it to Matlab and told it to strip out all points representing land:
% Keep only the points whose SST isn't the fill value marking land.
fid = fopen('polishedblotch.dat','w');
for i = 1:length(sst)
    if sst(i) ~= -163.835   % -163.835 is the fill value for land pixels
        fprintf(fid, '%6.3f ', blotchtemplate(i, 1));
        fprintf(fid, '%6.3f ', blotchtemplate(i, 2));
        fprintf(fid, '%6.3f\n', blotchtemplate(i, 3));
    end
end
fclose(fid);
... which gave me:
(blue = before, green = after)
2. Grabbing the actual data
Now that I had a blotch with all the coordinates I wanted to count, it was time to get some actual data. So I wrote some IDL:
fnames = FINDFILE('/75-char-filepath/*hdf')   ; list every mapped HDF file
cnt = N_ELEMENTS(fnames)
for ctr = 0, cnt-1 do begin & $
  load, fnames(ctr), ftype = 'MAPPED', prod_name = ['Mapped - chlor_a', 'Mapped - chla'] & $ ; load the mapped chlorophyll band
  out_track, iband=1, ifile = 'polishedblotch.dat', ofile = STRMID(fnames(ctr), 75, 8) + 'blotch_chl.txt', /no_connect & $ ; extract the value at each blotch coordinate
  clear_up & $
endfor
It takes in files with names like A2009130184500L2_map.hdf, and puts out files with names like A2009130blotch_chl.txt. Notice that I'm going for chlorophyll here instead of SST. That's because the chlorophyll product keeps its real clouds instead of interpolating over them. Also, out of all the possible things I could use, chlorophyll is one of the most basic measures; how it's measured and processed by seadas has changed the least over the past decade, so it's the most consistent and extensive collection I have. (Also note: out_track is a seadas-specific function, not a general IDL function.)
Unfortunately, it turns out that IDL doesn't like looping through anything more than about 32,000 times: its default integers are signed 16-bit, which max out at 32,767, so a plain loop counter overflows past that. And my blotch of coordinates contains 125,557 points (that's 125,557 square kilometers in the SAB). I saw suggestions to use "for ctr = 0L" (a long integer) instead of "for ctr = 0", and "fnames[ctr]" instead of "fnames(ctr)", but in the end, I used Matlab to break the file up into four parts:
% Split the 125,557-point blotch into chunks out_track can handle.
blotch1 = polishedblotch(1:32000,:);
blotch2 = polishedblotch(32001:64000,:);
blotch3 = polishedblotch(64001:96000,:);
blotch4 = polishedblotch(96001:end,:);

% Write each chunk back out (shown for the fourth; the others are the same).
fid2 = fopen('polishedblotch4.dat','w');
for i = 1:length(blotch4)
    fprintf(fid2, '%6.3f ', blotch4(i,1));
    fprintf(fid2, '%6.3f ', blotch4(i,2));
    fprintf(fid2, '%6.3f\n', blotch4(i,3));
end
fclose(fid2);
and added some more lines to the IDL script:
out_track, iband=1, ifile = 'polishedblotch2.dat', ofile = STRMID(fnames(ctr), 75, 8) + 'blotch_chl.txt', /no_connect, /append & $
out_track, iband=1, ifile = 'polishedblotch3.dat', ofile = STRMID(fnames(ctr), 75, 8) + 'blotch_chl.txt', /no_connect, /append & $
out_track, iband=1, ifile = 'polishedblotch4.dat', ofile = STRMID(fnames(ctr), 75, 8) + 'blotch_chl.txt', /no_connect, /append & $
... which amounted to the same thing.
3. Putting the data into the database
Now I had a bunch of coordinate points, and I had used them to extract some data. It was time to put all of this into my PostgreSQL database!
Why would I want to do that, you ask? Why not just open it in Matlab again and do some simple division? Well, that would work great if I only had to do a few images, just this once, and then never heard about this particular task again. But since I'll eventually end up doing several thousand of them, spread throughout the decade, the excessive paperwork would eventually kill me.
It had also occurred to me that, as long as I'm extracting all this data to calculate percent cloud cover, why not save it somewhere easily accessible where I can use the data as actual data, too?
So I made a couple of new tables: blotchcoords would hold all the coordinate points, and blotchdata would refer to it (plus several other tables I already had) and contain the actual data.
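(For reference, the two tables look roughly like this. I'm reconstructing the sketch from my insert statements, so the exact column types, and the date/product/satellite tables it points at, are stand-ins:)

create sequence blotchcoords_coords_id_seq;
create table blotchcoords (
    coords_id integer primary key,   -- filled from the sequence above
    latitude  numeric,
    longitude numeric,
    depth     numeric                -- from the bathymetry
);

create sequence blotchdata_blotch_id_seq;
create table blotchdata (
    blotch_id integer primary key,
    value     numeric,               -- the measurement (chlorophyll; -1 = cloud)
    date_id   integer,               -- points at my existing dates table
    prod_id   integer,               -- ... products table
    sat_id    integer,               -- ... satellites table
    coords_id integer references blotchcoords (coords_id)
);

And I wrote some PHP to get it all in. Most of it was to turn seadas's bizarre output format into something that the database could understand. First, to fill blotchcoords: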
# Parse seadas's out_track output and load each point into blotchcoords.
# ($dbconn is an existing pg_connect() handle; $file is the out_track text file.)
$file_handle = fopen($file, 'r');
$counter = 0;
$success = 0;
while ($datablock = fgets($file_handle)) {
    $counter++;
    if (substr($datablock, 7, 5) == '(  1)') {   # only these lines carry a data point
        $lat   = substr($datablock, 32, 6);      # latitude, fixed columns
        $lon   = substr($datablock, 42, 7);      # longitude
        $value = substr($datablock, -21, 8);     # the value itself (here, depth)
        $query = "insert into blotchcoords (coords_id, latitude, longitude, depth) values (nextval('blotchcoords_coords_id_seq'), $lat, $lon, $value)";
        if (pg_query($dbconn, $query)) { $success++; }
    } #endif
} #endwhile
fclose($file_handle);
echo "$counter lines processed, $success lines entered, for $file.\n\n";
It turns out that 125,557 database inserts take a few hours. While I was waiting, I wrote the additions to put the actual data in; they're basically the same, except that the script first has to find the right reference in the coordinate table before doing the database insert.
$findcoord = "select coords_id from blotchcoords where latitude = $lat and longitude = $lon";
$out1=pg_query($dbconn, $findcoord);
$coords = pg_fetch_object($out1,0);
$query = "insert into blotchdata (blotch_id, value, date_id, prod_id, sat_id, coords_id) values (nextval('blotchdata_blotch_id_seq'), $value, $date->date_id, $prod, $sat, $coords->coords_id)";
It turns out that 125,557 reference lookups on top of the database inserts take 12 hours. o.O I've since written up a shortcut: instead of looking up the references, it now just assumes that the coordinates are in the same order as the data, and counts up accordingly. Which is generally bad practice (never assume anything, ever), but in my case the coordinate data all went in in one go without stopping, so I'm pretty safe.
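(Incidentally, the lookups were almost certainly the slow part: without an index on the latitude/longitude pair, each select has to scan all 125,557 rows. An index along these lines, the name is mine, would make each lookup near-instant:)

create index blotchcoords_latlon_idx on blotchcoords (latitude, longitude);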
4. Calculate percent cloud cover
And finally, I could do the actual calculation! Here it is in SQL:
select ((select count(*)::numeric from blotchdata where value = -1 and date_id = 2314)
      / (select count(*)::numeric from blotchdata where date_id = 2314)) * 100
      as percentcloudcover;
(In the case of chlorophyll, clouds are -1.)
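(And since the only thing that changes between runs is the date_id, the same calculation can be grouped to get every day in the table at once; a sketch:)

select date_id,
       100 * sum(case when value = -1 then 1 else 0 end) / count(*)::numeric as percentcloudcover
  from blotchdata
 group by date_id
 order by date_id;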
It turns out that for May 9, 2009 (top left), cloud cover was 5%, and for May 10, 2009 (top right) it was 35%.
:D
THE END
In the end, I managed to use bits of every coding language I know at some point during the past week, and was comfy moving back and forth between them. And I knew how to do the whole thing without asking anyone for help (except for a bit at the end there, to tweak the SQL). For someone who has no formal training in any of this stuff, and who was completely clueless six years ago, I think I'm doing pretty well. :)