13 June 2009

A Week in the Life of a Desk Jockey Oceanography Tech


A2009129180500_Composited_map.hdf

A2009130184500L2_map.hdf


These images show an estimate of chlorophyll levels off the coast of the southeastern U.S., on a log scale from 0.1 to 10, at a resolution of 1 km per pixel. Yellow is high, blue is low. About half of my job description involves making and sorting satellite images like these. I have several thousand of them, dating as far back as late 1997, showing all sorts of things about the ocean just off the coast (another example: sea-surface temperature (SST)). (And just to be clear, these images aren't coming off of military spy satellites or anything impressive like that, they're not classified, most of the data is freely available off the Internet. Nobody dangerous cares about this kind of scientific data. :) )

Then one day my boss said: "Can you get an estimate of cloud cover?"

And I thought: cloud cover. All I have to do is count up all the pixels representing the clouds, divide by the total number of pixels, and voila - percent cloud cover! Should be simple!

Yeah.

1. Grabbing the pixels to be counted

The first thing I did was to draw a line around the area I wanted to count. In this case, I figured he wanted basically the entire South Atlantic Bight (from Cape Canaveral to Cape Hatteras (aside: the name is a bit misleading, since the area isn't actually in the southern Atlantic Ocean, it just happens to be south of the Mid-Atlantic Bight (Cape Hatteras to Cape Cod). Ahh, Northeastern USian centrism.)) - from the shoreline to the edge of the continental shelf.

Getting the offshore side was easy enough, since I had a dotted yellow contour line for 500m depth already, and I could just follow that. Land, however, was a bit trickier, since I wasn't about to try following the shoreline pixel-by-pixel by hand. And inconveniently, the (usually large) negative number representing clouds is the same as the negative number representing land, which means there's no way to separate the two.

On the other hand, the most recent versions of seadas like to guess wildly at what sea surface temperature might be under the clouds. This is officially called "interpolation" - but what actually happens is a mess of processing artifacts that look like oddly shaped cold spots in strange places. (There's a few in my above SST example, lower right of the image; I picked that particular image at the time for its lack of the artifacts, though.) Most of the time it's annoying, as we like nice, solid black clouds that look like obvious clouds, like in the second chlorophyll image up top. But for land vs. cloud separation purposes, having clouds pretending to be strangely cold patches of the sea worked in my favor.

I picked a reasonably clear SST image, overlapped my line onto land, let seadas fill the whole thing in, and ended up with a blotch of 141,290 coordinate points like so:

(as plotted on a bathymetry image)

Then I fed it to Matlab and told it to strip out all points representing land:

fid = fopen('polishedblotch.dat','w');
for i = 1:length(sst)
if sst(i) ~= -163.835
fprintf(fid, '%6.3f ',blotchtemplate(i, 1));
fprintf(fid, '%6.3f ',blotchtemplate(i, 2));
fprintf(fid, '%6.3f\n',blotchtemplate(i, 3));
end
end
fclose(fid);

... which gave me:

(blue = before, green = after)

2. Grabbing the actual data

Now that I had a blotch with all the coordinates I wanted to count, it was time to get some actual data. So I wrote some IDL:

fnames=FINDFILE('/75-char-filepath/*hdf')
cnt = N_ELEMENTS(fnames)

for ctr = 0, cnt-1 do begin & $
load, fnames(ctr), ftype = 'MAPPED', prod_name = ['Mapped - chlor_a', 'Mapped - chla'] & $
out_track, iband=1, ifile = 'polishedblotch.dat', ofile = STRMID(fnames(ctr), 75, 8) + 'blotch_chl.txt', /no_connect & $
clear_up & $
endfor

It takes in files with names like A2009130184500L2_map.hdf, and puts out files with names like A2009130blotch_chl.txt. Notice that I'm going for chlorophyll here instead of SST. This is because chlorophyll has real clouds. Also, out of all the possible things I could use, chlorophyll is one of the most basic measures, how it's measured and processed by seadas has changed the least in the past decade, and therefore it's the most consistent and extensive collection I have. (Also note: out_track is a seadas-specific function, not a general IDL function.)

Unfortunately, it turns out that IDL doesn't like looping through anything more than 32,000 times. And my blotch of coordinates contains 125,557 points (that's 125,557 square kilometers in the SAB). I saw suggestions to use "for ctr = 0L" instead of "for ctr = 0", and "fnames[ctr]" instead of "fnames(ctr)", but in the end, I used Matlab to break up the file into four parts:

blotch1 = polishedblotch(1:32000,:);
blotch2 = polishedblotch(32001:64000,:);
blotch3 = polishedblotch(64001:96000,:);
blotch4 = polishedblotch(96001:end,:);

fid2 = fopen('polishedblotch4.dat','w');
for i=1:length(blotch4)
fprintf(fid2, '%6.3f ',blotch4(i,1));
fprintf(fid2, '%6.3f ',blotch4(i,2));
fprintf(fid2, '%6.3f\n',blotch4(i,3));
end
fclose(fid2);

and added some more lines to the IDL script:

out_track, iband=1, ifile = 'polishedblotch2.dat', ofile = STRMID(fnames(ctr), 75, 8) + 'blotch_chl.txt', /no_connect, /append & $
out_track, iband=1, ifile = 'polishedblotch3.dat', ofile = STRMID(fnames(ctr), 75, 8) + 'blotch_chl.txt', /no_connect, /append & $
out_track, iband=1, ifile = 'polishedblotch4.dat', ofile = STRMID(fnames(ctr), 75, 8) + 'blotch_chl.txt', /no_connect, /append & $

... which amounted to the same thing.

3. Putting the data into the database

Now I had a bunch of coordinate points, and I had used it to extract some data. It was time to put all this into my PostgreSQL database!

Why would I want to do that, you ask? Why not just open it in Matlab again and do some simple division? Well, that would work great if all I had to do was a few images, this was the only time I'd have to do it, and then I'd never hear about this particular task again. However, if I'll eventually end up having to do several thousand of them spread throughout the decade, the excessive paperwork would eventually kill me.

It had also occurred to me that, as long as I'm extracting all this data to calculate percent cloud cover, why not save it somewhere easily accessible where I can use the data as actual data, too?

So I made a couple new tables. Blotchcoords would hold all the coordinate points, and then blotchdata would refer to it (plus several other tables I already had) and contain the actual data. And I wrote some PHP to get it all in. Most of it was to turn seadas's bizarre output format into something that the database could understand. First, to fill blotchcoords:

$counter = 0;
$success = 0;
while($datablock = fgets($file_handle)){
$counter++;
if (substr($datablock, 7, 5) == '( 1)') {
$lat = substr($datablock, 32, 6);
$lon = substr($datablock, 42, 7);
$value = substr($datablock, -21, 8);
$query = "insert into blotchcoords (coords_id, latitude, longitude, depth) values (nextval('blotchcoords_coords_id_seq'), $lat, $lon, $value)";
if (pg_query($dbconn, $query)) {$success++;}
} #endif
} #endwhile
echo "$counter lines processed, $success lines entered, for $file.\n\n";

It turns out that 125,557 database inserts takes a few hours. While I was waiting, I wrote the additions to put the actual data in; it's basically the same, except that the script first has to find the right reference in the coordinate table before doing the database insert.

$findcoord = "select coords_id from blotchcoords where latitude = $lat and longitude = $lon";
$out1=pg_query($dbconn, $findcoord);
$coords = pg_fetch_object($out1,0);
$query = "insert into blotchdata (blotch_id, value, date_id, prod_id, sat_id, coords_id) values (nextval('blotchdata_blotch_id_seq'), $value, $date->date_id, $prod, $sat, $coords->coords_id)";

It turns out that 125,557 reference lookups in addition to database inserts takes 12 hours. o.O I've since written up a shortcut; instead of looking up the references, now it just assumes that the coordinates are in the same order as the data, and counts up accordingly. Which is generally bad practice - never assume anything, ever - but in my case the coordinates data all went in in one go without stopping, so I'm pretty safe.

4. Calculate percent cloud cover

And finally, I could do the actual calculation! Here it is in SQL:

select ((select count(*)::numeric from blotchdata where value = -1 and date_id = 2314)/(select count(*)::numeric from blotchdata where date_id = 2314))*100 AS percentcloudcover;

(In the case of chlorophyll, clouds are -1.)

It turns out that for May 9, 2009 (top left), cloud cover was 5%, and for May 10, 2009 (top right) it was 35%.

:D

THE END

In the end, I've managed to use bits of every coding language I know at some point during the past week - and was comfy moving back and forth between them. And knew how to do the whole thing without asking for help from anyone (except for a bit at the end there to tweak the SQL). For someone who has no formal training in any of this stuff, and who was completely clueless six years ago, I think I'm doing pretty good. :)

5 comments:

vince said...

First, let me say that the geek in me absolutely loved this post. Actual science combined with actual programming - what more can a person want? :-)

Second, I think you're doing very well. While I have a computer science background, almost everything I do - from computer repair to programming to web design - is self-taught. The computer world changes rapidly, and while some things you learn are always applicable (how to design an algorithm, how to troubleshoot a problem be it hardware or software) the tools, the hardware and the software - especially the programming languages, are constantly changing. If you're an independent computer geek, you're constantly learning on your own.

Even for something like PHP, I'm continuously having to learn new things as it evolves in to a true object-oriented language.

Great post.

Random Michelle K said...

Apropos of nothing, has anyone mention that those pictures look kinda like mammograms?

Dr. Phil (Physics) said...

I'm sure some meteorological geek would have something to say, but I found your solution to be quite lovely -- and as usual, just a bit (!!) more complicated than the person asking the question thought it would be.

Ex-cellent.

Dr. Phil

MWT said...

Vince: Glad you enjoyed. I definitely wrote it with all the coding geeks in mind. ;)

Michelle: Jeri beat you to it the last time I posted some of these up. ;)

Janiece: Yep. ;)

I do a lot of muttering to myself, ranting at the computer, and pacing up and down the halls. It's why stuffing me into a cubicle in a centralized lobby would be bad bad bad... But the ending is nice, when it all comes together and I think I'm awesome for a few minutes. :D

MWT said...

Dr. Phil: just a bit!! o.O

But all of them are like that, and I've come to enjoy the challenge, so it's all good. ;)