Wednesday, April 25, 2012

Tested software agent server with Apache Solr trunk (4.x)

Last week I made some enhancements to the new Base Technology Software Agent Server and tested it accessing the recently released Apache Solr 3.6 enterprise search platform. My testing was quite simple: one test added a few documents to a Solr collection, and the other performed a few queries of that collection, all via HTTP, using XML to send data and receive results.
 
Earlier this week I downloaded the latest "nightly trunk build" for the next generation of Solr, referred to simply as "Solr trunk" or "4.x". My tests from Solr 3.6 worked fine except for one test case that checked the raw XML text, and there was one nuance of difference there: in 3.6 a zero-result query generates an XML empty-element tag for the "result" element, but in Solr 4.x a start tag and a separate end tag are generated. No big deal.
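A minimal sketch of why a raw-text comparison trips over this difference while a parsed comparison does not. The two response fragments are illustrative stand-ins rather than captured Solr output, and the helper names are hypothetical:

```python
import xml.etree.ElementTree as ET

# Illustrative fragments only: the attribute values are assumptions,
# not output captured from a real Solr server.
EMPTY_ELEMENT_STYLE = '<response><result name="response" numFound="0" start="0"/></response>'
START_END_STYLE = '<response><result name="response" numFound="0" start="0"></result></response>'


def canonical(xml_text):
    """Parse and re-serialize so equivalent documents yield identical text."""
    return ET.tostring(ET.fromstring(xml_text), encoding="unicode")


def same_xml(a, b):
    """Compare two XML documents by parsed content, not by raw characters."""
    return canonical(a) == canonical(b)


def num_found(xml_text):
    """Read the hit count from the "result" element of a response."""
    return int(ET.fromstring(xml_text).find("result").get("numFound"))
```

A test that checks `num_found(...)` or `same_xml(...)` instead of the raw text would pass against both forms.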
 
As alluded to last week, I added an option to control "writing" to the web (HTTP POST). Write access is disabled by default, which is safest. You need to set the "implicitly_deny_web_write_access" property to "false" in the agentserver.properties file in order to send documents to Solr from an agent running in the software agent server. This is not needed if you are simply trying to query an already indexed document collection, which is most of what I was interested in anyway. Having the ability for an agent to actually add documents to Solr was simply an added benefit.
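As a rough illustration of the default-deny behavior, here is a sketch that reads a properties-style text and decides whether web writes are permitted. Only the property name and its "false" value come from the description above; the `key=value` line syntax and both function names are assumptions, and this is not the agent server's actual code:

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments (assumed syntax)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def web_write_allowed(props):
    """Writes are allowed only when the deny property is explicitly "false"."""
    # A missing property is treated as "true", i.e. writes stay disabled.
    value = props.get("implicitly_deny_web_write_access", "true")
    return value.lower() == "false"
```

With this logic a missing or misspelled property leaves write access off, which matches the safest-by-default intent.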

Thursday, April 19, 2012

Tested software agent server with Solr 3.6

I just ran a couple of simple tests to see how well the Base Technology software agent server could connect to Apache Solr 3.6 (open source enterprise search platform) which was just released last week. I did have to make a few changes to the agent server code, to add support for the HTTP POST verb and to permit HTTP GET to bypass the web page cache manager of the agent server.
 
Originally, I was going to access Solr via the SolrJ interface (Solr for Java), but I figured I would start with direct HTTP access to see how bad it would be. It wasn't so bad at all. I may still add support for SolrJ, but one downside is that it wouldn't be subject to the same administrative web access controls that normal HTTP access is. I'll have to think about it some more, but I could probably encapsulate the various SolrJ methods as if they were the comparable HTTP access verbs (GET for query, POST for adding documents, etc.) so that the administrative controls would work just as well with SolrJ. At least that's the theory.
 
For now, at least I verified that a software agent can easily add documents to and query a Solr server running Solr 3.6.
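For a sense of what the direct HTTP approach involves, here is a sketch that builds the XML body for adding documents and the URL for a query. It assumes a Solr instance at `http://localhost:8983/solr` (the usual address of the example setup) and deliberately stops short of sending anything; the function names are hypothetical:

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape, quoteattr

# Assumption: base URL of a locally running Solr example instance.
SOLR_BASE = "http://localhost:8983/solr"


def build_add_xml(docs):
    """Build a Solr XML <add> message from a list of field dictionaries."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append("<field name=%s>%s</field>" % (quoteattr(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def build_query_url(query):
    """Build the GET URL for a query against the select handler."""
    return SOLR_BASE + "/select?" + urlencode({"q": query})
```

The add message would be sent with an HTTP POST to the update handler (`SOLR_BASE + "/update"`, content type `text/xml`), and the query URL fetched with a plain HTTP GET.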
 
The code changes are already up on GitHub.
 
I do need to add a new option, "enable_writable_web", which permits agents to do more than just GET from the web. I had held off on implementing POST since it is one thing to permit agents to read from the web, but permitting them to write to the web is a big step that adds some risk for rogue and buggy agents. For example, with one POST command you can delete all documents from a Solr server. Powerful, yes, dangerous, also yes.
 
I also need to make "enable_writable_web" a per-user and even per-agent option so that an agent server administrator can allow only some users or agents to have write access to the web. There will probably be two global settings for the server, one for the default for all users, and one which controls whether any users can ever have write access to the web. The goal is to make the agent server as safe as possible by default, but to allow convenient access when needed and acceptable as well.
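One possible way to combine those settings, sketched under assumptions: the function and parameter names are hypothetical, and the rule that a per-agent setting can only narrow what its user is allowed is my own reading of "as safe as possible by default", not a settled design:

```python
from typing import Optional


def can_write_web(server_allows_any: bool,
                  server_default: bool,
                  user_setting: Optional[bool] = None,
                  agent_setting: Optional[bool] = None) -> bool:
    """Decide whether an agent may write to the web (hypothetical policy)."""
    # Global kill switch: no user or agent can ever override this.
    if not server_allows_any:
        return False
    # A user with no explicit setting inherits the server-wide default.
    user_allowed = server_default if user_setting is None else user_setting
    if not user_allowed:
        return False
    # An agent inherits from its user unless explicitly restricted.
    return True if agent_setting is None else agent_setting
```

For example, `can_write_web(True, False, user_setting=True)` grants access to one user on a server whose default is deny, while `can_write_web(False, True, True, True)` is always denied.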
 
Unfortunately, after all of that, it turns out that Solr has a "stream.body" feature that allows documents to be added and deleted using an HTTP GET verb. Oh well, that's life. You can't cover all bases all of the time.
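If the agent server ever wanted to close that particular gap, one option would be to inspect GET URLs for the "stream.body" parameter before allowing them. A sketch of such a check follows; the function name is hypothetical and the agent server does not necessarily do anything like this:

```python
from urllib.parse import parse_qs, urlsplit


def get_carries_stream_body(url):
    """Return True if a GET URL smuggles an update body via stream.body."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return "stream.body" in query
```

A read-only policy could then treat any GET for which this returns True the same way it treats a POST. This only covers the one parameter named above, so it narrows the gap rather than closing it for every server an agent might talk to.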

Wednesday, April 4, 2012

Boolean vs. binary

I do need to add one important caveat to my comments about the diminished value of binary in modern computer science, which is to draw a distinction between boolean logic and binary coding. Without a doubt, the AND, OR, and NOT operations of boolean logic are critical and absolutely essential to even the most modern of computer programming languages. But, boolean logic in modern computer software is about the two values "true" and "false" rather than the numeric encodings of "1" and "0." Besides, I doubt that anyone thinks of any typical data format as a sequence of "boolean" values.
 
Put another way, boolean logic provided the theoretical foundation for computing upon which the binary implementations of digital computers have been based. But, the boolean values of modern computer software are in no way dependent on the binary implementation of integers, floating point, character codes, etc.

-- Jack Krupansky

Monday, April 2, 2012

Yes, we still need hex, but I still can't see much need for binary

A reader commented that there are still places where even web developers need hex, namely for color codes. I agree. There are some character codes needed as well, although in HTML/XML they are typically in decimal rather than hex. But, clearly, hex is still needed.
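To make the two cases concrete, here is a small sketch: a hex color code split into its red, green, and blue values, and the same character written as a decimal and as a hex character reference. The function name is made up for the example:

```python
import html


def parse_hex_color(code):
    """Convert a color code such as "#FF8800" into an (r, g, b) tuple."""
    digits = code.lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected six hex digits: %r" % code)
    # Each pair of hex digits is one channel in the range 0..255.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


# The copyright sign as a decimal and as a hex character reference.
DECIMAL_REFERENCE = html.unescape("&#169;")
HEX_REFERENCE = html.unescape("&#xA9;")
```

Here `parse_hex_color("#FF8800")` gives `(255, 136, 0)`, and both references decode to the same character. At no point does anyone have to look at a 0 or a 1.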
 
So, I think the reader proved my point that it is hex that is important, not so much binary itself. Yes, students need to be able to count from 0 to F, but how often do they need to know the actual bit encoding for "C" or "C136E9ABA6D8428DB935DF7BD587C0E6"? And, sure, some people do need to know about the bit-level details of Base64 encoding or cryptography and codecs, but how many out of every 1,000 software developers ever need to use actual 0 and 1 binary?
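For the curious, the bit encoding is a mechanical lookup of four bits per hex digit, which is part of why it rarely needs to be done by hand. A small sketch, with a hypothetical helper name:

```python
def hex_to_bits(hex_text):
    """Expand each hex digit into its four-bit binary form."""
    return "".join(format(int(digit, 16), "04b") for digit in hex_text)


LONG_HEX = "C136E9ABA6D8428DB935DF7BD587C0E6"
```

`hex_to_bits("C")` is `"1100"`, and the 32 hex digits of the longer value expand to 128 bits, which nobody would want to read or type.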
 
I can't even remember the last time I needed the "&" (bitwise "AND") operator to "test" a bit or "mask" a bit-field. Probably not in the last 10 years. Not even 15 years. Maybe it was 20 years ago.
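For readers who have never had to do it, this is roughly what "testing" a bit and "masking" a bit-field look like. The flag values and function names are invented for the example:

```python
# Hypothetical permission flags, one bit each.
FLAG_READ = 0b001
FLAG_WRITE = 0b010


def has_flag(value, flag):
    """Test a bit: true if the flag's bit is set in value."""
    return (value & flag) != 0


def extract_field(value, shift, width):
    """Mask a bit-field: take `width` bits starting `shift` bits from the right."""
    return (value >> shift) & ((1 << width) - 1)
```

For instance, `has_flag(0b011, FLAG_WRITE)` is true, and `extract_field(0xFF8800, 8, 8)` pulls the middle byte `0x88` out of a packed color value.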
 
So, colors or character codes in hex, yes, but where can I find anybody using binary these days, other than in hardware or hardware interfaces and drivers?
 
Just to be clear, there are and will be an elite few computer scientists and advanced practitioners who really do need to be able to work and think at "the bit level", but their numbers are dwindling, I think.
 
A related question is whether the vast majority of "modern" software developers even need to know about "shifting" or "rotating" bits.
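As a reminder of what those operations are, here is a sketch of a left shift and an 8-bit left rotate. The function name and the choice of an 8-bit width are just for illustration:

```python
def rotate_left_8(value, count):
    """Rotate an 8-bit value left; bits shifted off the top re-enter at the bottom."""
    count %= 8
    return ((value << count) | (value >> (8 - count))) & 0xFF


# A plain left shift multiplies by a power of two: 1 shifted by 4 is 16.
SHIFTED = 1 << 4
```

Rotating `0b10010110` left by one gives `0b00101101`: the top bit wraps around to the bottom instead of being lost, which is the only difference from a shift.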
 
Again, all of this said, we may be stuck with the binary mentality until there is some major computational advance on the order of quantum computing or Ray Kurzweil's Singularity.

Do computer science students need to know about binary anymore?

Lately I have seen a couple of cultural references to "binary" and "the 0's and 1's of computers/digital data", but just this morning I realized that it has been a very long time since I needed to know much at all about "binary" data. Sure, I need to know how many "bits" are used for a character or integer or float value, but that is mostly simply to know its range of values, not the linear 0/1 quality of the specific values. Sure, I've looked at hex data semi-frequently (e.g., a SHA), but even then the 0/1 aspect of "binary" is completely deemphasized and we might as well be working with hex-based computers as binary-based computers.
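The "range of values" point can be stated in a few lines. This sketch assumes the usual unsigned and two's-complement signed integer conventions, and the function name is made up:

```python
def value_range(bits, signed=False):
    """Return the (lowest, highest) integer representable in the given bit count."""
    if signed:
        # Two's complement: one more negative value than positive.
        return (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (0, 2 ** bits - 1)
```

So `value_range(8)` is `(0, 255)` and `value_range(16, signed=True)` is `(-32768, 32767)`. The bit count matters only as an input to that arithmetic.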
 
Sure, hardware engineers still need to know about binary data.
 
And, on rare occasion, software developers do find use for "bit" fields, but even though that technique relies on the compact storage of binary values, I'm sure a hex-based machine could implement bit fields virtually as efficiently. In any case, bit fields don't strictly depend on the computer being binary-based. How many "web programmers" or "database programmers" or even "Java programmers" need even a rudimentary comprehension of "binary" data as opposed to range of values?
 
Besides, when data is "serialized" or "encoded", the nature of the original or destination machine implementation or storage scheme is completely irrelevant. Sure, we use 8-bit and 16-bit "encodings", but those are really 256-value or 65,536-value encodings, or 1-byte vs. 2-byte. And the distinction would certainly be irrelevant if the underlying computer had 256-value or 65,536-value computing units.
 
Granted, software designers designing character encoding schemes (or audio or other media encoding) need to "lay out the bits", but so few people are doing that these days. It seems a supreme waste of time and energy and resources to focus your average software professional on "the 1's and 0's of binary."
 
My hunch is that "binary" and "1's and 0's" will stick with us until the point where the underlying hardware implementation shifts from 1-bit binary to hex or byte-based units (or even double-byte units), and then maybe another 5 to 10 years after that transition, if not longer. After all, we still "dial" phone numbers even though it has probably been 25 years or more since any of us had a phone "dial" in front of us, and certainly the younger generations never had that experience.

-- Jack Krupansky