Examples in parentheses:
- Key-value store (Amazon DynamoDB)
- Column-family store (Cassandra)
- Document database (MongoDB)
- Graph database (Neo4j)
Dr. Michael Duffy of Iowa State University publishes them online. Next time I attempt working out the economics of alternative fuels, I should start there…
German animator Kaleb Lechowski recently released a short, but very well made, computer-generated sci-fi film. According to the author, the six-and-a-half-minute film took seven months to make, including two months spent on writing the script. The entire film crew consisted of three people: voice actor David Masterson, sound engineer Hartmut Zelle, and Kaleb himself, who did everything else.
It appears that the mythical man-month just took on a new meaning: one minute of computer-generated screen time…
Step 1: Find out which version of Java you’re running. The easy way to do this is through the Java Control Panel, if you can find it. Start by bringing up the Windows Control Panel (in Windows XP and Windows 7, choose Start, Control Panel; in Windows 8, right-click in the lower-left corner of the screen and choose Control Panel). If you see a Java icon, click on it. If you don’t see a Java icon (or link), in the upper-right corner, type Java. If you then see a Java icon, click on it.
Unfortunately, there’s a bug in at least one of the recent Java installers that keeps the Java icon from being displayed inside Windows Control Panel. If you can’t find the Java icon, go to C:\Program Files (x86)\Java\jre7\bin or C:\Program Files\Java\jre7\bin and double-click on the file called javacpl.exe. One way or another, you should now see the Java Control Panel.
Step 2: Make sure you have Java Version 7 Update 11. In the Java Control Panel, under About, click the About button. The About Java dialog shows you the version number; if you’ve patched Java in the past few months, it’s likely Version 7 Update 9, 10, or 11. (Don’t be surprised if Java says that it’s set to update automatically, but doesn’t. I’ve seen that on several of my machines.) If you don’t have Java 7 Update 11, go to Java’s download site, and install the latest update. You have to restart your browser for the new Java version to kick in. Personally, I also reboot Windows.
Warning: Oracle, bless its pointed little pointy thingies, frequently tries to install additional garbage on your machine when you use its update site. Watch what you click.
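If you would rather check from a command prompt than hunt for the Control Panel, here is a minimal Python sketch (not part of the steps above) that parses the version line printed by `java -version`. It assumes the Java 7 era format, `java version "1.7.0_11"`; the function names are made up for illustration.

```python
import re

def parse_java_version(text):
    """Return (major, update) from a line like: java version "1.7.0_11".

    Returns None if the text does not match the assumed 1.x.y_zz format.
    """
    match = re.search(r'"1\.(\d+)\.\d+(?:_(\d+))?"', text)
    if match is None:
        return None
    major = int(match.group(1))
    update = int(match.group(2) or 0)  # no "_NN" suffix means update 0
    return major, update

def is_patched(text, required=(7, 11)):
    """True if the parsed version is at least Version 7 Update 11."""
    version = parse_java_version(text)
    return version is not None and version >= required

# Example: a machine still on Update 9 is flagged as unpatched.
print(is_patched('java version "1.7.0_09"'))
print(is_patched('java version "1.7.0_11"'))
```

Paste in the first line that `java -version` prints on your machine; anything that comes back False means a trip to the download site.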
Step 3: Decide if you want to turn off Java in all of your browsers. That’s certainly the safest choice, but some people have to use Java in their browsers from time to time. Personally, I don’t disable Java in all of my browsers (more about that in a moment).
Step 4: To turn off the Java Runtime in all of your browsers, from inside the Java Control Panel, click or tap on the Security tab, then deselect the box marked Enable Java Content in the Browser. Click or tap OK, and restart your browsers (or better yet, reboot). From that point on, the Java Runtime should be disabled in all of your browsers, all of the time. To bring Java back, repeat the steps and select the box marked Enable Java Content in the Browser (the setting should, in fact, say “Enable Java Content in All of Your Browsers”).
Step 5: If you don’t want to turn off Java in all of your browsers, choose the one browser you wish to leave Java-enabled. For me, that’s an easy choice: By default, recent versions of Chrome prompt before running Java on a specific page, so I turn off Java in all of my browsers except Chrome. That way I can use any of my browsers for general Internet work without fear of getting Javanicked. If I absolutely have to go to a website that requires Java, I’ll fire up Chrome specifically for that purpose.
Step 6: If you haven’t turned off Java in all of your browsers, turn off Java in each of your selected Java-free browsers. In Internet Explorer 9 or 10, click on the gear icon in the upper-right corner and choose Manage Add-Ons. Scroll down to the bottom and, under Oracle America, Inc., select each of the entries in turn; they’ll probably say “Java(tm) Plug-In SSV Helper” or some such. In the lower-right corner, click the button marked Disable. Restart IE. At the bottom of the screen, you’ll see a notice that says, “The ‘Java(tm) Plug-In SSV Helper’ add-on from ‘Oracle America, Inc.’ is ready to use.” Click Don’t Enable. If you get a second notice about a Java add-on, click Don’t Enable on it, too. That should permanently disable Java Runtime in IE.
In any recent version of Firefox, click the Firefox tab in the upper-left corner and choose Add-Ons. You should see an add-on for Java(TM) Platform SE 7 U11. Click once on the entry, and click Disable. Restart Firefox.
In Chrome, type chrome://plugins in the address bar and press Enter. You should see an entry that says something like “Java (2 files) – Version: 10.7.2.11”. Click on that entry and click the link that says Disable. Restart Chrome.
Step 7: Test. Make sure the browsers are/aren’t running Java, according to your wishes, by running each of them up against the Java test site. If you go to that site using Google Chrome, there better be a big yellow band at the top of your screen asking permission to run Java just this once.
Selectively disabling Java in your browsers isn’t particularly easy, but it’s a worthwhile step that everyone — absolutely everyone — should undertake. Right now.
Paul Krugman continues to ponder the impact of technology on income distribution. The takeaway, in my opinion, is here:
Smart machines may make higher GDP possible, but also reduce the demand for people — including smart people. So we could be looking at a society that grows ever richer, but in which all the gains in wealth accrue to whoever owns the robots.
A troubling prospect indeed…
…and, apparently, so did Paul Krugman:
Robots mean that labor costs don’t matter much, so you might as well locate in advanced countries with large markets and good infrastructure (which may soon not include us, but that’s another issue). On the other hand, it’s not good news for workers!
This is an old concern in economics; it’s “capital-biased technological change”, which tends to shift the distribution of income away from workers to the owners of capital.
Twenty years ago, when I was writing about globalization and inequality, capital bias didn’t look like a big issue; the major changes in income distribution had been among workers (when you include hedge fund managers and CEOs among the workers), rather than between labor and capital. So the academic literature focused almost exclusively on “skill bias”, supposedly explaining the rising college premium.
But the college premium hasn’t risen for a while. What has happened, on the other hand, is a notable shift in income away from labor:
If this is the wave of the future, it makes nonsense of just about all the conventional wisdom on reducing inequality. Better education won’t do much to reduce inequality if the big rewards simply go to those with the most assets. Creating an “opportunity society”, or whatever it is the likes of Paul Ryan etc. are selling this week, won’t do much if the most important asset you can have in life is, well, lots of assets inherited from your parents. And so on.
I think our eyes have been averted from the capital/labor dimension of inequality, for several reasons. It didn’t seem crucial back in the 1990s, and not enough people (me included!) have looked up to notice that things have changed. It has echoes of old-fashioned Marxism — which shouldn’t be a reason to ignore facts, but too often is. And it has really uncomfortable implications.
But I think we’d better start paying attention to those implications.
Indeed. And the case for some sort of redistributionist policy is getting stronger…
Mina Naguib is chasing down a weird network malfunction:
At AdGear Technologies Inc. where I work, ssh is king. We use it for management, monitoring, deployments, log file harvesting, even real-time event streaming. It’s solid, reliable, has all the predictability of a native unix tool, and just works.
Until one day, random cron emails started flowing about it not working.
Move over Jonathan Kellerman…
By Andrew Oliver
Created 2012-08-02 03:00AM
I’ve been in Chicago for the last few weeks setting up our first satellite office for my company. While Silicon Valley may be the home of big data vendors, Chicago is the home of the big data users and practitioners. So many people here “get it” that you could go to a packed meetup or big data event nearly every day of the week.
Big data events almost inevitably offer an introduction to NoSQL and why you can’t just keep everything in an RDBMS anymore. Right off the bat, much of your audience is in unfamiliar territory. There are several types of NoSQL databases and rational reasons to use them in different situations for different datasets. It’s much more complicated than tech industry marketing nonsense like “NoSQL = scale.”
Part of the reason there are so many different types of NoSQL databases lies in the CAP theorem, aka Brewer’s Theorem. The CAP theorem states you can provide only two out of the following three characteristics: consistency, availability, and partition tolerance. Different datasets and different runtime rules cause you to make different trade-offs. Different database technologies focus on different trade-offs. The complexity of the data and the scalability of the system also come into play.
By Brian Proffit
Created 2012-06-21 03:00AM
The lure of using big data for your business is a strong one, and there is no brighter lure these days than Apache Hadoop, the scalable data storage platform that lies at the heart of many big data solutions.
But as attractive as Hadoop is, there is still a steep learning curve involved in understanding what role Hadoop can play for an organization, and how best to deploy it.
By understanding what Hadoop can, and can’t do, you can get a clearer picture of how it can best be implemented in your own data center or cloud. From there, best practices can be laid out for a Hadoop deployment.
What Hadoop can’t do
We’re not going to spend a lot of time on what Hadoop is, since that’s well covered in documentation and media sources. It’s important to know the two major components of Hadoop: the Hadoop Distributed File System (HDFS) for storage, and the MapReduce framework that lets you perform batch analysis on whatever data you have stored within Hadoop. That data, notably, does not have to be structured, which makes Hadoop ideal for analyzing and working with data from sources like social media, documents, and graphs: anything that can’t easily fit within rows and columns.
That’s not to say you can’t use Hadoop for structured data. In fact, there are many solutions that take advantage of the relatively low storage expense per TB of Hadoop to simply store structured data there instead of in an RDBMS (relational database management system). But if your storage needs are not all that great, then shifting data back and forth between Hadoop and an RDBMS would be overkill.
One area you would not want to use Hadoop for is transactional data. Transactional data, by its very nature, is highly complex, as a transaction on an e-commerce site can generate many steps that all have to be implemented quickly. That scenario is not at all ideal for Hadoop.
Nor would it be optimal for structured data sets that require very minimal latency, like when a website is served up by a MySQL database in a typical LAMP stack. That’s a speed requirement that Hadoop would poorly serve.
What Hadoop can do
Because of its batch processing, Hadoop should be deployed in situations like index building, pattern recognition, creating recommendation engines, and sentiment analysis: all situations where data is generated at a high volume, stored in Hadoop, and queried at length later using MapReduce functions.
But this does not mean that Hadoop should replace existing elements within your data center. On the contrary, Hadoop should be integrated within your existing IT infrastructure in order to capitalize on the myriad pieces of data that flow into your organization.
Consider, for instance, a fairly typical non-Hadoop enterprise website that handles commercial transactions. According to Sarah Sproehnle, Director of Educational Services for Cloudera, the logs from one popular site run by a Cloudera customer would undergo an ETL (extract, transform, and load) procedure on a nightly run that could take up to three hours before depositing the data in a data warehouse, at which time a stored procedure would be kicked off and (after another two hours) the cleansed data would reside in the data warehouse. The final data set, though, would only be a fifth of its original size, meaning that any value that might have been gleaned from the entire original data set was lost.
After Hadoop was integrated into this organization, things improved dramatically in terms of time and effort. Instead of undergoing an ETL operation, the log data from the Web servers was sent straight to the HDFS within Hadoop in its entirety. From there, the same cleansing procedure was performed on the log data, only now using MapReduce jobs. Once cleaned, the data was then sent to the data warehouse. But the operation was much faster, thanks to the removal of the ETL step and the speed of the MapReduce operation. And, all of the data was still being held within Hadoop — ready for any additional questions the site’s operators might come up with later.
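To make the "cleansing with MapReduce" idea concrete, here is a small sketch of the map and reduce phases in plain Python. It is a local simulation of the pattern, not the Hadoop API and not the customer's actual job; the three-field log format and all names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical log format: "<ip> <path> <status>". Real web logs are richer.
RAW_LOG = [
    "10.0.0.1 /cart 200",
    "10.0.0.2 /checkout 500",
    "garbage line",
    "10.0.0.1 /checkout 200",
]

def map_phase(line):
    """Cleanse one line: emit (path, 1) if well formed, nothing otherwise."""
    parts = line.split()
    if len(parts) != 3 or not parts[2].isdigit():
        return []  # malformed lines are dropped here
    return [(parts[1], 1)]

def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)

def run_job(lines):
    """Group map output by key (the 'shuffle'), then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

hits_per_path = run_job(RAW_LOG)
print(hits_per_path)
```

The point of the pattern is that the raw lines stay where they are; a different question later just means a different pair of map and reduce functions over the same stored data.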
This is a critical point to understand about Hadoop: It should never be thought of as a replacement for your existing infrastructure, but rather as a tool to augment your data management and storage capabilities. Using tools like Apache Sqoop, which can move data from an RDBMS to Hadoop and back, or Apache Flume, which can stream system logs into Hadoop in real time, you can connect your existing systems with Hadoop and have your data processed no matter the size. All you need to do is add nodes to Hadoop to handle the storage and the processing.
Required hardware and costs
So how much hardware are we talking?
Estimates for the hardware needed for Hadoop vary a bit, depending on who you ask. Cloudera’s list is detailed and specific on what a typical slave node for Hadoop should be:
Hortonworks, another Hadoop distributor, has similar specs, though it is a little more vague on the network stats, because of the varying workloads any given organization can apply to their Hadoop instance.
“As a rule of thumb, watch the ratio of network-to-computer cost and aim for network cost being somewhere around 20 percent of your total cost. Network costs should include your complete network, core switches, rack switches, any network cards needed, etc.,” wrote Hortonworks CTO Eric Baldeschwieler.
For its part, Cloudera estimates anywhere from $3,000 to $7,000 per node, depending on what you settle on for each node.
Sproehnle also outlined a fairly easy-to-follow rule of thumb for planning your Hadoop capacity. Because Hadoop is linearly scalable, you will increase your storage and processing power whenever you add a node. That makes planning straightforward.
If your data is growing by 1TB a month, for instance, then here’s how to plan: Hadoop replicates data three times, so you will need 3TB of raw storage space to accommodate the new terabyte. Allowing a little extra space (Sproehnle estimates 30 percent overhead) for processing operations on the data, that puts the actual need at roughly 4TB that month (3TB × 1.3 = 3.9TB). If your nodes are machines with four 1TB drives each, that’s one new node per month.
The nice thing is that all new nodes are immediately put to use when connected, getting you X times the processing and storage, where X is the number of nodes.
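The capacity rule of thumb above is easy to turn into a few lines of Python. This is a minimal sketch using the figures quoted in the article (three-way replication, 30 percent overhead, 4TB of disk per node); the function name and parameter names are my own.

```python
import math

def nodes_needed(new_data_tb, replication=3, overhead=0.30, node_capacity_tb=4.0):
    """Return (raw storage in TB, whole nodes to add) for a given data growth.

    raw storage = new data x replication factor x (1 + processing overhead)
    """
    raw_tb = new_data_tb * replication * (1 + overhead)
    nodes = math.ceil(raw_tb / node_capacity_tb)  # you cannot buy part of a node
    return raw_tb, nodes

# 1TB of new data a month works out to about 3.9TB raw, or one 4TB node.
raw_tb, nodes = nodes_needed(1)
print(round(raw_tb, 1), nodes)
```

Change the defaults to match your own replication factor and drive sizes; the arithmetic stays the same.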
Installing and managing Hadoop nodes is not exactly trivial, but there are many tools out there that can help. Cloudera Manager, Apache Ambari (which is what Hortonworks uses for its management system), and the MapR Control System are all equally effective Hadoop cluster managers. If you are using a “pure” Apache Hadoop solution, you can also look at third-party Hadoop management systems such as Platform Symphony MapReduce, StackIQ Rocks + Big Data, and Zettaset Data Platform.
This is just the tip of the iceberg, of course, when it comes to deploying a Hadoop solution for your organization. Perhaps the biggest take-away is understanding that Hadoop is not meant to replace your current data infrastructure, only augment it.
Once this important distinction is made, it becomes easier to start thinking about how Hadoop can help your organization, without ripping out the guts of your data processes.
Apparently, there’s a new trend in computing; people are beginning to build ultra-cheap and ultra-small computers running Android or Linux. Let’s meet a couple of the contestants…
Android PC runs Android 2.3 on a VIA 800MHz Processor, has 512MB of DDR3 memory, 2GB of NAND Flash storage, and quite a bit of connectivity:
The device supports 720p video.
Rikomagic MK802 boasts more substantial system resources (1GHz processor, 1GB of memory and 4GB of storage), runs Android 4.0, and has built-in Wi-Fi, but offers somewhat more limited connectivity options: one HDMI port and one USB port. At the same time, it supports 1080p video.
Prices? $49 and $70, respectively…