SQL inserts from file failed at unknown point…what do I do?

I was recently running a carefully assembled SQL file full of INSERTs, which we’ll call ‘something.sql’, in a screen session. Unfortunately, when I reconnected much later, the import had failed at some unknown point without an error. Sigh.

I could see that many of the hundreds of thousands of rows had been imported, but how many? I needed to know where to restart the import, and I did not want to delete all that data and start over. That would have been its own difficult reclamation project because I already had another import going.

Let’s pretend that I was importing comments for blog posts. The posts were all in the database already. So, how do you know where your import failed? I started with a guess.

 

I knew that another import on the same VM ran for about 5 hours before it completed, so I could guesstimate that around 300,000 of the rows had probably been inserted. The next steps amounted to a manual binary search. First, I wanted an upper bound on my search space.

awk 'NR==300000 { print }' something.sql

This shows me the 300,000th INSERT statement. It has an id I can use to check the database.

SELECT count(*) FROM comments WHERE comments.post_id = <ID>;

> 0

We now know that none of the comments for that post made it in and that we’re looking for some point earlier in the file. I decided to limit my search space.

head -n 300000 something.sql > 300k.txt

This creates a separate file with only the first 300k lines, one INSERT statement per line. Now we can binary search, but with a little guesswork. I was fairly sure at least 200,000 made it in, and I could test that assumption. I’ll reuse awk to show me the Nth line of the file.

awk 'NR==200000 { print }' something.sql

Turns out, that INSERT did make it in. Next up was checking line 245,000. It’s not there.

222,000 is there. 233,000 is not. I kept on like this until I found a post that had 111 entries in the database. Was 111 the right number? We can grep the lines of the file for the post id and count the matches using wc.

cat 300k.txt | grep '<POST-ID>' | wc -l

> 467

Now I know that it failed somewhere in those 467 INSERTs, but I don’t know where in the file they sit. I needed the line number of the first one. Print out the line number of the first match for me, awk.

awk '/<POST-ID>/ { print NR; exit }' 300k.txt

> 222,137

It failed between lines 222,137 and 222,604 (467 later). In fact, we know 111 INSERTs for that post made it in, so we’re pretty sure it failed on line 222,137 + 111 = 222,248. We can verify with a few SQL queries similar to the one above, checking that the comments we expect are present or absent. Lastly, we need to restart that import:

tail -n +222248 something.sql > missing.sql

psql -f missing.sql

tail helped us keep only the last lines we wanted and psql is off and running.
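The restart arithmetic above can be tried out on a throwaway file. This is only a sketch: the sample file, the table layout, the post ids, and the count of rows already present are all made up for illustration, and the count would really come from a SELECT count(*) like the one earlier.

```shell
# Build a tiny stand-in for something.sql: 3 comments for post 7, then 5 for post 9.
# The table layout and ids are hypothetical.
sample=$(mktemp)
for i in 1 2 3; do
  echo "INSERT INTO comments (post_id, body) VALUES (7, 'c$i');"
done > "$sample"
for i in 1 2 3 4 5; do
  echo "INSERT INTO comments (post_id, body) VALUES (9, 'c$i');"
done >> "$sample"

# Line number of the first INSERT for post 9 (same awk idiom as above).
first_line=$(awk '/VALUES \(9,/ { print NR; exit }' "$sample")

# Pretend the database reports 2 comments for post 9.
rows_present=2

# The first missing INSERT sits right after the rows that made it in.
restart_line=$(( first_line + rows_present ))

# Keep everything from the restart line onward.
missing=$(mktemp)
tail -n "+$restart_line" "$sample" > "$missing"
remaining=$(wc -l < "$missing" | tr -d ' ')

echo "first match: $first_line, restart at: $restart_line, remaining: $remaining"
```

This only holds if the rows went in strictly in file order, which is the same assumption the manual search relies on.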

And that is that. It took two guesses, about 10 spot checks, and a few simple SQL queries to save hours.
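If I had to do this again, the spot checks could be scripted. Below is a minimal sketch of the binary search, not what I actually ran. The row_is_imported probe is hypothetical: in real use it would pull line N out of the file with awk and ask the database whether that row exists. Here it is stubbed with a known failure point so the sketch runs on its own.

```shell
# Assumes rows were inserted in file order: every line before the failure
# point is in the database and every line from it onward is missing.

# Hypothetical probe: succeeds when the INSERT on line $1 is in the database.
# Stubbed here; a real version would query the database for that row.
FAILED_AT=222248
row_is_imported() {
  [ "$1" -lt "$FAILED_AT" ]
}

# Print the lowest line number in 1..$1 whose row is missing.
# Prints $1 + 1 when every row made it in.
first_missing_line() {
  local lo hi mid
  lo=1
  hi=$(( $1 + 1 ))
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    if row_is_imported "$mid"; then
      lo=$(( mid + 1 ))   # line mid is present, so the failure is later
    else
      hi=$mid             # line mid is missing, so the failure is here or earlier
    fi
  done
  echo "$lo"
}

first_missing_line 300000
```

With 300,000 lines this needs at most 19 probes, which is in the same ballpark as the handful of manual checks plus two good guesses.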

Running Hadoop on Mac OS X Single Node Cluster

This guide will get you past the troubles of getting Hadoop installed and running on Mac OS X. I’ve tested it for Hadoop 2.2.x and OS X 10.9.

The meat of the process you need to follow is well documented by Michael Noll. I’m only going to add the steps here that are required over and above the Ubuntu guide for OS X. You should probably use brew unless you have a great reason not to.

In <HADOOP>/etc/hadoop/hadoop-env.sh, add or edit these lines:

# The java implementation to use.

export JAVA_HOME=`/usr/libexec/java_home -v 1.6`

# Extra Java runtime options.  Empty by default.

export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

export HADOOP_OPTS="$HADOOP_OPTS -Djava.awt.headless=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="

YARN_OPTS="$YARN_OPTS -Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk -Djava.awt.headless=true"

For X permission problems over ssh:

xhost +

Make sure you enable ssh using System Preferences > Sharing > Remote Login.

Starting Hadoop.

<HADOOP>/sbin/start-dfs.sh

<HADOOP>/sbin/start-yarn.sh

Running the WordCount Examples (download the files as in Michael’s guide).

bin/hdfs dfs -mkdir -p /user/hduser/gutenberg

bin/hdfs dfs -copyFromLocal -f /tmp/gutenberg/* /user/hduser/gutenberg

bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

If the examples jar is not at that path, find it with:

find /usr/local/hadoop -name 'hadoop*examples*.jar'

Conversion Details with Brecht Palombo

I posted an interview with Brecht over at Keepify. He started and runs Distressed Pro, a sales intelligence tool for real estate brokers. The interview is a deep dive into the details of how he acquired early customers and what he does now to make conversions happen.

Also, check out the Bootstrapped with Kids podcast where he and Scott Yewell discuss the progress of their online business pursuits each week.

Pre-Launch Email List Building With Directories

I’ve recently been looking for directories that allow you to connect with prospective customers, build some links, and add exposure to an app pre-launch. I have found some similar lists elsewhere, but typically a lot of the sites in the list are spammy or gone forever.

This is a current list of sites I’ve found for building an email list before you launch. Leave me a comment to add or remove a site and I will keep this list current.

Some of these sites require your app to be launched or in beta, but most will accept it in either state. Quite a few of these sites will accept small to medium sized payments in exchange for an expedited review or posting. A few of these sites are explicitly about advertising.

Beta

  1. BetaLi.st
  2. StartupLi.st
  3. http://momb.socio-kybernetics.net/about
  4. http://www.startuptabs.com/

Launched

  1. http://web.appstorm.net/about/submit-an-app-for-review/
  2. http://www.go2web20.net/
  3. http://www.operation6fig.com/submit-a-start-up
  4. http://webapprater.com/submit-your-web-application-for-review.html

Beta or Launched

  1. http://helpareporterout.com
  2. http://startuptunes.com/
  3. http://techcrunch.com/
  4. http://www.crunchbase.com/
  5. http://www.psfk.com/
  6. http://thetechmap.com/
  7. http://allweb2.com/proposer-un-site/ (French)
  8. http://www.thehightechdirectory.com/
  9. http://boxyblogs.com/
  10. http://nibletz.com/category/start-ups/
  11. http://topecommercestartups.com/
  12. http://productivewebapps.com/submit/
  13. http://www.new-startups.com/submit
  14. http://doers.bz/
  15. http://aboutyourstartup.com/
  16. http://startupdirectory.com.au/submit-startup/
  17. http://www.venturebin.com/submit-venture/
  18. http://thestartuppitch.com/post-a-pitch/
  19. http://appuseful.com/app/add
  20. http://www.startupbooster.com/submit-site/
  21. http://readwrite.com/page/contact
  22. http://techli.com/contact/
  23. http://mashable.com/submit/
  24. http://www.cloudsurfing.com/newsite
  25. http://www.techpluto.com/submit-a-startup/
  26. http://startuplift.com/ (leans toward launched)
  27. http://www.appvita.com/
  28. http://alltop.com/submission/
  29. http://www.killerstartups.com/submit-startup/
  30. http://ratemystartup.com/submit-your-startup/
  31. http://www.inviteshare.com/
  32. http://hackerstreet.in/
  33. http://startupmeme.com/how-to-submit-your-startup-at-startup-meme/
  34. https://angel.co/public
  35. http://thestartupfoundry.com/tip-us/
  36. http://www.betakit.com/tips/ (Canada)
  37. http://startups.fm/contact-us
  38. http://web.startupstats.com/
  39. http://www.techhunger.com/submit-startup/
  40. http://netted.net/contact-us/
  41. http://webdevtwopointzero.com/submit-a-site/
  42. http://www.ontheapp.com/about/
  43. http://www.rev2.org/contact/#submitastartup
  44. http://www.launchingnext.com/
  45. http://apps400.com/submit-your-application-for-review
  46. http://www.appappeal.com/contact/advertise

SaaS/Software

  1. http://www.austinstartuplist.com/ (or your local one)
  2. http://www.capterra.com/vendors
  3. http://getapp.com
  4. http://feedmyapp.com
  5. http://saasdir.com
  6. http://saas-showplace.com
  7. http://www.cloudshowplace.com/add-your-company/
  8. http://www.cloudbook.net/directories/product-services/cloud-computing-directory
  9. http://www.moblized.com/

Advertise

  1. http://www.makeuseof.com/advertise/
  2. http://muckrack.com/

Automate Submissions

  1. https://applaunch.us/pricing

Bold Action and Success

I’ve been reading everything I could about WWII since before I can remember deciding to do so. I watched hundreds of black and white documentaries and shorts about battles, tactics, strategy, and the experience of battle. I enjoy details the best.

What did it feel like to be at Bastogne? What was Halsey’s reasoning at Leyte? Why did Churchill order the British to fire on the French fleet in port at the outset of WWII? What makes a strategic move like an amphibious assault on an enemy’s rear or flank an extraordinary success in one case (Inchon) and near disaster in another (Anzio)?

I’ve read about the war and its details from generals, intelligence officers, politicians, leaders, historians, and grunts. In hundreds of stories about success I’ve seen one vital and common thread: bold action. John Keegan, the military historian who taught at Sandhurst, wrote a great book about the experience of war throughout history from the perspective of the common soldier. Morale is the largest single dividing line between success and rout.

Success in these stories nearly always involves a component of bold action taken with incomplete information and adjusted on the fly by the men in the mud. Sometimes the bold action is a brilliant offensive maneuver. Other times it is a smart retreat to reestablish the line on better terrain and over a smaller area. In other stories the boldness is simple defiance in the face of overwhelming odds — sometimes standing your ground is the boldest move of all. The action is usually unexpected, creative, and carries some real risk. Many of these gambles would never be run with complete information, but all of the successful ones are run with all the force, determination, and courage that the men could muster.

Churchill’s six-volume history of WWII has an interesting section where he discusses the promotion and reassignment of various commanders. He stresses that he wants to stand behind commanders who fail while running risks, because victory demands risks. Commanders who fear that politicians will replace them at the first failure will never take those kinds of risks. Churchill had to replace many commanders in North Africa, but never because they failed while undertaking an intrepid campaign.

Like many of life’s lessons, this one is applicable in all sorts of places. Losing weight, getting out of debt, and building a business all qualify. Success demands bold action.

Mastering the Mobile Application Market with Patrick Thompson of Inkstone Mobile

Last month I interviewed Patrick about how he built his mobile app business, after hearing him speak at MicroConf this year and talking with him there. Patrick has great insight into the mobile space and I’m really pleased he took some time out to speak with me.

Check out the interview.

MicroConf 2013

Jason Cohen led off the conference with a talk about building a “money machine” that brings in $10,000/mo. Jason crushed it. Rob closed the first day and spoke about taking HitTail from $1500/mo to well beyond the “money machine” mark in 20 months. Rob burned it down. Multiple tweets showed people changing pricing strategies, making sweeping copy changes, and saying the conference had already paid for itself within minutes or hours. Welcome to my retro-diary for MicroConf 2013. As usual, I’m going to cover material that most closely resonated with me and where I am.

Immediate Actionable Takeaway

  1. Use CPC = MRR/25 to work backwards from the CPC I can achieve in available channels and arrive at a pricing scheme for Keepify.
  2. Use money back guarantees and incentivized annual pre-pay from day one.
  3. Meet weekly with the family to plan work and leisure time. We have informal understandings, but planning would be an improvement.
  4. Plan out a marketing component that doesn’t scale and execute.

The Money Machine

Jason’s talk was a great look at the math and constraints behind building a business. He explained how you could piece together a cash machine from first principles similar to his recent post on CPC for bootstrapped business. He left it as, “Predictable acquisition of recurring revenue with annual pre-pay in a good market creates a cash machine.”

Giving customers incentives to purchase annual pre-pay plans allows WPEngine to advertise with a much higher CAC and CPC. You can spend $300 to acquire a customer who is prepaying for 10 months at $49 per month right now.

Free trials can be eliminated in favor of a 60-day money-back guarantee. Use multiple plans and raise prices.

You need 150 customers paying $66 / month on average. You can get 50 by scratching and clawing (see things that don’t scale) and 25 more with guest posts and social media. The final 75 can come from basic marketing, all over a period of months.
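The numbers in this section are easy to sanity check. This is my own back-of-the-envelope sketch, not something from the talk. In particular, applying CPC = MRR/25 to a single $49/month plan is my assumption about how to read the formula.

```shell
# Customer count: scratching and clawing + guest posts/social + basic marketing.
customers=$(( 50 + 25 + 75 ))

# Monthly revenue at an average of $66 per customer.
monthly_revenue=$(( customers * 66 ))

# Cash collected up front from 10 prepaid months at $49 per month,
# and what is left after a $300 acquisition cost.
prepay_cash=$(( 10 * 49 ))
cash_after_cac=$(( prepay_cash - 300 ))

# CPC = MRR / 25, applied to a $49/month plan (assumed reading of MRR).
target_cpc=$(awk 'BEGIN { printf "%.2f", 49 / 25 }')

echo "customers=$customers monthly_revenue=$monthly_revenue"
echo "prepay_cash=$prepay_cash cash_after_cac=$cash_after_cac target_cpc=$target_cpc"
```

150 customers at $66 comes to $9,900 a month, which is where the roughly $10,000/mo "money machine" figure comes from.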

Learning

Rob’s talk emphasized that it took him 5 months of building and 6 months of learning before he began to scale the business. All that time learning went into improving conversion rates, retention, and copy, adding features, and increasing customer understanding. He is planning for a similar learning period in the future, which is instructive to hear. I’ve experienced similar (but smaller scale) things recently with PPC ads. You have to be willing to spend a little money and stick it out through many revisions to be successful.

Teach

Nathan Barry, Brennan Dunn, and Hiten Shah all reinforced the necessity of using educational marketing as part of the customer acquisition process. Somewhere, Chet Holmes is proud.

Copywriting

Joanna Wiebe of Copyhackers gave a great talk about copywriting. She said to minimize the visibility of free offerings, use email, use long-form sales pages, and start testing.

One point she made seemed particularly relevant to the audience: “Stop treating marketing as an experiment.” I understood that to mean that you need to view things in the long term. Don’t give up when an initial ‘experiment’ in marketing doesn’t work.

Plugging Holes

Rob touched on his Operation Retention, where he improved conversion and retention throughout the funnel for HitTail. He only resumed his marketing activities when those numbers became healthy again. He gave healthy numbers as churn of 8% or less and a trial-to-paid conversion rate of 40-60%.

Things that don’t scale

Rob, Josh Ledgard, Erica Douglass, Hiten Shah, Jason Cohen, and Patrick all stressed the importance of talking to customers. Especially early or after cancellation.

Erica included some case studies, such as offering one-on-one consulting for early customers and emailing 1000 people in a few months, as examples of doing things that don’t scale in order to learn and market. Those 1000 emails were from Leo Widrich of Buffer and converted into hundreds of guest blog posts and Buffer buttons on blogs.

Creating Channels

Hiten talked about creating your own channels to reach customers because established channels get crowded quickly. He gave the examples of Nathan Barry, Brennan Dunn, Ruben Gamez, and KISSInsights, which used a ‘Powered By’ link on its surveys to connect with customers.

Workaholics

I was part of an interesting hallway conversation about how to go from holding a job and working on your side pursuit to working normally. Many of the attendees spent so many years working 60 hour weeks that they didn’t know how to stop and enjoy the freedom they had earned. It’s an interesting subject that needs more direct treatment.

Sherry Walling gave a good talk that detailed how she and Rob made it through rough times and built relational systems and communication that even out the ups and downs of entrepreneurship in a relationship. It was another subject that became a topic of conversation in the hallways. Kids are common among attendees and everyone is looking to build a better future for their family.

Golden Handcuffs

Reflecting on some of the conversations at MicroConf and a talk with my friend Evan forced me to consider how a growing income and a growing family have changed the equation for what my minimum “money machine” looks like. Sherry’s talk included a quote about how cleaning services were cheaper than therapy and therapy was cheaper than divorce. I agree. I spoke with others about paying for cleaning services or lawn care, but where do those things trade off against growing expenses and a lifestyle that tighten those golden handcuffs? It’s something for me to ponder.

Thanks to Rob and Mike for another great conference. I barely scratched the surface of the value available from attending MicroConf. There is simply too much for a blog post. You’ll have to join us for the 2.5 days next time.

You can wait for MicroConf 2014 or check out MicroConf Europe in October.

Pre-Launch Marketing

I’m thinking a lot about pre-launch for Keepify these days and I thought I’d organize it all here. There are a lot of different ways to try to collect traffic, attention, etc. Some of it is more effective before launching than others. Much of it depends on your product and market. Peldi from Balsamiq has a famous and excellent post on launch marketing, but pitching bloggers on Keepify is a bit different than pitching a mockup tool.

To demo and feel out the product requires data, time, and perhaps even effort. It’s hard to create a great experience without critical mass to show. The point is that what I decide to do may make no sense for you to do. Jason Cohen has a great post where he records honest thoughts on getting your first few customers. He points out that there is no formula. Every route has a champion and a detractor. You have to try a bunch of things and learn what works for you.

Brainstorm

I sat down recently and wrote a list of things I could do as marketing activities. I wanted to start by brainstorming before I cut things from the list.

Content

  • Guest post
  • Long tail SEO
  • Pillar Post Big SEO
  • Video
  • Infographics
  • HN/Reddit/similar placement
  • Testimonial or writeup
  • Squidoo/Hubspot
  • Webinars
  • Press Releases
  • Article sites
  • Craigslist Ads or Requests
  • Quora, Forums
  • Case Study Articles
  • Podcasts
  • WordPress Plugin

Permission Marketing

  • Mailing List
  • Lifecycle Email

Social Media

  • Facebook
  • Twitter
  • LinkedIn
  • Discount for sharing

Ads

  • LaunchBit
  • InfluAds
  • AdRoll
  • BingAds
  • Facebook Ads
  • LinkedIn Ads
  • BuySellAds

I reexamined “8 Ways to Build Pre-Launch Mailing List,” Episode 72 of Startups for the Rest of Us, and I built the notes below.

  1. Use your audience (blog, podcast, etc.)
  2. SEO
  3. Infographic/Viral Content
  4. HN or equivalent
  5. Facebook Ads (affordable and working)
  6. Social Media Network
  7. Niche Ad Networks
  8. AdWords (last because it ain’t cheap)

I also checked in on “7 Catastrophically Common Launch Mistakes,” Episode 121, for the reverse perspective.

  1. No landing page before coding.
  2. Not tracking key metrics from the start (traffic sources, conversion rates)
  3. Relying on Word of Mouth (it isn’t really there)
  4. Open betas (be direct with early leads)
  5. One single launch email (do a sequence)
  6. Free plan or low price tier
  7. Slow growth (loss of interest)

My last stop for the podcast was Episode 122, “4 B2B Strategies.”

The strategies are Inbound, Outbound, Paid, and Partnership. Inbound is SEO, guest posts, infographics, and podcasts. Outbound is phone, email, direct mail, etc. Paid is various forms of advertising. Partnerships are joint venture deals. If you don’t have a big network or mailing list to trade, offer a revenue share.

Rob Walling also has a post on why you should start marketing on Day One which ties into the 7 Mistakes podcast above.

Pre-Launch Effectiveness

You don’t have anything to sell yet, but you want to get attention. You need beta users. You would like to have a list of people pre-purchase or at least sign-up for a launch list. You need to collect emails for people that come by and learn about you. It’s the most essential activity you can do for marketing right now.

You need a landing page with an email signup form. You probably also want a list to connect to on MailChimp or similar. You should install Google Analytics and at least use click tracking for conversions. This will really help when you have multiple traffic sources and you want to know how well you converted and from where. It also lets you compare ad network numbers to a baseline (though they may legitimately disagree).

Your landing page should probably be using some form of A/B testing. I found Optimizely to be affordable and easy to get started with. They do a very nice job of on-boarding and engineering the first-run experience. It’s almost worth signing up just for the experience.

In the past I have resisted A/B testing for a simple landing page collecting email addresses with small amounts of traffic, but on reflection it was a catastrophic, arrogant, silly mistake. Don’t be me. Be the guy A/B testing. You can absolutely learn from A/B testing small traffic loads and you’ll be surprised how quickly executing these techniques can change your traffic outlook. So start now.

Most of the things listed above drive traffic to your landing page. It’s important that you do well with those activities, but the landing page is really critical. If you don’t know how to write copy it’s probably worth reading some good resources on copy. I’ve also read Ogilvy, The Copywriter’s Handbook, CopyBlogger, a headline book, and more. I’m starting to get a good feel for what should be in my copy, but I still frequently write tremendously bad copy. It’s a process. Every time. I’ve changed my conversion rate from < 10% to 33% with copy changes and A/B testing (It’s still not as good as it could be). Some variants are more than 50% better than an alternative. Would you like 50% more signups? Yeah. Do that.
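When I say one variant is some percent better than another, I mean relative lift in conversion rate. Here is a minimal sketch of that comparison. The visitor and signup counts are made up for illustration and are not my real numbers, and the two helper functions are hypothetical names.

```shell
# Conversion rate as a percentage: signups ($1) out of visitors ($2).
conversion_rate() {
  awk -v s="$1" -v v="$2" 'BEGIN { printf "%.1f", 100 * s / v }'
}

# Relative lift of variant B's rate ($2) over variant A's rate ($1), as a percentage.
relative_lift() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", 100 * (b - a) / a }'
}

# Hypothetical counts: 400 visitors per variant.
rate_a=$(conversion_rate 40 400)
rate_b=$(conversion_rate 66 400)
lift=$(relative_lift "$rate_a" "$rate_b")

echo "A: $rate_a%  B: $rate_b%  lift: $lift%"
```

With small traffic the lift will bounce around a lot from week to week, so treat a number like this as a hint about which copy to keep iterating on rather than a final verdict.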

You might be wondering what is working best for me. Some niche ads are doing well, but content has performed admirably and I’ve yet to really focus there. The truth is that I’m going to keep trying things up until it’s Launch Marketing. Many things only work after a launch (like joint venture deals) and I’ll have a better idea of what to keep doing, what to try, and how to structure my copy or content. I’ll write more on traffic strategies as I pursue new ones. Happy hunting.

Model Selection for SaaS Churn Prediction Using Machine Learning

This is a post in a series about churn and customer satisfaction. If you want churn prediction and management without more work, check out Keepify. If you want more details, email away.

Recently I have been developing machine-learning systems that will predict SaaS churn. Churn prediction has many desirable business benefits and applications, but here I will focus on the technical details of selecting a durable model for predicting churn and some of the lessons I’ve learned along the way.

Beginnings

Most learning problems should be attacked initially with a linear model. I tried two versions of a linear approach in the early days. The first was an attempt to predict the number of months a user would stay using linear regression. This was a terrible failure. It was essentially 90% wrong. The root mean square error was absurdly high. I think this was the wrong approach with the wrong data, but it was a fun initial experiment to get some momentum. The second was an attempt to classify users as churners or non-churners using logistic regression. I’ll address that one more in the next few sections.
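For anyone who hasn’t met it, root mean square error is the square root of the average squared difference between predicted and actual values. A minimal awk sketch, where each input line holds an actual and a predicted number of months. The four pairs are made up for illustration and are not from my data.

```shell
# Read "actual predicted" pairs from stdin and print the RMSE to two decimals.
rmse() {
  awk '{ err = $2 - $1; sum += err * err; n++ }
       END { if (n > 0) printf "%.2f", sqrt(sum / n) }'
}

# Hypothetical customers: months they actually stayed vs. months predicted.
printf '3 5\n12 8\n6 6\n1 4\n' | rmse
```

An error of a few months is a big deal when most customers only stay a few months, which is part of why regression on tenure was a poor fit here.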

Literature Review

After my initial failure, I decided to fire up Google Scholar like my old days in graduate school and try to find some meaningful research on a similar subject. It turns out that a lot of subscription-based services like cable, Internet, and periodical publications fund both academic and industry research in churn prediction. There isn’t any apparent research on SaaS specifically, but the foundations of predicting churn for a newspaper subscription should be similar. In fact, I thought that SaaS should have far superior data to use in prediction.

The research says that the most successful models are Logistic Regression and Random Forests. Many people have shown the efficacy of Support Vector Machines to fall in between these popular options. Neural networks are another popular option with varied, but solid results [1]. My later experiments tried to use some of this insight and focus on models that had the most promise.

Experiments

I decided to use Weka to try a lot of different experiments quickly on the same data set. I was careful about separating my data into strict train and test segments, but I was happy to use various datasets to experiment with different learning hypotheses. Weka performed beautifully for me and came with an additional benefit, the JVM. I was processing some of the data transformation in Ruby and I wanted to integrate this system into a Rails application. JRuby made working with Weka and Rails incredibly easy.

It was easy to transform my existing data to ARFF file formats for Weka and I managed to test out nearly all of the relevant classifiers that Weka supports. I have not used SVM or Neural Networks for reasons I explain in the next section. Bayesian Nets and AdaBoost show promise as classifiers for churn prediction in my experiments, but they don’t show up much in the literature.
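To give a feel for the ARFF format, here is a minimal sketch that wraps a small CSV in an ARFF header. The feature names and rows are hypothetical and much simpler than my real data, and my actual transformation was done in Ruby rather than shell.

```shell
# Hypothetical per-customer features; the column names are made up.
csv=$(mktemp)
printf 'logins_per_month,months_active,churned\n12,8,no\n1,2,yes\n30,14,no\n' > "$csv"

arff=$(mktemp)
{
  echo '@relation churn'
  echo '@attribute logins_per_month numeric'
  echo '@attribute months_active numeric'
  echo '@attribute churned {yes,no}'   # nominal class attribute
  echo '@data'
  tail -n +2 "$csv"                    # data rows are plain comma-separated values
} > "$arff"

cat "$arff"
```

The header declares each column and its type, and the class to predict is just another attribute, here a nominal one with two values.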

Classifier Comparisons and Selection

Random Forests dominate the research landscape as the model of choice and my experiments bear that out. Random Forests win. A lot. The intuition to explain why is two-fold. Random Forests are extremely robust without performing feature selection. They do their own version of feature selection that works well for this problem. Random Forests are based on decision trees that classify data pretty well across a small number of known classes. They’re especially effective when certain feature values correlate highly with certain classes. Decision trees (and Logistic Regression) share a final benefit. They show how the classifier works internally in an understandable way. If your customers churn when they use feature X only once per month then you can see that in how the decision tree is structured. This is powerful insight.

Logistic Regression works really well, if not quite as well as Random Forests. It not only presents a model that explains how it works, but it does so with more emphasis on how sure it is whether a customer falls into one class or another.

I didn’t use Neural Networks in any experiments in large part because it isn’t something I could do out of the box with my data in Weka and it famously does not lend any insight into how the classification works. Neural Nets are a black box. Ideally, my classification engine for Keepify will be able to provide more insight for customers than classification alone.

Support Vector Machines are a very cool combination of linear classifiers that optimize a hyperplane. They are a sexy choice, but their performance is not quite as good as Random Forests, they don’t show their work any more than Neural Nets do, and they are really slow. I can generate predictions for thousands of customers with hundreds of features using a Random Forest in a few seconds. An SVM might take minutes or worse.

In the end, I decided to use Random Forests and Logistic Regression. I do plan to experiment further with AdaBoost, however, as it is effective at eliminating bias from data sets that have classes with low prevalence.

[1] http://cjou.im.tku.edu.tw/bi2009/DM-usage.pdf