Analytics Tools: Open Source vs Commercial
The use of open-source tools such as Python and R is growing rapidly. However, commercial tools such as SAS have been around for literally half a century and are strongly entrenched. For companies deciding on an analytics tool suite, what are the considerations they will use to make their decision?
I agree with the writers above. Here is my addition: open-source tools are free to use, but if you need support, you will have to buy it from one of several providers. The fact that you have a choice of provider may itself be an advantage. So please note that open source does not mean free.
Generally, commercial tools have better interfaces: not only the user interface, but also the data interfaces for input and output. Open-source tools are generally better (and more up to date) in their processing capability, as they have a much larger developer base. R specifically is used by university researchers, so it is always at the forefront of innovation. The downside of being at the forefront is that the code can sometimes exhibit bugs or inefficiencies, depending on who developed that method. So you have to be more careful with open source.
The most important point is the user. Data science tools such as the ones you mention require a knowledgeable user, and within that, open source will require more knowledge and ability than a commercial solution. All data science methods need parametrization, for example. With most open-source tools, it is expected that the user really knows what is going on and will set the values correctly. If you have not hired, or cannot hire, the right person, then the tool might not help you. As this holds for all such solutions, you may want to do what a lot of companies are doing ... outsource the project altogether to a company that has both the people and the tools to do it.
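To make the parametrization point concrete, here is a minimal sketch in Python (not tied to any particular package): even a trivially simple method behaves completely differently depending on a single parameter that the user must set knowledgeably.

```python
# Minimal illustration: the same method with different parameter values
# gives very different results -- the user must know how to choose them.

def moving_average(series, window):
    """Smooth a series with a trailing moving average over `window` points."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

noisy = [1, 9, 2, 8, 3, 7, 4, 6]

# A small window barely smooths the noise; the largest window
# collapses the whole series into a single overall mean.
print(moving_average(noisy, 2))
print(moving_average(noisy, 8))
```

Neither setting is "wrong" in itself; which one helps depends entirely on what the analyst is trying to learn from the data, which is exactly why an uninformed user can get misleading results from a correctly working tool.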
The previous responders have covered numerous key factors in deciding whether to use open-source or commercial statistical computing packages, so I will not repeat those points here. One further consideration is data privacy and security. As noted above, R and other open-source packages tend to be at the forefront of these technologies, but when you want to use a new or emerging method, you frequently need to download a module or package (in the case of R) specific to that method. For non-academic, commercial enterprises, where data and results are potentially worth millions of dollars, this presents a cyber-security issue.
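One basic mitigation for that download risk, sketched here with Python's standard library (the file name and checksum in the usage comment are hypothetical), is to verify a downloaded package against the checksum published by the project before installing it:

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected_sha256):
    """Return True only if the file matches the published checksum."""
    return sha256_of_file(path) == expected_sha256.lower()

# Hypothetical usage, comparing against a checksum from the project's site:
# if not verify_download("some_package.tar.gz", "3a7bd3e2360a..."):
#     raise RuntimeError("Checksum mismatch -- do not install this package.")
```

This catches corrupted or tampered downloads, though of course it does not audit what the package's code actually does; enterprises with high-value data typically add internal package mirrors and review processes on top.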
The decision has to consider multiple facets:
1. User perspective: Who is going to use it? What is their level of skill in analytical techniques and programming? What expertise does management have in interpreting the results of the analysis? If the general answer to the above is "intermediate", you are better off with a proprietary tech stack so that you can start easily and realistically.
2. Problem and data perspective: What exact problem are you trying to solve? If it is really unique, say text summarization, you may not find advanced capabilities in current proprietary offerings. If your domain is sensitive and unique, it may be susceptible to the known drawbacks and limitations of open-source implementations and hence potentially dangerous to use. How much time do you have to address the problem? How much data do you have to solve it? If the window to create a solution is small and a large amount of data is involved, you are better off going with proprietary tech.
I can think of many more perspectives, such as available budget, infrastructure requirements, existing company policies, and industry regulations, which can play a role in the decision, but I think this should be enough for you to get started.
Consider the total cost of ownership: licence fees, the major characteristics of the tools, and the support and further development available. Open source is open, and you can usually find communities to help with this, but when you cannot, you need internal IT staff able to work with it. Sometimes, when you sign a support contract for open-source software, you end up with terms and conditions similar to those of a commercial solution, so the licence-cost difference between open source and commercial may not be as clear-cut as assumed. You need to conduct a risk analysis based on these terms.
In the concrete case of R vs SAS, there can be additional factors. What about performance? R can perform very well, even on large datasets, but you have to use it the right way. For instance, parallel computing in R can be difficult (requiring tens of lines of code) for some applications or models to train. Then there is always the matter of satellite functionality: when you run a data analytics process, there are multiple things to keep in mind apart from the EDA or model training and validation. You have to ensure the reproducibility of the process and safeguard the data lifecycle.
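For comparison, the same idea can be sketched in Python with only the standard library's `concurrent.futures` (in R you would reach for packages such as `parallel` or `foreach`). This is an illustrative sketch, with `fit_fold` standing in for any expensive, independent task such as training a model on one cross-validation fold:

```python
from concurrent.futures import ProcessPoolExecutor

def fit_fold(fold_id):
    """Stand-in for an expensive, independent task (e.g. training on one CV fold)."""
    return fold_id, sum(i * i for i in range(10_000))

def run_serial(folds):
    """Run all folds one after another."""
    return [fit_fold(f) for f in folds]

def run_parallel(folds, workers=4):
    """Run the folds across worker processes; the tasks are independent,
    so the work is embarrassingly parallel."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit_fold, folds))

if __name__ == "__main__":
    folds = range(8)
    # Same results either way; only the wall-clock time differs.
    assert run_parallel(folds) == run_serial(folds)
```

The point is not that parallelism is hard everywhere, but that in open-source stacks it is often the user's job to structure the work this way, whereas commercial suites may parallelize some procedures behind the scenes.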