欢迎访问 生活随笔!

生活随笔

当前位置: 首页 >

使用Java解决您的数据科学问题

发布时间:2023/12/15 46 豆豆
生活随笔 收集整理的这篇文章主要介绍了 使用Java解决您的数据科学问题 小编觉得挺不错的,现在分享给大家,帮大家做个参考.

Java offers versatility, interoperability, and the chance to zip around Europe in a red vespa. Data scientists are expected to have good coding chops in Python (or R). I argue that Java should be added as a useful tool in the data science skillset.

Java提供了多功能性,互操作性,并有机会在红色的vespa中环游欧洲。 数据科学家有望在Python(或R)中拥有良好的编码能力。 我认为应该将Java添加为数据科学技能集中的有用工具。

Since 1995, Java has been sipping espresso and writing killer apps, occasionally sending a wistful glance out the cafe window. It’s probable there’s already a Java app that could help you with your work as a data scientist.

自1995年以来,Java一直在喝咖啡并编写杀手级应用程序,偶尔会在咖啡馆的窗户外望去。 可能已经有一个Java应用程序可以帮助您作为数据科学家工作。

This article will help you get a sense of the essentials, understand how Java could be used to support machine learning, and check out some resources for further learning. This article does not intend to teach you the actual syntax of this versatile language. This article will not offer to buy your espresso, but Java might.

本文将帮助您了解基本知识,了解如何使用Java支持机器学习,并查看一些资源以进一步学习。 本文无意教您这种通用语言的实际语法。 本文不会提供购买您的意式浓缩咖啡的信息,但Java可能会提供。

In this guide:

在本指南中:

  • What is Java?

    什么是Java?

  • Why learn Java?

    为什么要学习Java?

  • What are the core differences between Python and Java?

    Python和Java之间的核心区别是什么?

  • Summary

    摘要

  • Resources

    资源资源

什么是Java? (What is Java?)

Java is a general-purpose programming language that is class-based, object-oriented, and designed for interoperability. You could say Java has a devil-may care attitude when it comes to the deployment environment — an app written in Java can run on any operating system that can run the Java Virtual Machine (JVM). This flexibility is why Java is a favorite of app developers seeking to create a tool that is operating system agnostic and will transfer well to mobile.

Java是一种通用的编程语言,它是基于类,面向对象的并且旨在实现互操作性 。 您可能会说Java在部署环境方面表现出恶魔般的关心态度-用Java编写的应用程序可以在可以运行Java虚拟机(JVM)的任何操作系统上运行。 这种灵活性就是为什么Java成为寻求创建与操作系统无关的工具并将能很好地迁移到移动设备的应用程序开发人员的最爱的原因。

A java program’s source code (the .java file) is written in curly-brace heavy syntax. The program is compiled down to bytecode (a .class file) that can run on the JVM.

Java程序的源代码(.java文件)以大括号语法编写。 程序被编译为可以在JVM上运行的字节码(.class文件)。

A program is compiled from source code (e.g., pdfParsingApp.java) to bytecode (e.g., pdfParsingApp.class) by typing javac pdfParsingApp.java

通过键入javac pdfParsingApp.java ,将程序从源代码(例如pdfParsingApp.java)编译为字节码(例如pdfParsingApp.class)。

If you’re familiar with the concept of machine code from computer science fundamentals, java bytecode is analogous — but instead of being written for the specific architecture of the host computer, the bytecode is written for the virtual machine.

如果您从计算机科学的基础上熟悉机器代码的概念,那么Java字节码是类似的-但不是为主机的特定体系结构编写字节码,而是为虚拟机编写字节码。

My Tutorial Blog我的教程博客

The JVM translates the Java bytecode into the platform’s machine language — this is what gives Java its interoperability. The JVM is capable of translating from languages such as Clojure, Scala, and others that compile down to Java bytecode.

JVM将Java字节码转换为平台的机器语言-这就是Java的互操作性的原因。 JVM能够从Clojure,Scala等语言编译成Java字节码的语言进行翻译。

为什么要学习Java? (Why learn Java?)

While Python may be the lingua franca of the data science world (with R playing a role to a lesser extent), in terms of the broader developer community, Java is used by an estimated 40% programmers. Because of its widespread adoption, having at least a cursory understanding of how Java packages are structured, written, and implemented can help a data scientist communicate more effectively with their software engineering colleagues.

尽管Python可能是数据科学领域的通用语言 (R在其中起的作用较小),但就更广泛的开发人员社区而言, 估计有40%的程序员使用Java 。 由于Java包的广泛采用,至少对Java包的结构,编写和实现方式有一个粗略的了解可以帮助数据科学家与他们的软件工程同事更有效地沟通。

Moreover, understanding Java is helpful for the data acquisition and deployment phases of the machine learning pipeline.

此外,了解Java有助于机器学习管道的数据获取和部署阶段。

At the data acquisition stage, Java is helpful because production code bases are often written in Java. As a data scientist, the ability to go upstream to fix bad data before it enters the machine learning pipeline is invaluable. Behind SQL (used by 55% of programmers), Java is probably the most helpful language for addressing backend data issues and for cleaning up data.

在数据采集阶段,Java很有用,因为生产代码库通常是用Java编写的。 作为数据科学家,能够在数据进入机器学习管道之前上游修复不良数据的能力非常宝贵。 在SQL( 55%的程序员使用过的 SQL)之后,Java可能是解决后端数据问题和清理数据最有用的语言。

Backsliding from model development to data cleaning — this is what we’re trying to avoid through improved data quality via data engineering. Photo by Wilbur Wong on Unsplash.从模型开发到数据清理的后退—这是我们试图通过数据工程提高数据质量来避免的事情。 Wilbur Wong在Unsplash上的照片 。

For example, in natural language processing (NLP), we deal with a lot of unstructured text data that’s designed to be readable to humans but not to computers. Apache PDFBox is a Java library designed to address the challenge of wrangling text from a PDF into a usable format. Not only does this tool support content extraction, it can also be used to modify existing documents and create new ones. That would be a really cool way to report the output of your machine learning model built on text analytics.

例如,在自然语言处理(NLP)中,我们处理了许多非结构化的文本数据,这些数据被设计为人类可以读取的,但计算机却不可读。 Apache PDFBox是一个Java库,旨在解决将文本从PDF转换为可用格式的挑战。 该工具不仅支持内容提取,还可以用于修改现有文档并创建新文档。 这将是报告基于文本分析的机器学习模型的输出的一种非常酷的方法。

Speaking of deployment, Java affords a simple introduction into Scala. This super-fast language runs on the JVM and is often used by DevOps engineers to put machine learning models into production.

说到部署, Java对Scala进行了简单介绍 。 这种超快速的语言在JVM上运行,DevOps工程师经常使用该语言将机器学习模型投入生产。

The most powerful argument for learning Java is that it provides a strong foundation for improving your object-oriented programming. Computer science 101 classes are often taught with Java as the base language to provide instruction on data structures & algorithms. Because Java doesn’t offer as much abstraction as Python, it will improve your overall knowledge of computer science fundamentals.

学习Java的最有力依据是,它为改进面向对象的编程提供了坚实的基础。 通常以Java为基础语言教授计算机科学101课,以提供有关数据结构和算法的说明 。 因为Java提供的抽象不如Python,所以它将提高您对计算机科学基础知识的整体了解。

Bottom line: there is a net benefit for everyone — data scientists, software engineers, and the organization in which they operate — when data science is positioned squarely within the practice of computer science.

最重要的是:只要数据科学在计算机科学领域内处于立足之地,对每个人(数据科学家,软件工程师以及他们经营所在的组织)都有净收益。

Python和Java之间的核心区别是什么? (What are the core differences between Python and Java?)

Python is a dynamically typed, interpreted language. When writing Python code, the programmer doesn’t have to declare the type of each variable — the type is dynamic and changeable. At runtime, the interpret executes a Python program by translating each statement into a sequence of subroutines, then into machine code.

Python是一种动态类型的解释语言。 在编写Python代码时,程序员不必声明每个变量的类型-类型是动态的且可变的。 在运行时,解释程序通过将每个语句转换为一系列子例程,然后转换为机器代码来执行Python程序。

Java is a statically typed, compiled language. When creating a new variable in Java, the programmer is responsible for declaring the type. That type is then static — it cannot be changed. Static typing also means errors are caught in compilation, not at runtime. We’ve already talked about the role of the JVM in translating from intermediate Java bytecode (i.e., the .class file created by compilation) into machine code that is specific to the underlying computer hardware.

Java是一种静态类型的编译语言 。 在Java中创建新变量时,程序员负责声明类型。 然后,该类型是静态的-无法更改。 静态类型化还意味着错误会在编译中捕获,而不是在运行时捕获。 我们已经讨论了JVM在将中间Java字节码(即,通过编译创建的.class文件)转换为特定于基础计算机硬件的机器代码中的作用。

Etsy.Etsy 。

There are many more rules to working with Java than working with Python — both in terms of syntax (e.g. static typing) as well as package structure. For example, a Java source file can only contain one public class. The source file must be named for the public class it contains.

使用Java的规则要比使用Python的规则多得多-在语法(例如静态类型)以及包结构方面。 例如,一个Java源文件只能包含一个公共类。 必须为源文件包含的公共类命名。

For example, the program pdfParsingApp.java will look something like:

例如,程序pdfParsingApp.java将类似于:

The public static void main line must appear in every Java program (with the exception of programs that are part of a package library). This is the start of the main method that runs at execution and essentially tells the Java program what to do. Other classes within the program may provide information that is used by the main method.

public static void main 必须出现在每个Java程序中 (作为程序包库一部分的程序除外)。 这是在执行时运行的main方法的开始,实际上告诉了Java程序该怎么做。 程序中的其他类可能会提供主要方法使用的信息。

摘要 (Summary)

Learning Java is a good step toward getting comfortable with the idea that an enterprise machine learning pipeline is often composed of multiple languages. You might use technology related to Java and the JVM in order to gather data, clean data, and productionize your model. Plus, adding Java to your stack will improve your ability to write class-based, object-oriented programs.

学习Java是使企业机器学习管道通常由多种语言组成的想法的良好一步。 您可能会使用与Java和JVM相关的技术,以便收集数据,清理数据并生产模型。 另外,将Java添加到堆栈中将提高您编写基于类,面向对象的程序的能力。

Hopefully, this article gave you a bit of an intuition for how Java could be useful to your DS work. Using your familiarity with Python (or R), you can quickly build toward better understanding of Java — and better apps and model deployments.

希望本文能使您对Java如何对DS工作有用的直觉有所了解。 利用对Python(或R)的熟悉,您可以快速构建对Java的更好理解,以及更好的应用程序和模型部署。

资源资源 (Resources)

  • 6 Hours of Java Programming Tutorials by Caleb Curry

    Caleb Curry撰写的 6个小时的Java编程教程

  • Codecademy Java for building muscle memory

    Codecademy Java用于建立肌肉记忆

  • JetBrains Academy project-based approach to learning Java

    JetBrains Academy基于项目的 Java学习方法

If you enjoyed reading this article, follow me on Medium, LinkedIn, and Twitter for more ideas to advance your data science skills.

如果您喜欢阅读本文 ,请在Medium , LinkedIn和Twitter上关注我,以获取更多提高您的数据科学技能的想法。

翻译自: https://towardsdatascience.com/java-for-data-science-f64631fdda12

总结

以上是生活随笔为你收集整理的使用Java解决您的数据科学问题的全部内容,希望文章能够帮你解决所遇到的问题。

如果觉得生活随笔网站内容还不错,欢迎将生活随笔推荐给好友。