This application report shows how to implement a FTP server example on TI KeyStone II 66AK2H device. The example is built with TI Processor SDK RTOS and NDK packages. While the note explains the software porting and development procedures, it focuses on methods how to tune up the network throughput. Those tuning methods are applicable to other TI processors and common TCP/IP networks in general.
Code Composer Studio™ is a trademark of Texas Instruments.
Arm® and Cortex® are registered trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere.
All trademarks are the property of their respective owners.
The Processor Software Development Kit (Processor SDK) provides the core foundation and building blocks that facilitate application software development on TI’s embedded processors. The package consists of device and platform dependent modules, core dependent software with source code and prebuilt libraries, TI Real-Time Operating System (RTOS) kernel, utilities and application examples. The TI Processor SDK RTOS is available as a free download with all components in one installer.
The Network Development Kit (NDK) https://software-dl.ti.com/processor-sdk-rtos/esd/docs/latest/rtos/index_Foundational_Components.html#ndk is a platform for development and demonstration of network enabled RTOS applications on TI processors and includes demonstration software showcasing capabilities across a range of network enabled applications.
The NDK provides an IPv6 and IPv4 compliant TCP/IP stack working with the TI-RTOS Kernel. It primarily focuses on providing the core Layer 3 and Layer 4 stack services along with additional higher-level network applications. The NDK itself does not include any platform or device specific software. The NDK interfaces through well-defined transport interface, Network Interface Management Unit (NIMU), to the PDK and platform software elements needed for operation.
The 66AK2Hxx (a.k.a. K2H) platform combines four Arm® Cortex®-A15 processors with up to eight TMS320C66x high-performance DSPs using the KeyStone II architecture. K2H device is optimal for embedded infrastructure applications like cloud computing, media processing, high-performance computing and so forth. The K2H has a five-port Gigabit Ethernet switch, with one port to host, and four Serial Gigabit Media Independent Interface (SGMII) ports connected to the networks.
The File Transfer Protocol (FTP) is a common network protocol used for file transfer between a client and a server structure. The FTP uses TCP protocol for reliable data transfer. As seen from the TI Processor SDK RTOS user guide: https://software-dl.ti.com/processor-sdk-rtos/esd/docs/latest/rtos/index_Foundational_Components.html#id140, an FTP server example is available to some processors, but not including K2H.
As a multi-core high performance processor, it is desired to have such an example on the K2H device for customer network application reference. Porting the existing one to K2H is expected to be straightforward. The note explains how to create a Code Composer Studio™ (CCS) project, port the code and test. Moreover, network performance is very important in real life. Such discussions may spread over the TI (E2E forums) or wider Internet communities. The application report examines how to optimize the network throughput from various approaches and summarizes what we learnt from this practice.
The K2H EVM (https://www2.advantech.com/Support/TI-EVM/EVMK2HX.aspx) is used for the FTP server development and test. The EVM provides connectivity for the four SGMII Gigabit Ethernet ports:
In the test, the SGMII port 0 is used by directly connecting to a Gigabit Ethernet port of a Windows 10 PC. The software configures the K2H with a fixed IP address 192.168.1.4 by default, the Windows PC has to be on the same subnet 192.168.1.xxx.
The software installer is downloaded from https://software-dl.ti.com/processor-sdk-rtos/esd/K2HK/latest/index_FDS.html. As time of the development, the latest 06_03_00_106 release is used. The installer contains all the packages necessary for the work, except that TI Code Composer Studio (CCS) needs to be downloaded separately. The packages used in the work are listed as follows:
A reference FTP server example called NIMU_FtpExample_<board>_<core>Exampleproject is available inside pdk_k2hk_4_0_xx. As explained in the PDK Example and Test Project Creation section in https://software-dl.ti.com/processor-sdk-rtos/esd/docs/latest/rtos/index_overview.html#rebuild-pdk, the CCS project can be created using batch script:
pdkProjectCreate [soc] [board] [endian] [module] [project type] [processor] [pdkDir]
where module is nimu.
The script searches through the pdk_k2hk_4_0_xx\packages\ti\transport\ndk\nimu (refer to as NIMU_INSTALL_DIR hereafter for brevity) folder for the existence of NIMU_FtpExample_<board>_<core>Exampleproject.txt to create the CCS project. The FTP example is available for several devices, like AM335x, AM437x, K2G and so forth, but not for K2H. The K2G is another KeyStone II processor with ARM A15 and C66x cores, the K2G project can be used as a reference for K2H because of the similarity between the two processors. As an example, file NIMU_INSTALL_DIR \example\ftpApp\k2g\armv7\bios\NIMU_FtpExample_evmK2G_armExampleproject.txt shows all the files, compiler and linker setups for the CCS project on Arm A15.
The K2G device has a FTP server example and a basic hello world example, while the K2H device only has the latter. One may compare the two K2G projects for differences, then apply those to the K2H using the hello world example as the template. The note illustrates the steps on how to create the Arm A15 project. The same applies to the C66x project.
As shown below between two .txt files for K2G:
It is evident that:
Keeping those differences in mind, one can create a FTP example project file for K2H, NIMU_INSTALL_DIR\example\ftpApp\k2h\armv7\bios\NIMU_FtpExample_EVMK2H_armExampleProject.txt, using NIMU_INSTALL_DIR\example\helloWorld\k2h\armv7\bios\NIMU_emacExample_EVMK2H_armBiosExampleProject.txt as the template. The text file shall have the following contents:
-ccs.linkFile "PDK_INSTALL_PATH/ti/transport/ndk/nimu/example/src/main_k2h.c"
-ccs.linkFile "PDK_INSTALL_PATH/ti/transport/ndk/nimu/example/ftpApp/ftpserver/ftp_commands.c"
-ccs.linkFile "PDK_INSTALL_PATH/ti/transport/ndk/nimu/example/ftpApp/ftpserver/ftp_filerout.c"
-ccs.linkFile "PDK_INSTALL_PATH/ti/transport/ndk/nimu/example/ftpApp/ftpserver/ftpserver.c"
-ccs.linkFile "PDK_INSTALL_PATH/ti/transport/ndk/nimu/example/helloWorld/src/nimu_osal.c"
-ccs.linkFile "PDK_INSTALL_PATH/ti/transport/ndk/nimu/example/helloWorld/src/nimu_cppi_qmss_iface.c"
-ccs.linkFile "PDK_INSTALL_PATH/ti/transport/ndk/nimu/example/helloWorld/src/nimu_pa_iface.c"
-ccs.linkFile "PDK_INSTALL_PATH/ti/transport/ndk/nimu/example/helloWorld/src/setuprm.c"
-ccs.linkFile "PDK_INSTALL_PATH/ti/drv/rm/device/k2h/policy_dsp-only.c"
-ccs.linkFile "PDK_INSTALL_PATH/ti/drv/rm/device/k2h/policy_dsp_arm.c"
-ccs.linkFile "PDK_INSTALL_PATH/ti/drv/rm/device/k2h/global-resource-list.c"
-ccs.linkFile "PDK_INSTALL_PATH/ti/transport/ndk/nimu/example/helloWorld/k2h/armv7/bios/helloWorld_ftp.cfg"
-ccs.setLinkerOptions " -lrdimon -lgcc -lm -lnosys -l:os.aa15fg -nostartfiles -static -Wl,--gc-sections -L${NDK_INSTALL_DIR}/packages/ti/ndk/os/lib"
-ccs.setCompilerOptions "-c -mcpu=cortex-a15 -mtune=cortex-a15 -marm -mfloat-abi=hard -DSOC_K2H -DDEVICE_K2H -DNIMU_FTP_APP -D_LITTLE_ENDIAN=1 -g -gstrict-dwarf -Wall -MMD -MP -I${PDK_INSTALL_PATH} -I{NDK_INSTALL_DIR}/packages" -rtsc.enableRtsc
where:
Then, the CCS project can be created by:
pdkProjectCreate K2H all little nimu all arm
.
Finally, one can import the project into CCS and build:
An out file called NIMU_FtpExample_EVMK2H_armExampleProject.out should be built. The example runs as the FTP server with a default IP address 192.168.1.4 on the K2H EVM. The PC acts as the FTP client with an IP address 192.168.1.11 as an example. The major test steps include:
A sample test screenshot shown in Figure 3-5, showing the PC test commands and FTP throughput reported. This proves the functionality of the K2H FTP server example. Note that the server does not really host a file system on the K2H EVM, get/put commands are simply memory operations.
As seen from previous section, the EVM transmitting speed is approximately 11MB/s and receiving speed is approximately 23MB/s. There should be rooms for improvement. This section discusses methods and debugging techniques to improve the Ethernet throughput.
Test result shows that 2,560,000 bytes data is transferred in a short duration. It is good to run a longer transfer with more data for measurement accuracy.
The transmitting is controlled inside ftp_filerout.c file:
int32_t ftp_filerout_read(io_handler_t *ioh, char *path)
{
……
for (i=0; i<5000;i++)
{
send(ioh->data_socket, ioh->DataBuf, DATA_BUFFER_SIZE, 0);
Task_yield();
}
.….
}
The DATA_BUFFER_SIZE is defined as 512. This is how 2,560,000 bytes (5,000 x 512) comes from. The packets captured on the wire show the packet size is 566 bytes, as a result of 54 bytes header (14 bytes Ethernet + 20 bytes IP + 20 bytes TCP) in addition to the 512 bytes data payload. This will not fully utilize the maximum Ethernet packet size which can be up to 1,514 bytes without fragmentation. One may update the DATA_BUFFER_SIZE to 1,460 (=1,514 – 54) and use a larger for() loop, e.g. 200,000 for test. This effectively transmits 292,000,000 bytes data with Maximum Transmission Unit (MTU) size.
Task_yield() is used to yield current task to other tasks with the same priority. It is unnecessary in such a simple test, so it can be commented out to tighten the for() loop.
One thing can be done to increase the throughput is to use the No-Copy APIs inside NDK. The regular socket APIs performs a CPU copy of the data received on the wire. On the contrary, the No-Copy APIs do not do that by merely passing the data pointer. You have to manually free the received buffer from APIs. For more information, see the Enhanced No-Copy Socket Operation section in the TI Network Developer's Kit (NDK) API Reference Guide.
Checking the FTP receiving function in the same ftp_filerout.c file:
int32_t ftp_filerout_write(io_handler_t *ioh, char *path)
{
…
bytesRead = (int)recvnc(ioh->data_socket, (void**)&pBuf, 0, &hBuffer);
…
}
The code already uses No-Copy API recvnc(), so no further change is needed.
The CCS project is created with –Og, which is optimized for debugging. This can be changed to –O3:
Those are simple checks one can do very quickly, the throughput is improved to approximately 23MB/s for transmitting and receiving directions after the change.
The default TCP transmitting and receiving buffer sizes are defined in ndk_3_xx_xx_xx\packages\ti\ndk\inc\stack\inc\resif.h with 8192 bytes for devices with small memory footprint:
#define DEF_SOCK_TCPTXBUF 8192
#define DEF_SOCK_TCPRXBUF 8192
#define DEF_SOCK_TCPRXLIMIT 8192
Those buffers can be increased in the FTP server main_k2h.c code before NC_NetStart() is called:
//
// This code sets up the TCP and UDP buffer sizes
// (Note 8192 is actually the default. This code is here to
// illustrate how the buffer and limit sizes are configured.)
//
// TCP Transmit buffer size
rc = 65536;
CfgAddEntry( hCfg, CFGTAG_IP, CFGITEM_IP_SOCKTCPTXBUF,
CFG_ADDMODE_UNIQUE, sizeof(uint32_t), (uint8_t *)&rc, 0 );
// TCP Receive buffer size (copy mode)
CfgAddEntry( hCfg, CFGTAG_IP, CFGITEM_IP_SOCKTCPRXBUF,
CFG_ADDMODE_UNIQUE, sizeof(uint32_t), (uint8_t *)&rc, 0 );
// TCP Receive limit (non-copy mode)
CfgAddEntry( hCfg, CFGTAG_IP, CFGITEM_IP_SOCKTCPRXLIMIT,
CFG_ADDMODE_UNIQUE, sizeof(uint32_t), (uint8_t *)&rc, 0 );
65,536 bytes is the maximum size for a TCP buffer. However, with such changes, you may see the FTP connection is refused while ping to the EVM is still responsive.
The most common reason is memory allocation failure. To check this, one can add print statement for failure code in socket() and accept() API calls inside ftpserver.c:
if ((listen_sock = socket (AF_INET, SOCK_STREAM, IPPROTO_TCP)) == INVALID_SOCKET)
{
printf("Failed to open listen socket with error %d\n", fdError());
}
if( (client_sock = accept(listen_sock, (struct sockaddr*)&client_addr, &len)) == INVALID_SOCKET)
{
/* Timeout, Wait a short time and try again */
printf("Failed to accept with error %d\n", fdError());
…
}
As expected, the fdError() returns NDK_ENOMEM (12) in one of the error conditions. So the NDK memory needs to be increased. If another number returns, one can translate that number with ndk_3_xx_xx_xx\packages\ti\ndk\inc\serrno.h.
TCP buffers are directly allocated from heap, while all other NDK modules allocate buffers from NDK memory table, which is also allocated from heap. The heap size is controlled by BIOS.heapSize inside the .cfg file, increasing it from 0x40000 to 0x100000 resolves the FTP connection failure issue. The ROV view provides a snapshot of heap memory usage. The Ethernet throughput is improved to 50-60MB/s after increasing the TCP buffer sizes.
Before moving deeper into network setting and NIMU driver tuning, it is good to understand if the throughput is limited by the CPU load, which is probably 100% busy? Such can be answered by using Unified Instrumentation Architecture (UIA) module for CPU load measurement. The major steps include:
BIOS.cpuFreq.lo = 1000000000; /*1GHz CPU*/
var LoggingSetup = xdc.useModule('ti.uia.sysbios.LoggingSetup');
LoggingSetup.loadLogging = true;
LoggingSetup.loadLoggerSize = 32768;
LoggingSetup.mainLoggerSize = 32768;
LoggingSetup.sysbiosLoggerSize = 32768;
LoggingSetup.sysbiosSemaphoreLogging = true;
LoggingSetup.loadTaskLogging = true;
LoggingSetup.loadSwiLogging = true;
LoggingSetup.loadHwiLogging = true;
LoggingSetup.enableTaskProfiler = true;
LoggingSetup.eventUploadMode = LoggingSetup.UploadMode_JTAGSTOPMODE;
Wireshark is a good tool to capture and analyze packet traffic for further insight. A usage example is illustrated below for a long data transfer from the K2H EVM to PC with throughput 45MB/s.
Through TCP stream analysis, it shows the TCP window size is always 64KB, no scaling happened.
The stream analysis also shows Round Trip Time (RTT) is about 0.60 millisecond mostly:
With this RTT and TCP window size, the transfer supports: 64KB/0.6 ms = 107MB/s throughput. The TCP window size is big enough even without scaling, this is not a factor limiting the throughput. Nevertheless, it is worthy to check and make sure the TCP window scaling is enabled on the Windows PC side, by using command: netsh interface tcp show global. A larger TCP window helps to improve the throughput especially in high RTT scenario.
As shown from the packet capture, the data transfer size is 1514 bytes from the K2H EVM to the PC. The PC sends an ACK packet for every 2920 bytes (2 packets) received by examining the successive ACK packets’ acknowledge number field:
The frequent reception of TCP ACK packets drives up the CPU usage during receiving interrupt handling. The way to reduce the interrupt generation and hence the interrupt handling is called interrupt coalescing. It prevents interrupts from being raised to a CPU until a specific amount of events are pending. The CPU is relaxed and the throughput can be potentially increased with fewer interrupts.
Interrupt coalescing can be explored on either side, or both sides. From the Windows 10 PC side, if the generated TCP ACK packets can be suppressed to less frequent, then the K2H CPU saves the power for receiving interrupt handling. The interrupt coalescing is configured in the Windows 10 registry under Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces{Interface_name}, one can use regedit utility to add two additional fields: TcpAckFrequency and TcpDelAckTicks. An ACK packet is transmitted every receiving number of TcpAckFrequency packets or when TcpDelAckTicks (in unit of 100 millisecond) delay timer expires, whichever happens first. An example is depicted below:
Those numbers are based on trials and the PC needs to reboot for the setting to take into effect. With TcpAckFrequency = 10 and TcpDelAckTicks = 1, the ACK is now transmitted for every 10 (from 2) received packets:
In the K2H device, Packet Accelerator (PA) is one of the main components of the network coprocessor peripheral. The PA is a firmware that works with the Gigabit Ethernet switch subsystem to perform packet processing operations such as packet header classification, checksum generation, and multi-queue routing. On the other hand, the NDK software supports both Layer 3 and Layer 4 checksums calculation. If this can be offloaded to the K2H’s PA, then the CPU processing power can be saved.
Current NDK design abstracts at NIMU layer. NDK is device independent without the knowledge of underlying hardware capabilities. In the receiving path, the PA does the packet classification at Layer 2 then handles the packets to the NDK stack. In the transmitting path, the NDK forms the packets and pass them to the NIMU driver. All the Layer 3 and Layer 4 are handled by the NDK software, the checksum calculation can’t be offloaded to the PA firmware.
The NIMU driver source code is NIMU_INSTALL_DIR\src\v2\nimu_eth.c. The transmitting function is EmacSend() and the receiving function is EmacRxPktISR(). There is a debug macro TIMING can be enabled to profile the CPU cycles spent in transmitting and receiving routines. The timestamp function is nimuUtilReadTime32(). One may further expand this debug to record additional cycle count information for each part of the code, and judge where efforts probably can be spent for driver code optimization.
For example, the transmitting packet number counter, the delta from last time EmacSend() is entered and total time spent in the EmacSend() are saved in a circular debug buffer for further analysis, as shown below. The EmacSend() function is found to take about 1550 to 2950 cycles (the third column).
By looking into the implementation of EmacSend(), the driver does transmitting return queue processing (in a loop over number of descriptors in gTxReturnQHnd), it pops the gTxReturnQHnd and it returns PBM packet back to the NDK stack, and pushes transmitting descriptor to the transmitting free queue, in addition to actually sending the packet out. After some code optimization, the cycle is reduced to about 1600 in the same routine, saving up to 45%.
Note that the NIMU library needs to be rebuilt and linked with the application for the changes taking into effect.